Protos

This library depends on Google’s Protocol Buffers, also known as protobuf, which provides a convenient way to define classes that represent structured data. Special classes henceforth referred to as protobuf messages, or protos for short, can be defined in .proto files. A special compiler, protoc, is automatically called by the library to generate C++ and/or Python classes for each message. The protobuf runtime library provides fast serialization of messages into bytes, which can be used to save objects to disk or pass serialized objects from one language to another.

A description of all protos used in bayesmix follows. These range from simple enumerator identifiers (enums) and basic data types such as vectors or matrices, to objects representing probability distributions, hyperpriors, states, or hyperparameter values. Some of these protos are embedded in one another, possibly using the oneof keyword, which allows the outer proto to flexibly choose and contain one type of object among many different ones. For instance, this is the case with protos representing hyperpriors, which can have increasing degrees of complexity depending on which model is chosen by the user.
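
To make the oneof behaviour concrete, here is a minimal pure-Python sketch of its semantics: a group of alternative fields of which at most one is set at a time. The `OneofGroup` class is a hypothetical toy written for illustration, not the protobuf API or bayesmix code. The field names are borrowed from the NNIGPrior message described later, assuming its three fields are alternatives within a single oneof.

```python
from typing import Any, Optional


class OneofGroup:
    """Toy model of protobuf `oneof` semantics (not the real protobuf API)."""

    def __init__(self, *field_names: str) -> None:
        self._allowed = set(field_names)
        self._case: Optional[str] = None
        self._value: Any = None

    def set(self, name: str, value: Any) -> None:
        if name not in self._allowed:
            raise KeyError(f"{name!r} is not a member of this oneof")
        # Setting one member implicitly clears the previously set member.
        self._case, self._value = name, value

    def which(self) -> Optional[str]:
        """Name of the member currently set, or None if none is set."""
        return self._case

    def get(self, name: str) -> Any:
        """Value of `name` if it is the active member, else None."""
        return self._value if name == self._case else None


# Illustrative values only: first choose fixed hyperparameters,
# then switch to a prior on the mean, which replaces the first choice.
prior = OneofGroup("fixed_values", "normal_mean_prior", "ngg_prior")
prior.set("fixed_values", {"mean": 0.0, "var_scaling": 0.1, "shape": 2.0, "scale": 2.0})
prior.set("normal_mean_prior", {"var_scaling": 0.1, "shape": 2.0, "scale": 2.0})
```

After the second call, `prior.which()` reports `normal_mean_prior` and the earlier `fixed_values` entry is gone, which is how an outer proto ends up holding exactly one of several possible inner objects.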

The use of protos allows easy interface between multiple programming languages, as well as a posteriori analysis of MCMC chains.

Protocol Documentation

algorithm_id.proto

AlgorithmId

Enum for the different types of algorithms.

References

[1] R. M. Neal, Markov Chain Sampling Methods for Dirichlet Process Mixture Models. JCGS (2000)

[2] H. Ishwaran and L. F. James, Gibbs Sampling Methods for Stick-Breaking Priors. JASA (2001)

[3] S. Jain and R. M. Neal, A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model. JCGS (2004)

[4] M. Kalli, J. Griffin and S. G. Walker, Slice sampling mixture models. Stat and Comp. (2011)

Name Number Description
UNKNOWN_ALGORITHM 0

Neal2 1

Neal's Algorithm 2, see [1]

Neal3 2

Neal's Algorithm 3, see [1]

Neal8 3

Neal's Algorithm 8, see [1]

BlockedGibbs 4

Ishwaran and James Blocked Gibbs, see [2]

SplitMerge 5

Jain and Neal's Split&Merge, see [3]. NOT IMPLEMENTED YET!

Slice 6

Slice sampling, see [4]. NOT IMPLEMENTED YET!

algorithm_params.proto

AlgorithmParams

Parameters used in the BaseAlgorithm class and its derived classes.

Field Type Label Description
algo_id string

Id of the algorithm. Must match one of the names in the AlgorithmId enum

rng_seed uint32

Seed for the random number generator

iterations uint32

Total number of iterations of the MCMC chain

burnin uint32

Number of iterations to discard as burn-in

init_num_clusters uint32

Number of clusters used to initialize the algorithm. Conditional mixings with a fixed number of components (e.g. TruncatedSBMixing) may override it, in which case this value is ignored.

neal8_n_aux uint32

Number of auxiliary unique values for the Neal8 algorithm

splitmerge_n_restr_gs_updates uint32

Number of restricted GS scans for each MH step.

splitmerge_n_mh_updates uint32

Number of MH updates for each iteration of Split and Merge algorithm.

splitmerge_n_full_gs_updates uint32

Number of full GS scans for each iteration of Split and Merge algorithm.
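
As a rough illustration of how these fields relate, the sketch below checks an `algo_id` against the names in the AlgorithmId enum and computes how many iterations remain after burn-in. The helper names are hypothetical and the rule that `burnin` must be smaller than `iterations` is an assumption made here; the library's own validation may differ.

```python
# Names and numbers copied from the AlgorithmId enum table.
ALGORITHM_IDS = {
    "Neal2": 1,
    "Neal3": 2,
    "Neal8": 3,
    "BlockedGibbs": 4,
    "SplitMerge": 5,
    "Slice": 6,
}


def check_algorithm_params(algo_id, iterations, burnin):
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if algo_id not in ALGORITHM_IDS:
        problems.append(f"unknown algo_id: {algo_id!r}")
    # Assumption: at least one iteration should survive the burn-in phase.
    if burnin >= iterations:
        problems.append("burnin must be smaller than iterations")
    return problems


def retained_iterations(iterations, burnin):
    """Iterations left after discarding the burn-in phase."""
    return iterations - burnin
```

For example, 1000 iterations with a burn-in of 200 leave 800 retained iterations.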

algorithm_state.proto

AlgorithmState

This message represents the state of a Gibbs sampler for a mixture model. All algorithms must be able to handle this message by filling it with the current state of the sampler in the `get_state_as_proto` method.

Field Type Label Description
cluster_states AlgorithmState.ClusterState repeated

The state of each cluster

cluster_allocs int32 repeated

Vector of allocations into clusters, one for each observation

mixing_state MixingState

The state of the `Mixing`

iteration_num int32

The iteration number

hierarchy_hypers AlgorithmState.HierarchyHypers

The current values of the hyperparameters of the hierarchy

AlgorithmState.ClusterState

Field Type Label Description
uni_ls_state UniLSState

State of a univariate location-scale family

multi_ls_state MultiLSState

State of a multivariate location-scale family

lin_reg_uni_ls_state LinRegUniLSState

State of a linear regression univariate location-scale family

general_state Vector

Just a vector of doubles

fa_state FAState

State of a Mixture of Factor Analysers

cardinality int32

How many observations are in this cluster
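
The `cardinality` of each cluster can be recovered from the `cluster_allocs` vector of the enclosing AlgorithmState. The following sketch shows that relationship, assuming allocations are zero-based indices into `cluster_states`; the function name is hypothetical.

```python
from collections import Counter


def cluster_cardinalities(cluster_allocs, num_clusters):
    """Count how many observations are allocated to each cluster index."""
    counts = Counter(cluster_allocs)
    # Clusters with no allocated observation get a count of zero.
    return [counts.get(k, 0) for k in range(num_clusters)]
```

With allocations `[0, 1, 1, 2, 0, 1]` and three clusters, the counts are 2, 3 and 1.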

AlgorithmState.HierarchyHypers

Field Type Label Description
general_state Vector

nnig_state NIGDistribution

nnw_state NWDistribution

lin_reg_uni_state MultiNormalIGDistribution

nnxig_state NxIGDistribution

fa_state FAPriorDistribution

distribution.proto

BetaDistribution

Parameters defining a beta distribution

Field Type Label Description
shape_a double

shape_b double

GammaDistribution

Parameters defining a gamma distribution with density

f(x) = rate^shape * x^(shape-1) * exp(-rate * x) / Gamma(shape)

Field Type Label Description
shape double

rate double
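
The shape/rate parameterization can be checked numerically with a short sketch. The function below evaluates the gamma density, including the rate^shape normalizing constant, on the log scale for stability; its name is hypothetical and it is not part of the library.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution parameterized by shape and rate."""
    if x <= 0:
        return 0.0
    log_pdf = (
        shape * math.log(rate)
        + (shape - 1.0) * math.log(x)
        - rate * x
        - math.lgamma(shape)
    )
    return math.exp(log_pdf)
```

With `shape = 1` the density reduces to the exponential density `rate * exp(-rate * x)`, which is a convenient sanity check.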

InvWishartDistribution

Parameters defining an Inverse Wishart distribution

Field Type Label Description
deg_free double

scale Matrix

MultiNormalDistribution

Parameters defining a multivariate normal distribution

Field Type Label Description
mean Vector

var Matrix

MultiNormalIGDistribution

Parameters for the Normal Inverse Gamma distribution commonly employed in linear regression models, with density

f(beta, var) = N(beta | mean, var * var_scaling^{-1}) * IG(var | shape, scale)

Field Type Label Description
mean Vector

var_scaling Matrix

shape double

scale double

NIGDistribution

Parameters of a Normal Inverse-Gamma distribution with density

f(x, y) = N(x | mean, y / var_scaling) * IG(y | shape, scale)

Field Type Label Description
mean double

var_scaling double

shape double

scale double
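
The factorization of the Normal Inverse-Gamma density can be written out directly. The sketch below evaluates its logarithm from the four fields above, assuming the second argument of N is a variance and that IG uses the shape/scale parameterization with density proportional to y^(-shape-1) * exp(-scale / y). All function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def inv_gamma_logpdf(y, shape, scale):
    """Log density of an inverse gamma with shape and scale parameters."""
    return (
        shape * math.log(scale)
        - math.lgamma(shape)
        - (shape + 1.0) * math.log(y)
        - scale / y
    )


def nig_logpdf(x, y, mean, var_scaling, shape, scale):
    """log f(x, y) = log N(x | mean, y / var_scaling) + log IG(y | shape, scale)."""
    return normal_logpdf(x, mean, y / var_scaling) + inv_gamma_logpdf(y, shape, scale)
```

A larger `var_scaling` shrinks the conditional variance of x around `mean`, so it acts like a prior sample size for the location.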

NWDistribution

Parameters of a Normal Wishart distribution with density

f(x, y) = N(x | mean, (y * var_scaling)^{-1}) * IW(y | deg_free, scale)

where x is a vector and y is a symmetric positive definite (spd) matrix

Field Type Label Description
mean Vector

var_scaling double

deg_free double

scale Matrix

NxIGDistribution

Parameters of a Normal x Inverse-Gamma distribution with density

f(x, y) = N(x | mean, var) * IG(y | shape, scale)

Field Type Label Description
mean double

var double

shape double

scale double

UniNormalDistribution

Parameters defining a univariate normal distribution

Field Type Label Description
mean double

var double

hierarchy_id.proto

HierarchyId

Enum for the different types of Hierarchy.

Name Number Description
UNKNOWN_HIERARCHY 0

NNIG 1

Normal - Normal Inverse Gamma

NNW 2

Normal - Normal Wishart

LinRegUni 3

Linear Regression (univariate response)

LapNIG 4

Laplace - Normal Inverse Gamma

FA 5

Factor Analysers

NNxIG 6

Normal - Normal x Inverse Gamma

hierarchy_prior.proto

EmptyPrior

Field Type Label Description
fake_field double

FAPrior

Field Type Label Description
fixed_values FAPriorDistribution

FAPriorDistribution

Field Type Label Description
mutilde Vector

beta Vector

phi double

alpha0 double

q uint32

LapNIGPrior

Field Type Label Description
fixed_values LapNIGState

LapNIGState

Prior for the parameters of the base measure in a Laplace - Normal Inverse Gamma hierarchy

Field Type Label Description
mean double

var double

shape double

scale double

mh_mean_var double

mh_log_scale_var double

LinRegUniPrior

Prior for the parameters of the base measure in a Normal mixture model with a covariate-dependent location.

Field Type Label Description
fixed_values MultiNormalIGDistribution

NNIGPrior

Prior for the parameters of the base measure in a Normal-Normal Inverse Gamma hierarchy

Field Type Label Description
fixed_values NIGDistribution

no prior, just fixed values

normal_mean_prior NNIGPrior.NormalMeanPrior

prior on the mean

ngg_prior NNIGPrior.NGGPrior

prior on the mean, var_scaling, and scale

NNIGPrior.NGGPrior

Field Type Label Description
mean_prior UniNormalDistribution

var_scaling_prior GammaDistribution

shape double

scale_prior GammaDistribution

NNIGPrior.NormalMeanPrior

Field Type Label Description
mean_prior UniNormalDistribution

var_scaling double

shape double

scale double

NNWPrior

Prior for the parameters of the base measure in a Normal-Normal Wishart hierarchy

Field Type Label Description
fixed_values NWDistribution

no prior, just fixed values

normal_mean_prior NNWPrior.NormalMeanPrior

prior on the mean

ngiw_prior NNWPrior.NGIWPrior

prior on the mean, var_scaling, and scale

NNWPrior.NGIWPrior

Field Type Label Description
mean_prior MultiNormalDistribution

var_scaling_prior GammaDistribution

deg_free double

scale_prior InvWishartDistribution

NNWPrior.NormalMeanPrior

Field Type Label Description
mean_prior MultiNormalDistribution

var_scaling double

deg_free double

scale Matrix

NNxIGPrior

Prior for the parameters of the base measure in a Normal-Normal x Inverse Gamma hierarchy

Field Type Label Description
fixed_values NxIGDistribution

no prior, just fixed values

ls_state.proto

FAState

Field Type Label Description
mu Vector

psi Vector

eta Matrix

lambda Matrix

LinRegUniLSState

Parameters of a univariate linear regression

Field Type Label Description
regression_coeffs Vector

regression coefficients

var double

variance of the noise

MultiLSState

Parameters of a multivariate location-scale family of distributions, parameterized by mean and precision (inverse of variance). For convenience, we also store the Cholesky factor of the precision matrix.

Field Type Label Description
mean Vector

prec Matrix

prec_chol Matrix
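
Storing the Cholesky factor is convenient because both the log-determinant and the quadratic form of a Gaussian density follow from it without inverting anything. The sketch below illustrates this in pure Python, assuming `prec_chol` is the lower-triangular factor L with prec = L * L^T; whether the library stores the lower or the upper factor is not stated here, and the function names are hypothetical.

```python
import math


def cholesky_lower(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_logpdf_prec_chol(x, mean, prec_chol):
    """Log density of N(x | mean, prec^{-1}) given the lower factor of prec."""
    d = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # z = L^T (x - mean), so that z . z equals the quadratic form.
    z = [sum(prec_chol[i][j] * diff[i] for i in range(j, d)) for j in range(d)]
    # Half the log-determinant of prec is the sum of log diagonal entries of L.
    half_log_det = sum(math.log(prec_chol[i][i]) for i in range(d))
    return -0.5 * d * math.log(2.0 * math.pi) + half_log_det - 0.5 * sum(v * v for v in z)
```

A typical use would compute `cholesky_lower(prec)` once when the state changes and reuse it for every likelihood evaluation.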

UniLSState

Parameters of a univariate location-scale family of distributions.

Field Type Label Description
mean double

var double

matrix.proto

Matrix

Message representing a matrix of doubles.

Field Type Label Description
rows int32

number of rows

cols int32

number of columns

data double repeated

matrix elements

rowmajor bool

if true, the data is read in row-major order
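
The role of `rowmajor` is easiest to see when the flat `data` array is unpacked into rows. The sketch below does this with plain lists, assuming that a false `rowmajor` means column-major storage; the function name is hypothetical.

```python
def unpack_matrix(rows, cols, data, rowmajor):
    """Rebuild a rows x cols nested list from the flat `data` field."""
    if len(data) != rows * cols:
        raise ValueError("data length does not match rows * cols")
    if rowmajor:
        # Consecutive elements fill one row before moving to the next.
        return [list(data[r * cols:(r + 1) * cols]) for r in range(rows)]
    # Assumed column-major: consecutive elements fill one column at a time.
    return [[data[c * rows + r] for c in range(cols)] for r in range(rows)]
```

The same six numbers `[1, 2, 3, 4, 5, 6]` with `rows = 2` and `cols = 3` give rows `[1, 2, 3]` and `[4, 5, 6]` in row-major order, but `[1, 3, 5]` and `[2, 4, 6]` in column-major order.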

Vector

Message representing a vector of doubles.

Field Type Label Description
size int32

number of elements in the vector

data double repeated

vector elements

mixing_id.proto

MixingId

Enum for the different types of Mixing.

Name Number Description
UNKNOWN_MIXING 0

DP 1

Dirichlet Process

PY 2

Pitman-Yor Process

LogSB 3

Logit Stick-Breaking Process

TruncSB 4

Truncated Stick-Breaking Process

MFM 5

Mixture of finite mixtures

mixing_prior.proto

DPPrior

Prior for the concentration parameter of a Dirichlet process

Field Type Label Description
fixed_value DPState

No prior, just a fixed value

gamma_prior DPPrior.GammaPrior

Gamma prior on the total mass

DPPrior.GammaPrior

Field Type Label Description
totalmass_prior GammaDistribution

LogSBPrior

Definition of the parameters of a Logit Stick-Breaking process.

Field Type Label Description
normal_prior MultiNormalDistribution

Normal prior on the regression coefficients

step_size double

Step size for the MALA algorithm used for posterior inference (TODO: move?)

num_components uint32

Number of components in the process

MFMPrior

Prior for the Poisson rate and Dirichlet parameters of an MFM (Finite Dirichlet) process.

For the moment, we only support fixed values

Field Type Label Description
fixed_value MFMState

No prior, just a fixed value

PYPrior

Prior for the strength and discount parameters of a Pitman-Yor process.

For the moment, we only support fixed values

Field Type Label Description
fixed_values PYState

TruncSBPrior

Definition of the parameters of a truncated Stick-Breaking process

Field Type Label Description
beta_priors TruncSBPrior.BetaPriors

General stick-breaking distributions

dp_prior TruncSBPrior.DPPrior

Truncated Dirichlet process

py_prior TruncSBPrior.PYPrior

Truncated Pitman-Yor process

mfm_prior TruncSBPrior.MFMPrior

num_components uint32

Number of components in the process

TruncSBPrior.BetaPriors

Field Type Label Description
beta_distributions BetaDistribution repeated

General stick-breaking distributions

TruncSBPrior.DPPrior

Field Type Label Description
totalmass double

Truncated Dirichlet process

TruncSBPrior.MFMPrior

Field Type Label Description
totalmass double

Truncated Dirichlet process

TruncSBPrior.PYPrior

Field Type Label Description
strength double

Truncated Pitman-Yor process

discount double

mixing_state.proto

DPState

State of a Dirichlet process

Field Type Label Description
totalmass double

the total mass of the DP

LogSBState

State of a Logit Stick-Breaking process

Field Type Label Description
regression_coeffs Matrix

Num_Components x Num_Features matrix. Each row is the regression coefficients for a component.

MFMState

State of an MFM (Finite Dirichlet) process

Field Type Label Description
lambda double

rate parameter of the Poisson prior on the number of components of the MFM

gamma double

parameter of the Dirichlet distribution for the mixing weights

MixingState

Wrapper of all possible mixing states into a single oneof

Field Type Label Description
dp_state DPState

py_state PYState

log_sb_state LogSBState

trunc_sb_state TruncSBState

mfm_state MFMState

PYState

State of a Pitman-Yor process

Field Type Label Description
strength double

discount double

TruncSBState

State of a truncated stick-breaking process. For convenience, we also store the logarithm of the weights

Field Type Label Description
sticks Vector

logweights Vector
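
The link between `sticks` and `logweights` can be sketched with the standard stick-breaking construction, in which the k-th weight is the k-th stick times the product of one minus each earlier stick. This assumes every stick lies in (0, 1] and that the last stick equals 1 so the weights sum to one; whether the library stores its sticks in exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def log_weights_from_sticks(sticks):
    """Log mixture weights implied by stick-breaking proportions in (0, 1]."""
    logweights = []
    log_remaining = 0.0  # log of the stick length not yet assigned
    for v in sticks:
        logweights.append(math.log(v) + log_remaining)
        # A stick equal to 1 uses up everything that is left.
        log_remaining += math.log1p(-v) if v < 1.0 else float("-inf")
    return logweights
```

Working on the log scale avoids underflow when many sticks are multiplied together, which is one reason to keep the logarithm of the weights alongside the sticks.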

mixture_model.proto

HierarchyPrior

Field Type Label Description
nnig_prior NNIGPrior

lapnig_prior LapNIGPrior

nnw_prior NNWPrior

lin_reg_prior LinRegUniPrior

fa_prior FAPrior

MixingPrior

Field Type Label Description
dp_prior DPPrior

py_prior PYPrior

log_sb_prior LogSBPrior

trunc_sb_prior TruncSBPrior

MixtureModel

Field Type Label Description
mixing MixingId

hierarchy HierarchyId

mixing_prior MixingPrior

hierarchy_prior HierarchyPrior

semihdp.proto

SemiHdpParams

Field Type Label Description
pseudo_prior SemiHdpParams.PseudoPriorParams

dirichlet_concentration double

rest_allocs_update string

Either "full", "metro_base", "metro_dist"

totalmass_rest double

totalmass_hdp double

w_prior SemiHdpParams.WPriorParams

SemiHdpParams.PseudoPriorParams

Field Type Label Description
card_weight double

mean_perturb_sd double

var_perturb_frac double

SemiHdpParams.WPriorParams

Field Type Label Description
shape1 double

shape2 double

SemiHdpState

Field Type Label Description
restaurants SemiHdpState.RestaurantState repeated

groups SemiHdpState.GroupState repeated

taus SemiHdpState.ClusterState repeated

c int32 repeated

w double

SemiHdpState.ClusterState

Field Type Label Description
uni_ls_state UniLSState

multi_ls_state MultiLSState

lin_reg_uni_ls_state LinRegUniLSState

cardinality int32

SemiHdpState.GroupState

Field Type Label Description
cluster_allocs int32 repeated

SemiHdpState.RestaurantState

Field Type Label Description
theta_stars SemiHdpState.ClusterState repeated

n_by_clus int32 repeated

table_to_shared int32 repeated

table_to_idio int32 repeated

Scalar Value Types

.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby
double | | double | double | float | float64 | double | float | Float
float | | float | float | float | float32 | float | float | Float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required)
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required)
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required)
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8)
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT)
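
The notes on int32 and sint32 can be made concrete by counting encoded bytes. The sketch below follows the protobuf wire-format rules: a varint carries 7 payload bits per byte, a negative int32 is sign-extended to 64 bits before encoding, and sint32 applies ZigZag mapping first. The helper names are hypothetical and only sizes are computed, not the bytes themselves.

```python
def varint_size(value):
    """Number of bytes needed to varint-encode a non-negative integer."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_wire_size(n):
    """Encoded size of an int32 value (negatives are sign-extended to 64 bits)."""
    return varint_size(n & 0xFFFFFFFFFFFFFFFF)


def sint32_wire_size(n):
    """Encoded size of a sint32 value, which is ZigZag-mapped before encoding."""
    zigzag = ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF
    return varint_size(zigzag)
```

Under these rules the value -1 takes ten bytes as an int32 but a single byte as a sint32, while small positive values cost one byte either way.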