A BIVARIATE DENSITY ESTIMATION PROBLEM

As a second illustration of the mixture model software, I generated
bivariate real data from a mixture of two component distributions,
with probabilities 0.3 and 0.7.  These two distributions were not
exactly Gaussian, and the two real variables were not exactly
independent within one of these components.  Accordingly, modeling
this data well with a mixture of Gaussian distributions will require
more than two components in the mixture.  

For the exact distribution used, see the source of the generation
program, in rgen.c.  I generated 1000 cases with this program, stored
in 'rdata', of which the first 500 are used for training (the others
are not used for anything at the moment).


A two-component mixture model for the density estimation problem.

We can first see what happens when we model this data with a mixture
of two Gaussians - even though we know the data cannot be perfectly
modeled in this way.  We specify this two-component model using the
'mix-spec' and 'model-spec' commands, as follows:

    > mix-spec rlog.2 0 2 2 / 1 0.05:0.5:0.2 10
    > model-spec rlog.2 real 0.05:0.5:0.5:1

The 'mix-spec' command creates the log file "rlog.2".  The arguments
following the log file name are the number of input attributes in a
case (always 0 at present), the number of target attributes (2 for
this bivariate problem), and the number of mixture components to use
(2 for this model).  

The Dirichlet concentration parameter follows the "/".  In this model,
its value is 1 (unscaled, since there's no 'x'), which produces a
uniform prior for the single number determining the probabilities of
the two components.
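
To see why this prior is uniform, note that with two components the
Dirichlet distribution reduces to a Beta distribution for the
probability of the first component, and Beta(1,1) is uniform on (0,1).
The Python sketch below is only an illustration, not part of the
software.  The function name is hypothetical, and the sketch assumes
that an unscaled concentration of 1 means each component's Dirichlet
parameter is 1.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one probability vector from a Dirichlet distribution
    by normalizing independent Gamma(param, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(1)

# Two components, each with Dirichlet parameter 1 (the assumed meaning
# of an unscaled concentration of 1).
first_probs = [sample_dirichlet([1.0, 1.0], rng)[0] for _ in range(20000)]

# For a uniform distribution on (0,1), the mean should be near 0.5 and
# about a quarter of the draws should fall below 0.25.
mean_p = sum(first_probs) / len(first_probs)
frac_below_quarter = sum(p < 0.25 for p in first_probs) / len(first_probs)
print(mean_p, frac_below_quarter)
```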

The "offset" parameters of the two components represent the Gaussian
means when modeling real data.  Hyperparameters determine the prior
means and standard deviations of these offsets (separately for the two
target attributes); priors for these hyperparameters are specified in
'mix-spec'.  In the above command, the prior for the mean of an offset
is Gaussian with standard deviation 10 (the last argument).  The
standard deviations for the offsets are given a hierarchical prior,
with a higher-level hyperparameter common to both the lower-level
standard deviations.  The top-level precision (standard deviation to
the power -2) is given a Gamma prior with mean 0.05 and shape
parameter 0.5; the precisions for the lower-level hyperparameters have
Gamma priors with mean given by the higher-level precision, and shape
parameter 0.2.  This is all specified by the second-to-last argument
of 'mix-spec'.

A similar hierarchical scheme is used for the "noise" standard
deviations (the standard deviations of the Gaussian distributions in
the mixture), except that this scheme has three levels - a top-level
hyperparameter, a hyperparameter for each target attribute, and
hyperparameters for each target for each component.  The 'model-spec'
command gives the top-level mean, and the shape parameters for the
Gamma priors going down the hierarchy.
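
The structure of these hierarchical priors can be sketched in Python
by sampling precisions level by level, with each precision serving as
the mean of the Gamma priors one level down.  This is an illustration
only, with hypothetical function names.  It uses the conventional
Gamma shape parameter and takes the top-level mean on the precision
scale, as in the description above; how the numbers in a specification
such as 0.05:0.5:0.2 map onto these quantities in the software should
be checked against its documentation on prior specifications.

```python
import random

def gamma_with_mean(mean, shape, rng):
    """Draw from a Gamma distribution with the given mean and
    conventional shape parameter (so the scale is mean / shape)."""
    return rng.gammavariate(shape, mean / shape)

def sample_precision_hierarchy(top_mean, shapes, fanout, rng):
    """Sample precisions level by level.

    shapes[0] is the shape for the top level, shapes[1] for the next
    level down, and so on.  fanout[i] is the number of children of
    each hyperparameter at level i.  Returns a list of levels, each a
    flat list of precisions."""
    levels = [[gamma_with_mean(top_mean, shapes[0], rng)]]
    for shape, n_children in zip(shapes[1:], fanout):
        parents = levels[-1]
        levels.append([gamma_with_mean(p, shape, rng)
                       for p in parents for _ in range(n_children)])
    return levels

rng = random.Random(3)

# Two levels for the offsets: one top-level precision, then one
# precision for each of the 2 target attributes.
offsets = sample_precision_hierarchy(0.05, [0.5, 0.2], [2], rng)

# Three levels for the noise: top level, one per target attribute,
# then one per target for each of the 2 components.
noise = sample_precision_hierarchy(0.05, [0.5, 0.5, 1.0], [2, 2], rng)

print(offsets)
print(noise)
```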

We next specify where the data comes from, with 'data-spec':

    > data-spec rlog.2 0 2 / rdata@1:500 . 

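This says that there are 0 input attributes and 2 target attributes,
and that the training cases are taken from lines 1 to 500 of 'rdata'.
The "." indicates that the targets come from the same file as the
inputs.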

For this finite model, we can specify that all the Markov chain
updates should be done with Gibbs sampling, as follows:

    > mc-spec rlog.2 repeat 20 gibbs-indicators gibbs-params gibbs-hypers

The "repeat 20" causes the three operations to be performed 20 times
within each iteration, which reduces the volume of data stored in the
log file.
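
As a rough sketch of what a Gibbs sampling update for the component
indicators involves, the Python code below resamples each indicator
from its conditional distribution given the other indicators and the
current component parameters.  It is a textbook formulation, not the
software's actual implementation.  The function names are
hypothetical, and the sketch assumes that the mixing proportions have
been integrated out and that the targets are independent Gaussians
within a component.

```python
import math
import random

def normal_logpdf(y, mean, sd):
    """Log density of a univariate Gaussian."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sd)
            - 0.5 * ((y - mean) / sd) ** 2)

def gibbs_update_indicators(data, indicators, means, sds, conc, rng):
    """One Gibbs sampling sweep over the indicators, done in place.

    data[i] is the list of targets for case i; means[k] and sds[k] are
    per-attribute lists for component k; conc is the Dirichlet
    parameter for each component."""
    n_comp = len(means)
    counts = [indicators.count(k) for k in range(n_comp)]
    for i, y in enumerate(data):
        counts[indicators[i]] -= 1      # take case i out of its component
        log_w = []
        for k in range(n_comp):
            log_lik = sum(normal_logpdf(v, m, s)
                          for v, m, s in zip(y, means[k], sds[k]))
            log_w.append(math.log(counts[k] + conc) + log_lik)
        top = max(log_w)                # subtract the max to avoid underflow
        weights = [math.exp(lw - top) for lw in log_w]
        indicators[i] = rng.choices(range(n_comp), weights=weights)[0]
        counts[indicators[i]] += 1
    return indicators

# Small example with two well-separated groups of cases.
data = [[-5.0, -5.0]] * 10 + [[5.0, 5.0]] * 10
indicators = [0] * len(data)
means = [[-5.0, -5.0], [5.0, 5.0]]
sds = [[1.0, 1.0], [1.0, 1.0]]
gibbs_update_indicators(data, indicators, means, sds, 1.0, random.Random(0))
print(indicators)
```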

Finally, we run the Markov chain simulation for 100 iterations:

    > mix-mc rlog.2 100

This takes about three seconds on our 550MHz Pentium III.  Once it has
finished, we can look at the hyperparameters and component parameters
at various iterations.  The last iteration should look something like
the following:

    > mix-display rlog.2

    MIXTURE MODEL IN FILE "rlog.2" WITH INDEX 100
    
    HYPERPARAMETERS
    
    Standard deviations for component offsets:
    
        0.068:     3.259    7.159
    
    Means for component offsets:
    
                  -1.408  +17.102
    
    Standard deviations for Gaussian target distributions:
    
        0.173:     0.201    4.680
    
    
    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE
    
       1: 0.706   -2.052  +11.597
    
                   1.034    5.355
    
       2: 0.294   +1.751  +20.920
    
                   1.115    6.476

We see above that the two components are associated with fractions of
approximately 0.7 and 0.3 of the training cases, as expected from the
way the data was generated.  For each component, the two offset
parameters, giving the Gaussian means, are shown on the first line,
and the standard deviation parameters on the following line.  These
component parameters are approximately what we would expect from
looking at a plot of the data, but of course the two Gaussian
components cannot perfectly model the actual distribution.
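
To make the meaning of these numbers concrete, the Python sketch below
evaluates the density of the fitted two-component mixture at a point.
It is an illustration only.  It uses the fractions of training cases
shown above as stand-ins for the mixing proportions, and it assumes
that the two targets are independent Gaussians within each component.

```python
import math

# Values copied from the 'mix-display' output above:
# (fraction of training cases, means, standard deviations).
components = [
    (0.706, (-2.052, 11.597), (1.034, 5.355)),
    (0.294, (1.751, 20.920), (1.115, 6.476)),
]

def normal_pdf(y, mean, sd):
    """Density of a univariate Gaussian."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mixture_density(point, components):
    """Density at a bivariate point, with the targets treated as
    independent Gaussians within each component."""
    total = 0.0
    for weight, means, sds in components:
        dens = weight
        for y, m, s in zip(point, means, sds):
            dens *= normal_pdf(y, m, s)
        total += dens
    return total

print(mixture_density((-2.0, 12.0), components))
```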


An infinite mixture model for the density estimation problem.

To more closely approximate the true distribution, we can use a
mixture model with a countably infinite number of Gaussian components.
An infinite mixture is used if we simply omit the argument giving the
number of components in the 'mix-spec' command.  We must also change
the specification for the Dirichlet concentration parameter, preceding
it with an 'x' to indicate that it should be scaled so as to produce a
sensible infinite limit.  In the specification below, a moderate
value of 5 is chosen for the concentration parameter, reflecting our
belief that a fairly large number of components will have substantial
probability (the other prior specifications are the same as before):

    > mix-spec rlog.inf 0 2 / x5 0.05:0.5:0.2 10
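
To get a feel for what a concentration of 5 implies, the Python sketch
below computes the prior expected number of components associated with
at least one of n cases.  It is an illustration only, with a
hypothetical function name.  It assumes that the infinite limit
behaves as a standard Dirichlet process mixture, in which the case
following i earlier cases starts a new component with probability
conc/(conc+i).

```python
def expected_occupied_components(conc, n_cases):
    """Prior expected number of distinct components among n_cases
    cases, for the infinite limit with concentration conc."""
    return sum(conc / (conc + i) for i in range(n_cases))

# With 500 training cases, a concentration of 5 gives a prior
# expectation in the low twenties, against about 7 for a
# concentration of 1.
print(expected_occupied_components(5.0, 500))
print(expected_occupied_components(1.0, 500))
```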

The 'model-spec' and 'data-spec' commands are the same as before:

    > model-spec rlog.inf real 0.05:0.5:0.5:1
    > data-spec rlog.inf 0 2 / rdata@1:500 . 

The 'mc-spec' command must be altered, however, since it is not
possible to do Gibbs sampling for component indicators when there are
an infinite number of components.  The met-indicators operation is
used instead, with 10 changes being proposed to every indicator:

    > mc-spec rlog.inf repeat 20 met-indicators 10 gibbs-params gibbs-hypers
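
One standard way of doing Metropolis updates for an indicator in an
infinite mixture is to propose a component from the conditional prior
given the other indicators, and to accept the proposal with
probability given by the likelihood ratio.  The Python sketch below
illustrates this scheme.  Whether 'met-indicators' works in exactly
this way is an assumption, and the function and argument names are
hypothetical.

```python
import math
import random

def met_update_indicator(i, indicators, params, log_lik, draw_prior,
                         conc, rng, n_proposals=10):
    """Propose n_proposals changes to indicator i, updating in place.

    params maps component label -> parameters; log_lik(i, theta) is
    the log likelihood of case i under parameters theta; and
    draw_prior(rng) draws parameters for a new component."""
    for _ in range(n_proposals):
        current = indicators[i]
        # Count the other cases in each occupied component.
        counts = {}
        for j, c in enumerate(indicators):
            if j != i:
                counts[c] = counts.get(c, 0) + 1
        # Propose an existing component with probability proportional
        # to its count, or a new one with probability proportional to
        # the concentration.
        labels = list(counts)
        weights = [counts[c] for c in labels]
        new_label = max(params) + 1
        labels.append(new_label)
        weights.append(conc)
        proposal = rng.choices(labels, weights=weights)[0]
        theta = draw_prior(rng) if proposal == new_label else params[proposal]
        # Accept with probability min(1, likelihood ratio).
        log_ratio = log_lik(i, theta) - log_lik(i, params[current])
        if math.log(1.0 - rng.random()) < log_ratio:
            params[proposal] = theta
            indicators[i] = proposal
        # Discard components that are no longer used by any case.
        for c in [c for c in params if c not in indicators]:
            del params[c]
    return indicators

# Tiny example: one real target, unit-variance Gaussian components.
data = [0.0, 0.1, 5.0]
indicators = [0, 0, 0]
params = {0: 0.0}                       # component label -> mean
rng = random.Random(0)
for i in range(len(data)):
    met_update_indicator(
        i, indicators, params,
        log_lik=lambda j, mean: -0.5 * (data[j] - mean) ** 2,
        draw_prior=lambda r: r.gauss(0.0, 5.0),
        conc=1.0, rng=rng)
print(indicators, params)
```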

We can now run the simulation for 100 iterations (which takes about
half a minute on our 550MHz Pentium III):

    > mix-mc rlog.inf 100

If we now examine the state with 'mix-display', we will find that
quite a few (e.g., 20) mixture components are associated with training
cases, though a smaller number account for the bulk of the cases.

We can see how well the model has captured the true distribution by
generating a sample of cases from a distribution drawn from the
posterior, as represented by the state at a particular iteration.  We
do this as follows:

    > mix-cases rlog.inf 100 new 1000

This command generates 1000 new cases based on iteration 100, and
stores them (one per line) in the file "new".  We can now use a plot
program to view a scatterplot of the data in "new", and compare it
with a scatterplot of data from the actual distribution.  Note that
the data in "new" is taken jointly from one example of a distribution
from the posterior distribution.  If 'mix-cases' is called for another
iteration, it will produce data from a different distribution from the
posterior, which in general could be quite different.  This variation
represents the uncertainty regarding the true distribution that
remains when only a finite amount of training data is available.  A
representation of the predictive distribution for a single new data
point, which is the average of distributions drawn from the posterior,
could be obtained by combining the output of 'mix-cases' for a number
of iterations.
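
A minimal Python sketch of this pooling follows.  It assumes that
'mix-cases' has already been run separately for several iterations,
writing to files whose names (here "new.60", "new.80", and "new.100")
are hypothetical, and that each line of these files holds the values
for one case separated by whitespace.

```python
def read_cases(path):
    """Read one case per line, as whitespace-separated numbers."""
    with open(path) as f:
        return [[float(v) for v in line.split()]
                for line in f if line.strip()]

def pool_cases(paths):
    """Concatenate the cases from several 'mix-cases' output files."""
    pooled = []
    for path in paths:
        pooled.extend(read_cases(path))
    return pooled

# Example use, once the files exist:
#   cases = pool_cases(["new.60", "new.80", "new.100"])
```

A scatterplot of the pooled cases then approximates the predictive
distribution, provided the same number of cases was generated at each
iteration so that the iterations are weighted equally.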