A SIMPLE REGRESSION PROBLEM

As a first example, we will look at a simple regression problem, in
which there is one real-valued input for each case, and one
real-valued target, whose value is to be predicted.  This example will
also illustrate some important issues with usage of Markov chain Monte
Carlo.

I generated synthetic data of this type in which the input variable,
x, for each case had a standard Gaussian distribution and the
corresponding target value came from a Gaussian distribution with
standard deviation 0.1 and mean given by

         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2)

I generated 1100 cases in total, stored in the file 'rdata'.  Each
case consists of a line containing first the input value and then the
target value.  The first 100 of these cases are meant for training,
and the remaining 1000 for testing.  The training cases are plotted in
rdata-train.png and the test caes are plotted in rdata-test.png.


A neural network model for the regression problem.

We will model this data using a multilayer perceptron with one input
unit, one hidden layer of eight tanh units, and a single output unit
whose activation function is the identity.  The value of the output
unit will be taken as the mean of a Gaussian distribution for the
target, with the standard deviation of this Gaussian (the noise level)
being a hyperparameter to be estimated along with the parameters of
the network.  We will also use hyperparameters to express the prior
distributions for the network parameters.  Specifically, we will use
one hyperparameter for the input-to-hidden weights, one for the hidden
unit biases, and one for the hidden-to-output weights.  The output
unit bias will be given a simple Gaussian prior, with no adjustable
hyperparameter.  (The role of hyperparameters is primarily to
introduce dependencies between parameters, so they are usually not
used when they would control just a single parameter.)

The first step in applying this model to the data is to create a log
file containing the specifications for the network architecture and
the priors to use for the network parameters.  This can be done using
the following command:

    > net-spec rlog.net 1 8 1 / ih=0.05:0.5 bh=0.05:0.5 ho=x0.05:0.5 bo=100

Here, "rlog.net" is the name of the new log file, and the arguments
"1", "8", and "1", specify the numbers of input, hidden, and output
units.  Following the "/", the priors for the various groups of
network parameters are given, with the groups identified by
abbreviations such as "ih" for "input-to-hidden" weights (see
net-spec.doc for the complete list).  The groups in the above command
that are for the input-hidden weights, the hidden biases, the
hidden-to-output weights, and the output bias.

In general, the prior specifications used in the net-spec command
consist of a "width" value followed by up to three "alpha" values,
with perhaps an option character tacked on to the front.  For the full
details, see net-models.PDF and prior.doc.  Here, I will just comment
on the particular priors used above.

The prior specification used for the output bias is simply "100",
which means that the bias has a Gaussian prior with mean zero and
standard deviation 100.  The prior specifications of the form
"0.05:0.5" indicate that the parameters in these groups are associated
with a hyperparameter, which gives the standard deviation of a
Gaussian prior for these parameters.  The hyperparameter itself has a
rather vague prior that spans several orders of magnitude around one.
The inverse gamma priors used are somewhat difficult to visualize,
because their tails are asymmetrical, but some standard choices are
often appropriate.  Here, the "0.5" after the colon controls how vague
the prior is (closer to zero is more vague).  The "0.05" specifies the
location of this vague distribution, but due to the asymmetry of the
tails, it is closer to being the lower limit of the prior than the
centre (for vague priors such as this).

The "x" in front of the prior for the hidden-to-output weights
indicates that the prior should be automatically rescaled based on the
number of hidden units, so as to produce an effect that is independent
of the number of hidden units (in the limit of large numbers).

Once the network has been specified, we need to say how the network
outputs will be used to model the targets (response variables) in the
data set.  We do this with the 'model-spec' command:

    > model-spec rlog.net real 0.05:0.5

In this case, the targets are real-valued, and are modeled as the
network output plus Gaussian noise, with the noise standard deviation
being a hyperparameter having the prior given by the last argument of
the command.  The syntax of the prior specification is the same as for
the priors on network parameters. 

You can view the architecture and prior specifications stored in the
log file by invoking 'net-spec' with just a log file argument.  In
this example, this should give the following result:

    > net-spec rlog.net

    Network Architecture:
    
      Input layer:     size 1
      Hidden layer 0:  size 8  tanh
      Output layer:    size 1
    
    
    Prior Specifications:
    
             Hidden Layer 0
    
      Input-Hidden Weights:    0.050:0.50
      Hidden Biases:           0.050:0.50
    
             Output Layer
    
      Hidden0-Output Weights: x0.050:0.50
      Output Biases:           100.000

You can also view the model specification by invoking 'model-spec'
with just one argument giving the log file.

Once the network and data model have been specified, we need to
specify the data sets to be used for training and (optionally) for
testing.  We do this using the 'data-spec' command:

    > data-spec rlog.net 1 1 / rdata@1:100 . rdata@101: .
    Number of training cases: 100
    Number of test cases: 1000

Here, "rlog.net" is the log file we created with 'net-spec', to which
the data specifications will be appended.  The "1" and "1" arguments
give the numbers of inputs and targets.  These must be consistent with
the network architecture (if not, an error will be reported later when
you try to start the training).

After the "/", specifications for where to get the training and test
data are given.  Each such specification consists of two parts: the
source for the inputs, and the source for the targets.  The
specification "rdata@1:100" means that the training inputs come from
the file 'rdata', in lines 1 to 100, while the specification of
"rdata@101:" for the test inputs indicates that they also come from
the file 'rdata', but in lines 101 to the end of the file.  In the
above command, the sources for the targets are given as just ".",
which means the target items are on the same lines as the inputs,
following the last input item.  We could have said that the targets
come from a completely different file, however.  It is also possible
to specify exactly where on a line the inputs and targets are located
(and hence to ignore some items in the file).  For documentation on
these and other features, see numin.doc.

Though it is not done above, the 'data-spec' command also allows you
to specify transformations to be applied to the inputs or targets
before they are used.  This is useful, for example, if you wish to use
inputs that have been "standardized" to have mean zero and variance
one, based on the training data.  See data-spec.doc for the details.
In this example, the data already has approximately mean zero and
variance one, so the priors used are sensible without normalization.

In the training runs reported in the thesis, I used a short "initial
phase" to get things started, followed by a long "sampling phase" to
bring the simulation to equilibrium and then produce a sample of
networks from the posterior for use in prediction.  I still use the
same general procedure, but with some changes to how the initial phase
is done.

It seems desirable to start the simulation in a state where the
hyperparameters take on moderate values, and leave them fixed for a
few iterations so that the network parameters will also take on
moderate values.  This can be accomplished using the following
commands:

    > net-gen rlog.net fix 0.5
    > mc-spec rlog.net repeat 20 sample-noise heatbath hybrid 100:10 0.2
    > net-mc rlog.net 1

The 'net-gen' command stores a network in the log file with index
zero, in which the hyperparameters have values of 0.5, and the network
parameters are zero.  This is the initial state of the simulation run.
The following 'mc-spec' command specifies the Markov chain operations
to be performed in the initial phase.  Here, each iteration consists
of 20 repetitions of the following steps:  Gibbs sampling for the
noise level, a heatbath replacement of the momentum variables, and a
hybrid (Hamiltonian) Monte Carlo update with a trajectory 100 leapfrog
steps long, using a window of 10, and a stepsize adjustment factor of
0.2.  Note that the hyperparameters are not updated, and hence will
remain fixed at values of 0.5.  Finally, a single such iteration is
done by calling 'net-mc' with an iteration limit of 1.

The stepsize adjustment factor of 0.2 used above is typical of what is
needed, but will not be appropriate in all circumstances.  After the
'net-mc' command has finished, the number of the 20 hybrid Monte Carlo
updates that were rejected can be determined using the command
'net-plt t r rlog.net', which will write the iteration number (of 1)
and the rejection rate on standard output.  If the rejection rate is
high (say, over 0.3), a new run should be done using a smaller
stepsize adjustment factor.  In the initial phase, one would generally
start by guessing a value for the stepsize adjustment factor that is
on the low side, since there is no point in optimizing this choice.

At this point, we hope to have a network stored in the log file (with
index 1) that has values for both the parameters and hyperparameters
that are of moderate magnitude, and which have adapted at least
somewhat to the training data.  We can now start serious sampling with
the following commands:

    > mc-spec rlog.net sample-sigmas heatbath hybrid 2000:10 0.5
    > net-mc rlog.net 1000

The 'mc-spec' command appends a new set of Markov chain operations to
the log file, which will override the previous set.  These operations
are Gibbs sampling for both the hyperparameters and the noise level
(the "sigmas"), a heatbath update for the momentum variables, and a
hybrid Monte Carlo update with a trajectory 2000 leapfrog steps long,
a window of 10, and a stepsize adjustment factor of 0.5.  A long
trajectory length is typically desirable for the sampling phase.  As
in the initial phase, the stepsize adjustment factor of 0.5 used is
typical, but not universally applicable.  It may pay at this stage to
experiment in order to find out how large this factor can be while
keeping the rejection rate low.  The use of a "window" of around 10
states costs little and is often beneficial.

The 999 iterations of the sampling phase started with the command
'net-mc rlog.net 1000' take 17 seconds to complete on the the system
used (see Ex-test-system.doc).  If your computer is slower, you might
put the command in the background (or add a '&' to the end of the
'net-mc' command), and then monitor progress while you wait.  For
example, you can look at the last network saved in the log file (or
any earlier one) using 'net-display'.  Shortly after starting the run,
you might see the following:

    > net-display rlog.net

    Network in file "rlog.net" with index 150
    
    Input to Hidden Layer 0 Weights [1]
    
     0.48 0.48: +0.20  +0.51  +0.55  -0.16  +1.04  +1.02  +0.25  -0.87
    
    Hidden Layer 0 Biases [2]
    
          1.97: +0.48  -4.15  -1.84  -0.46  -1.10  +1.32  -0.31  -3.40
    
    Hidden Layer 0 to Output Weights [3]
    
     9.16 9.16: +9.68
    
          9.16:-12.70
    
          9.16: -4.17
    
          9.16: +1.96
    
          9.16: -6.27
    
          9.16: -5.19
    
          9.16:+19.00
    
          9.16: -5.24
    
    Output Biases [4]
    
        100.00:-18.77
    
    Noise levels
    
       0.10 -  0.10

This display of network parameters and hyperparameters is divided into
sections for different parameter groups.  Within each section, the
numbers before the colons are hyperparameters, those after are
parameters (weight and biases).  There are more hyperparameters shown
than were mentioned earlier, but for this network architecture, the
extra hyperparameters are either fixed in value (the 100 for output
biases), or tied to the value of a higher-level hyperparameter, so
they are effectively not present.

The parameter groups in the 'net-display' output are identified by
numbers in square brackets.  These can be used with the 'h', 'w', 
and 'W' quantities of 'net-plt'.  For example, to see how the
hyperparameter controlling the hidden-to-output weights has changed
during the simulation (so far), one can use the command

    > net-plt t h3 rlog.net | plot

where 'plot' is some suitable plot program, as discussed in
Ex-intro.doc.  (One can also just invoke net-plt and look at the
numbers printed on standard output.)  Here 'h3' refers to the
top-level hyperparameter for group 3, which is seen in the output of
'net-display' above to be the hidden-to-output group.  Here's a
command that shows how the individual weights in this group change
during the run:

    > net-plt t w3@ rlog.net | plot

In this case, the "plot" program must be one that is capable of
displaying more than one superimposed graph, with the data for each
graph being separated by a blank line.  Some plot programs may prefer
the data to come with multiple values per line, which is the format
produced by the 'net-tbl' command:

    > net-tbl tw3@ rlog.net | some-other-plot-program

Note that there is no space between the "t" quantity and the others in
this command.

By looking at plots of the hyperparameters and quantities such as the
squared error on the training set ('b'), one can get an idea of when
the simulation has reached equilibrium.  For this run, a plot of the
'b' quantity can be seen in rlog-b.png, and a plot of the three
hyperparameters (on a log scale) in rlog-h1h2h3.png.  Note that if you
run these commands and produce these plots, you might see something
different, due to differences in floating-point roundoff error.  See
below for a discussion of how the results you see might differ.

Networks from a point where equilibrium seems to have been reached can
be used to make predictions for test cases using the 'net-pred'
program.  From the plots above, it seems that equilibrium has been
reached after the first 100 iterations (or fewer).  So we could
discard the first 100 iterations as perhaps not coming from the
equilibrium distribution, and use the following command to use the
networks from the remaining 900 iterations to produce predictions for
all test cases, and report the average squared error:

    > net-pred itn rlog.net 101: 

    Number of iterations used: 900
    
    Case  Inputs Targets   Means Error^2
    
       1    0.75    1.74    1.75  0.0002
       2    1.21    1.21    1.14  0.0054
       3   -0.54    0.36    0.48  0.0157
    
            (middle lines omitted)

     998   -1.56    0.48    0.42  0.0033
     999    0.15    1.69    1.62  0.0043
    1000    0.74    1.77    1.76  0.0002
    
    Average squared error guessing mean:    0.01033+-0.00048

The options "itn" specified ask for a listing of the inputs ("i") and
targets ("t") for each case, along with the mean ("n") output for that
case of the 900 networks used for prediction.  The squared error when
using this mean to predict the target is shown for each case, and the
average squared error for the test cases is shown at the bottom, along
with its standard error with respect to the random selection of test
cases.  Considering that the average squared error with optimal
prediction is 0.01 (due to the noise of standard deviation 0.1 added
when generating the data), the network model has done quite well.

We can ask for just a summary of performance on test cases as follows:

    > net-pred npa rlog.net 101:

    Number of iterations used: 900

    Number of test cases: 1000

    Average log probability of targets:     0.854+-0.019
    Average squared error guessing mean:    0.01033+-0.00048

Here, the 'a' option gives only the averages (not predictions for
individual cases), and the 'p' option asks for performance in terms of
log probability of targets.

It is also possible to get predictions for cases that are not in the
test set that was specified with 'data-spec'.  For example:

    > net-pred nb rlog.net 101: / "%echo 2.3"
      +1.27448149e+00

Here, the options "nb" ask for only the predictive mean, with "bare"
output (no headings, also higher precision, in exponential format).
The argument at the end says that the inputs for test cases (here,
just one case) should be taken from the output of the Unix command
"echo 2.3", which just outputs the number 2.3.  Since no true value
for the target is provided, the squared error with this prediction
is not shown.

The 'net-pred' program can also find the median and the 10% and 90%
quantiles of the predictive distribution.  The program limits the
number of iterations that can be used when finding medians and
quantiles, so (though it's not actually necessary here) the command
below uses "%5" to look only at the 180 iterations above 100 with
numbers that are multiples of five:

    > net-pred itdq rlog.net 101:%5

    Number of iterations used: 180

    Case  Inputs Targets Medians |Error| 10% Qnt 90% Qnt

       1    0.75    1.74    1.75  0.0132    1.62    1.89
       2    1.21    1.21    1.14  0.0737    0.99    1.28
       3   -0.54    0.36    0.49  0.1266    0.35    0.62

                        (middle lines omitted)

     998   -1.56    0.48    0.42  0.0579    0.27    0.57
     999    0.15    1.69    1.62  0.0693    1.48    1.75
    1000    0.74    1.77    1.76  0.0148    1.62    1.90
    
    Average abs. error guessing median:     0.08091+-0.00192

When the median is used as the "best guess", performance is judged by
the average absolute error, not squared error, since this is the error
measure that is minimized by the true median.

These results seem quite good, so we might decide we're finished.
However, we seldom can be absolutely sure that our MCMC runs have
actually sampled well from the true posterior distribution.  To obtain
more confidence that our results are correct, we can run the Markov
chain for longer.  It's possible that this will reveal that we haven't
actually reached equilibrium.  And even if we are actually sampling
from the right distribution, a longer run will provide more accurate
Monte Carlo estimates, which would typically improve predictions at
least slightly.

We can extend the chain to 3000 iterations by just running net-mc
again with 3000 as the iteration limit:

    > net-mc rlog.net 3000

We can then look at the plots of squared error on the training set and
of the hyperpameters for this longer run.  These plots can be seen in
rlog-b-3000.png and rlog-h1h2h3-3000.png.  They reveal that shortly
after iteration 1000, the chain moves to states in which the squared
error on the training set is somewhat larger, and the hyperparameter
values are quite different!

We might conclude from this that the chain really only reaches
equilibrium after about 1050 iterations, and we should discard earlier
iterations when making predictions.  This gives results as follows:

    > net-pred npa rlog.net 1051:

    Number of iterations used: 1950

    Number of test cases: 1000

    Average log probability of targets:     0.805+-0.027
    Average squared error guessing mean:    0.02016+-0.00385

We see that performance is actually a bit worse than using iterations
from 101 to 1000.

We might decide to run for even longer, as follows:

    > net-mc rlog.net 50000

Plots of squared error on training cases and of hyperparameter values
for this long run are in rlog-b-50000.png and rlog-h1h2h3-50000.png.
We now see that the posterior distribution is multimodal.  Only one
mode was seen in iterations from 101 to 1000.  Soon after iteration
1000, we see the only other mode, up through iteration 3000.  But in
the run for 50000 iterations, we see movement back and forth between
the two modes, though more time is spent in one of the modes than in
the other.  Which mode was seen first in this run will have been due
to the particular random initialization done.  A different random
initialization, or even slightly different roundoff errors when using
a different computer system, might have led to the other mode being
seen first.

Note that states in the mode where the chain spends more time actually
have larger training set error on average (and consequently lower
posterior probability / higher energy).  This is possible if this mode
has a larger volume than the other mode.

The performance using all but the first 100 of the 50000 iterations is
as follows:

    > net-pred npa rlog.net 101:

    Number of iterations used: 49900

    Number of test cases: 1000

    Average log probability of targets:     0.830+-0.022
    Average squared error guessing mean:    0.01873+-0.00335

This is a bit better than when using states from iterations 1051 to
3000, but not as good as when using only states from iterations 101 to
1000.  The difference is due to predictions at points outside the
range of the training data.  We can see this using the following
command:

    > ( net-pred inb rlog.net 101:1000 / "%grid -3:3%0.01"
    >   echo " "
    >   net-pred inb rlog.net 1051:3000 / "%grid -3:3%0.01"
    > ) | plot

This finds the predictions for a grid of inputs from -3 to 3 using
first the iterations from 101 to 1000 (sampling from the first mode)
and then those from iterations 1051 to 3000 (sampling the second
mode).  The result can be seen in rlog-2-pred.png.

We can see that the predictions are very similar in the region where
there is training data (see rdata-train.png).  The test data (see
rdata-test.png) includes some points that are outside this region,
where the predictions using states from the two modes differ quite a
bit.

We can look in detail at the predictions at inputs -3, 0, and +3 using
the mode sampled in iterations 101 to 1000, the mode sampled in
iterations 1051 to 3000, and using both modes as sampled in iterations
101 to 50000:

    > net-pred inqQ rlog.net 101:1000 / "%echo -3; echo 0; echo 3"
    
    Number of iterations used: 900
    
    Case  Inputs   Means 10% Qnt 90% Qnt  1% Qnt 99% Qnt
    
       1   -3.00   -1.34   -2.45   -0.35   -3.51    0.51
       2    0.00    1.39    1.25    1.53    1.14    1.64
       3    3.00    2.27    1.69    2.87    1.13    3.26

    > net-pred inqQ rlog.net 1051:3000%2 / "%echo -3; echo 0; echo 3"

    Number of iterations used: 975

    Case  Inputs   Means 10% Qnt 90% Qnt  1% Qnt 99% Qnt

       1   -3.00   -2.19   -2.65   -1.75   -3.13   -1.42
       2    0.00    1.38    1.25    1.52    1.13    1.64
       3    3.00    3.02    2.69    3.33    2.44    3.62
    
    > net-pred inqQ rlog.net 101:50000%50 / "%echo -3; echo 0; echo 3"
    
    Number of iterations used: 998
    
    Case  Inputs   Means 10% Qnt 90% Qnt  1% Qnt 99% Qnt
    
       1   -3.00   -2.10   -2.61   -1.62   -3.12   -0.44
       2    0.00    1.38    1.25    1.52    1.13    1.64
       3    3.00    2.96    2.61    3.31    1.81    3.59

Here are the true values of the function at these inputs:

    > calc x=-3 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)"
    -1.27494
    > calc x=0 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)"
    1.4
    > calc x=3 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)"
    2.09494

The predictions for input of 0 are all very similar, and quite good.

The predictions for inputs of -3 and +3 are much better when using
iterations from 101 to 1000 - the extrapolation beyond the training
data using parameters from this mode happens to better match the true
function.  This seems to mostly be luck, since absent strong prior
knowledge, how to extrapolate is just a guess.  When using iterations
1051 to 3000, the extrapolation is not good, and the true values lie
outside the interval from the 1% to the 99% quantiles - a sign of
overconfidence in the predictions.  However, when using all iterations
from 101 to 50000, the true values do lie within the interval from the
1% to 99% quantiles, showing that sampling both modes gives a better
indication of the true uncertainly (even though the mean prediction is
still not very good).


A Gaussian process model for the regression problem.

We can also model this data using a Gaussian process.  Such a model is
similar to a network model with an infinite number of hidden units.
The weights in this hypothetical infinite network are not represented
explicitly (fortunately, since this would require an infinite amount
of memory).  Only the hyperparameters are explicitly represented.

A Gaussian process model is specified using the gp-spec command, which
is analogous to the net-spec command.  For the simple regression
model, the following is one appropriate specification:

    > gp-spec rlog.gp 1 1 100 / 0.05:0.5 0.05:0.5

Here, "rlog.gp" is the name of the new log file that will hold the
results of the Gaussian process run.  The first two arguments
following the log file are the numbers of inputs and outputs,
respectively, both "1" for this problem.

The (optional) argument of "100" that follows is the prior for the
constant part of the covariance function used.  This corresponds to
the prior for the output unit bias in a network model.  A
specification for a linear part of the covariance could follow (but
doesn't here); it would correspond to a prior for direct input-output
connections in a network.  For reasons of computational accuracy, it
is best not to use too vague a prior for the constant part of the
covariance, even though that would not usually be a problem from a
statistical point of view.

The remaining arguments (after the "/") give the priors for the
hyperparameters used in an exponential term of the covariance
function.  These priors correspond to those for the hidden-output and
input-hidden weights in a network model.  (There is no counterpart
here to the prior for the hidden unit biases in a network model.)  The
first prior is for the scale of this term, which controls the
magnitude of the non-linear variation in the function.  The second
prior is for the relevance of the input, which controls the amount by
which the input has to change to produce a change in the non-linear
component of the function that is comparable to the overall scale over
which this component varies.  The prior specifications are in the same
form as is used for network specifications (see prior.doc).  The
specifications of "0.05:0.5" used here are vague, allowing these
hyperparameters to take on values over a wide range.

The specification can be viewed by invoking 'gp-spec' with just the
name of the log file:

    > gp-spec rlog.gp

    Number of inputs:    1
    Number of outputs:   1

    Constant part of covariance:  100.000

    Exponential parts of covariance:

       Scale           Relevance            Power   Flags

       0.050:0.50      0.050:0.50           2.000 

Once the Gaussian process model for functions has been specified, we
can specify how the function values are used to model the targets in
the dataset using 'model-spec', in exactly the same was as for a
network model:

    > model-spec rlog.gp real 0.05:0.5

We also say where the training and (optionally) the test data comes
from using 'data-spec':

    > data-spec rlog.gp 1 1 / rdata@1:100 . rdata@101: .

The model and data specifications can be viewed by invoking these
programs with just the name of a log file.

We are now ready to sample from the posterior distribution of the
hyperparameters for the Gaussian process model.  To start, we can fix
the hyperparameters at reasonable initial values, using 'gp-gen':

    > gp-gen rlog.gp fix 0.5 0.1

This fixes the scale hyperparameters to 0.5 and the relevance
hyperparameters to 0.1 (linear hyperparameters, if present, would be
fixed to the product of these).  By default, the hyperparameters are
set to the "width" value from their prior specification.  Because the
priors are often vague (as here), this may not be a very reasonable
starting point.

We now specify the Markov chain operations to be used in sampling.
There are a great many possibilities for these operations.  Here is
one reasonable method:

    > mc-spec rlog.gp heatbath hybrid 20:4 0.5

This uses hybrid Monte Carlo, with trajectories 20 leapfrog steps
long, with a window of 4 states.  The stepsize adjustment factor used
is 0.5.  If the rejection rate turns out to be too high (as can be
checked using the 'gp-plt t r rlog.gp' command), the stepsize should
be adjusted downward.

To perform these sampling operations 100 times, we use the following
command:

    > gp-mc rlog.gp 100

This takes 0.89 seconds on the system described in Ex-test-system.doc.
We can use 'gp-plt' or 'gp-display' to see how things went.  The
quantities that can be plotted with 'gp-plt' are similar to those that
can be plotted using 'net-plt', except that quantities relating to
test cases have been omitted, since they would often take a long time
to compute (the 'E' and 'H' quantities, defined in the "mc" module,
may also take a long time).  See gp-quantities.doc for details.  

The following will plot the hyperparameters being sampled in this run:

    > gp-plt t nS1R1 rlog.gp | plot

Using a plotting option that makes the vertical scale be logarithmic
would be useful in this case (this is the -ly option for 'graph').

Iterations from the latter part of the run can be used to make
predictions for test cases.  This is done using 'gp-pred', which
operates much like 'net-pred'.  The following command makes
predictions for the test cases based on the last 80 of the 100
iterations, and reports the average squared error:

    > gp-pred na rlog.gp 21:

    Number of iterations used: 80

    Number of test cases: 1000

    Average squared error guessing mean:    0.01308+-0.00127

This takes less than 0.1 seconds on the sytem used, but predictions
will take longer when the number of training cases is larger, or if
the median or log probability are to be found (options "d" or "p" of
'gp-pred').  As can be seen, the performance of the Gaussian process
model is similar to that of the neural network model for this problem.

The predictions for test cases made above are found directly from the
covariances between the targets in the training cases and the unknown
target in a test case.  The values of the regression function for the
training cases are never explicitly found.  Consequently, it is not
possible to plot quantities such as the squared error on training
cases over the course of the run.  To plot such quantities, you will
have to ask for the function values for training cases to be generated
in each iteration.  This takes a significant amount of time, and can
potentially cause numerical problems, which is why gp-plt won't just
do it as needed.

If you want to be able to plot the squared error on training cases (or
similar quantities such as case-by-case likelihoods), you will need to
change the 'mc-spec' command to the following:

    > mc-spec rlog.gp2 discard-values heatbath hybrid 20:4 0.5 sample-values

The "sample-values" operation at the end generates function values for
all the training cases, which will be stored in the log file.  These
values can later be used to compute the squared error for training
cases, which can be plotted with a command such as

    > gp-plt t b rlog.gp2 | plot

The "discard-values" operation throws away the function values (if
present) before the operations for updating hyperparameters.  Throwing
away information may seem wasteful, but it actually improves
convergence in this context.

Unfortunately, if you make only this change, you will probably get the
following error message when you try to run 'gp-mc':

    Couldn't find Cholesky decomposition (2) in sample-values!

This message is produced when the round-off error in the matrix
computations used by "sample-values" is enough to turn the results
into nonsense.  The problem is due to the poor "conditioning" of the
covariance matrix.  Roughly speaking, the covariances between
neighbouring training cases are so high that knowing all but one
function value is enough to determine the remaining function value to
a precision comparable to the level of round-off error.

To fix this, the conditioning of the covariance matrix must be improved.
Changing the 'gp-spec' command as follows seems to be sufficient:

    > gp-spec rlog.gp2 1 1 10 - 0.01 / 0.05:0.5 0.05:0.5

There are two changes here from the 'gp-spec' command used before.
First, the constant part of the covariance has been reduced from 100
to 10, which makes little difference when the data is centred at about
zero, as it is for this problem.  Since arithmetic is done in floating
point, this increases the effective precision of the covariances.
Second, the covariance now includes a "jitter" part of 0.01 (the "-"
preceding this indicates that there is still no linear part).  Jitter
is much like noise, in that it varies independently from one case to
another, but it is considered part of the function value, which noise
is not.  The jitter makes all the function values less predictable,
reducing the problem of poor conditioning.  Jitter plays a more
crucial role for binary and class models.