EXAMPLES OF BAYESIAN NETWORK TRAINING

This section shows how Bayesian training for neural network models can
be done for three simple synthetic problems.

The output shown below was obtained by running the software on our
machine, with ">" at the start of a line indicating a command line
that was input.  It is possible (even likely) that your results will
differ, even if you have installed the software correctly, since small
differences in floating point arithmetic can be magnified into large
differences in the course of the simulation.  However, unless one of
the simulations became stuck in an isolated local mode, the final
predictions you obtain from 'net-pred' should be close to those
reported below.

All the data sets mentioned here are present in the 'examples'
sub-directory, along with the C source of the programs that generated
them.  It is assumed below that you have changed to this directory.
The command sequences for running the simulations that are mentioned
below are also stored in this directory, in shell files with the names
'rcmds', 'bcmds', and 'ccmds'.

Note that the particular network architectures, priors, and Markov
chain sampling options used below are only examples of reasonable
choices.  There are many other possibilities that are also reasonable.
To gain a full understanding of the various possibilities, and their
the detailed documentation in the ".doc" files.

A simple regression problem

As a first example, we will look at a simple regression problem,
in which there is one real-valued input for each case, and one
real-valued target, whose value is to be predicted.

I generated synthetic data of this type in which the input variable,
x, for each case had a standard Gaussian distribution and the
corresponding target value came from a Gaussian distribution with
standard deviation 0.1 and mean given by

0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2)

I generated 200 cases in total, stored in the file 'rdata'.  Each case
consists of a line containing first the input value and then the
target value.  The first 100 of these cases are meant for training,
and the second 100 for testing.

We will model this data using a multilayer perceptron with one input
unit, one hidden layer of eight tanh units, and a single output unit
whose activation function is the identity.  The value of the output
unit will be taken as the mean of a Gaussian distribution for the
target, with the standard deviation of this Gaussian (the noise level)
being a hyperparameter to be estimated along with the parameters of
the network.  We will also use hyperparameters to express the prior
distributions for the network parameters.  Specifically, we will use
one hyperparameter for the input-to-hidden weights, one for the hidden
unit biases, and one for the hidden-to-output weights.  The output
unit bias will be given a simple Gaussian prior, with no adjustable
hyperparameter.  (The role of hyperparameters is primarily to
introduce dependencies between parameters, so they are usually not
used when they would control just a single parameter.)

The first step in applying this model to the data is to create a log
file containing the specifications for the network architecture and
the priors to use for the network parameters.  This can be done using
the following command:

> net-spec rlog 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100

Here, 'rlog' is the name of the new log file, and the arguments "1",
"8", and "1", specify the numbers of input, hidden, and output units.
Following the "/", the priors for the various groups of network
parameters are given, with a "-" indicating that a parameter group
should be omitted (equivalent to the parameters being zero).  The
order of the groups is a bit hard to remember, but you can get a
summary easily by just invoking 'net-spec' with no arguments.  The
groups in the above command that are not omitted are the input-hidden
weights, the hidden biases, the hidden-output weights, and the output
bias.

In general, the prior specifications used in the net-spec command
consist of a "width" value followed by up to three "alpha" values,
with perhaps an option character tacked on to the front.  For the full
details, see Appendix A of my thesis and prior.doc.  Here, I will just
comment on the particular priors used above.

The prior specification used for the output bias is simply "100",
which means that the bias has a Gaussian prior with mean zero and
standard deviation 100.  The prior specifications of the form
"0.05:0.5" indicate that the parameters in these groups are associated
with a hyperparameter, which gives the standard deviation of a
Gaussian prior for these parameters.  The hyperparameter itself has a
rather vague prior that spans several orders of magnitude around one.
The inverse gamma priors used are somewhat difficult to visualize,
because their tails are asymmetrical, but some standard choices are
often appropriate.  Here, the "0.5" after the colon controls how vague
the prior is (closer to zero is more vague).  The "0.05" specifies the
location of this vague distribution, but due to the asymmetry of the
tails, it is closer to being the lower limit of the prior than the
centre (for vague priors such as this).

The "x" in front of the prior for the hidden-to-output weights
indicates that the prior should be automatically rescaled based on the
number of hidden units, so as to produce an effect that is independent
of the number of hidden units (in the limit of large numbers).

Once the network has been specified, we need to say how the network
outputs will be used to model the targets (response variables) in the
data set.  We do this with the 'model-spec' command:

> model-spec rlog real 0.05:0.5

In this case, the targets are real-valued, and are modeled as the
network output plus Gaussian noise, with the noise standard deviation
being a hyperparameter having the prior given by the last argument of
the command.  The syntax of the prior specification is the same as
for the priors on network parameters.

You can view the architecture and prior specifications stored in the
log file by invoking 'net-spec' with just a log file argument.  In
this example, this should give the following result:

> net-spec rlog

Network Architecture:

Size of input layer:    1
Sizes of hidden layers: 8
Size of output layer:   1

Data model: real

Prior Specifications:

Hidden Layer 0

Input-Hidden Weights:    0.050:0.50
Hidden Biases:           0.050:0.50

Output Layer

Hidden0-Output Weights: x0.050:0.50
Input-Output Weights:    0.050:0.50
Output Biases:           100.000

You can also view the model specification by invoking 'model-spec'
with just one argument giving the log file.

Once the network and data model have been specified, we need to
specify the data sets to be used for training and (optionally) for
testing.  We do this using the 'data-spec' command:

> data-spec rlog 1 1 / rdata@1:100 . rdata@101:200 .
Number of training cases: 100
Number of test cases: 100

Here, "rlog" is the name of the log file we created with 'net-spec',
to which the data specifications will be appended.  The "1" and "1"
arguments give the numbers of inputs and targets.  These must be
consistent with the network architecture (if not, an error will be
reported later when you try to start the training).

After the "/", specifications for where to get the training and test
data are given.  Each such specification consists of two parts: the
source for the inputs, and the source for the targets.  The
specification "rdata@1:100" means that the training inputs come from
the file 'rdata', in lines 1 to 100, while the specification of
"rdata@101:200" for the test inputs indicates that they also come from
the file 'rdata', but in lines 101 to 200.  In the above command, the
sources for the targets are given as just ".", which means the target
items are on the same lines as the inputs, following the last input
item.  We could have said that the targets come from a completely
different file, however.  It is also possible to specify exactly where
on a line the inputs and targets are located (and hence to ignore some
items in the file).  For documentation on these and other features,
see numin.doc.

Though it is not done above, the 'data-spec' command also allows you
to specify transformations to be applied to the inputs or targets
before they are used.  This is useful, for example, if you wish to use
inputs that have been "normalized" to have mean zero and variance one,
based on the training data.  See data-spec.doc for the details.

In the training runs reported in the thesis, I used a short "initial
phase" to get things started, followed by a long "sampling phase" to
bring the simulation to equilibrium and then produce a sample of
networks from the posterior for use in prediction.  I still use the
same general procedure, but with some changes to how the initial phase
is done.

It seems desirable to start the simulation in a state where the
hyperparameters take on moderate values, and leave them fixed for a
few iterations so that the network parameters will also take on
moderate values.  This can be accomplished using the following
commands:

> net-gen rlog fix 0.5
> mc-spec rlog repeat 10 sample-noise heatbath hybrid 100:10 0.2
> net-mc rlog 1

The 'net-gen' command stores a network in the log file with index
zero, in which the hyperparameters have values of 0.5, and the network
parameters are zero.  This is the initial state of the simulation run.
The following 'mc-spec' command specifies the Markov chain operations
to be performed in the initial phase.  Here, each iteration consists
of ten repetitions of the following steps:  Gibbs sampling for the
noise level, a heatbath replacement of the momentum variables, and a
hybrid Monte Carlo update with a trajectory 100 leapfrog steps long,
using a window of 10, and a stepsize adjustment factor of 0.2.  Note
that the hyperparameters are not updated, and hence will remain fixed
at values of 0.5.  Finally, a single such iteration is done by calling
'net-mc' with an iteration limit of 1.

The stepsize adjustment factor of 0.2 used above is typical of what is
needed, but will not be appropriate in all circumstances.  After the
'net-mc' command has finished, the number of the 10 hybrid Monte Carlo
updates that were rejected can be determined using the command
'net-plt t r rlog', which will write the iteration number (of 1) and
the rejection rate on standard output.  If the rejection rate is high
(say, over 0.3), a new run should be done using a smaller stepsize
adjustment factor.  In the initial phase, one would generally start by
guessing a value for the stepsize adjustment factor that is on the low
side, since there is no point in optimizing this choice.

At this point, we hope to have a network stored in the log file (with
index 1) that has values for both the parameters and hyperparameters
that are of moderate magnitude, and which have adapted at least
somewhat to the training data.  We can now start serious sampling with
the following commands:

> mc-spec rlog sample-sigmas heatbath hybrid 1000:10 0.4
> net-mc rlog 100

The 'mc-spec' command appends a new set of Markov chain operations to
the log file, which will override the previous set.  These operations
are Gibbs sampling for both the hyperparameters and the noise level
(the "sigmas"), a heatbath update for the momentum variables, and a
hybrid Monte Carlo update with a trajectory 1000 leapfrog steps long,
a window of 10, and a stepsize adjustment factor of 0.4.  A long
trajectory length is typically desirable for the sampling phase.  As
in the initial phase, the stepsize adjustment factor of 0.4 used is
typical, but not universally applicable.  It may pay at this stage to
experiment in order to find out how large this factor can be while
keeping the rejection rate low.  The use of a "window" of around 10
states costs little and is often beneficial.

The 100 iterations of the sampling phase started with the command
'net-mc rlog 100' will take a while to complete (about six minutes on
our SGI machine).  While you wait, you can look at the last network
saved in the log file (or any earlier one) using 'net-display'.  For
example, after about a minute, you might see the following:

> net-display rlog

Network in file "rlog" with index 15

Input to Hidden Layer 0 Weights [1]

3.36 3.36: -1.24  +1.73  -4.04  -0.24  +2.49  -2.95  -1.56  -3.21

Hidden Layer 0 Biases [2]

1.90: -1.58  -3.59  +4.42  +0.85  -4.59  -3.69  +2.03  -0.40

Hidden Layer 0 to Output Weights [3]

0.76 0.76: -1.23

0.76: +1.04

0.76: +0.34

0.76: -0.38

0.76: -0.24

0.76: +0.80

0.76: +0.52

0.76: -0.56

Output Biases [4]

100.00: +1.10

Noise levels

0.08 -  0.08

This display of network parameters and hyperparameters is divided into
sections for different parameter groups.  Within each section, the
numbers before the colons are hyperparameters, those after are
parameters (weight and biases).  There are more hyperparameters shown
than were mentioned earlier, but for this network architecture, the
extra hyperparameters are either fixed in value (the 100 for output
biases), or tied to the value of a higher-level hyperparameter, so
they are effectively not present.

The parameter groups in the 'net-display' output are identified by
numbers in square brackets.  These can be used with the 'h', 'w',
and 'W' quantities of 'net-plt'.  For example, to see how the
hyperparameter controlling the hidden-to-output weights has changed
during the simulation (so far), one can use the command

> net-plt t h3 rlog | plot

where 'plot' is some suitable plot program.  (One can also just invoke
net-plt and look at the numbers printed on standard output.)  Here
'h3' refers to the top-level hyperparameter for group 3, which is seen
in the output of 'net-display' above to be the hidden-to-output group.

By looking at plots of the hyperparameters and quantities such as the
squared error on the training set ('b'), one can get an idea of when
the simulation has reached equilibrium.  Networks from that point on
can then be used to make predictions for test case using the
'net-pred' program.  Often, it will not be completely clear that
equilibrium has been reached until the simulation has been allowed to
proceed for quite a long time, but predictions based on shorter runs
may nevertheless be quite good.

For this problem, let's assume that we have decided to discard the
first 20 iterations as perhaps not coming from the equilibrium
distribution.  The following command will use the networks from the
remaining 80 iterations to produce predictions for all test cases, and
report the average squared error:

> net-pred itn rlog 21:

Number of networks used: 80

Case  Inputs  Targets  Means   Error^2

1    0.92    1.49     1.59  0.0093
2    0.71    1.83     1.79  0.0012
3    0.20    1.72     1.67  0.0021

( midde lines omitted )

98   -0.69    0.35     0.35  0.0000
99   -1.33    0.19     0.36  0.0277
100   -0.09    1.31     1.24  0.0052

Average squared error guessing mean:   0.01035+-0.00147

The options "itn" specified ask for a listing of the inputs ("i") and
targets ("t") for each case, along with the mean ("n") output for that
case of the 80 networks used for prediction.  The squared error when
using this mean to predict the target is shown for each case, and the
average squared error for the test cases is shown at the bottom, along
with its standard error with respect to the random selection of test
cases.  Considering that the average squared error with optimal
prediction is 0.01 (due to the noise of standard deviation 0.1 added
when generating the data), the network model has done quite well, as
one would hope it would on an easy problem such as this.

It is also possible to get predictions for cases that are not in the
test set that was specified with 'data-spec'.  For example:

> net-pred nb rlog 11: / "%echo 2.3"
1.37

Here, the options "nb" ask for only the predictive mean, with "bare"
output (no headings).  The argument at the end says that the inputs
for test cases (here, just one case) should be taken from the output
of the Unix command "echo 2.3", which just outputs the number 2.3.

A problem with a binary response

As a second example, I generated a data set with two real-valued
inputs, x1 and x2, and a binary target.  The inputs were drawn
independently from standard Gaussian distributions.  The target was
then set to "1" with the following probability:

exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 )

ie, the negative exponential of the fourth power of the distance of
(x1,x2) from (0.5,-0.3).  I generated 500 cases, and used the first
300 for training and the remaining 200 for testing.

For this problem, we can again try using a network with one layer of
hidden units (fifteen for this problem), and a single output unit.
For a binary target, a Gaussian data model would be inappropriate;
instead, we can use a logistic regression model, in which the
probability of the target being "1" is obtained by passing the value
of the output unit through the logistic function, f(x)=1/(1+exp(-x)).

The network, model, and data specification commands needed are quite
similar to those used in the regression problem above:

> net-spec blog 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100
> model-spec blog binary
> data-spec blog 2 1 2 / bdata@1:300 . bdata@301:500 .
Number of training cases: 300
Number of test cases: 200

The 'net-spec' command differs only in the number of input units (2),
the number of hidden units (15).  The data model used is "binary",
with no need for a noise level prior.

The 'data-spec' command also says that there are two inputs, and one
target.  It also has a third argument of "2" just before the "/",
which indicates that the target must be an integer with two possible
values (which are "0" and "1").

The initial phase commands that were used for the simple regression
problem turn out to be adequate for this problem as well:

> net-gen blog fix 0.5
> mc-spec blog repeat 10 sample-noise heatbath hybrid 100:10 0.2
> net-mc blog 1

For the sampling phase, we could also try using commands similar to
those presented above, but we might as well try something different
(the mc-spec command needed here is long, so it is continued by
putting a "\" at the end of the first line):

> mc-spec blog repeat 10 sample-sigmas heatbath 0.95 \
hybrid 100:10 0.3 negate
> net-mc blog 200

The above 'mc-spec' command specifies the variant of hybrid Monte
Carlo with "persistence" in which relatively short trajectories are
used, but random walks are suppressed by only partially replacing the
momentum in the 'heatbath' step.  Note that for this to work well, the
rejection rate must be quite small.  This alternative method has the
(with 'sample-sigmas').

On our SGI machine, the 200 iterations requested above take about 70
minutes.  During this time, one can monitor the simulation, for
instance with the command:

> net-plt t h1h2h3 blog | plot

where again, 'plot' is some appropriate plotting program, which in
this case must be capable of plotting three superimposed graphs, for
the three hyperparameters h1, h2, and h3.  The values for these
hyperparameters exhibit quite a high degree of autocorrelation, which
is why it is advisable to allow the simulation to go for 200
iterations, in order to be more confident that the simulation has
explored all high-probability regions of the posterior distribution.

Once the run has finished, we can make predictions using 'net-pred',
based, say, on the networks from the final 3/4 of the run (which is a
generally reasonable portion).  For a problem such as this, where the
response is binary, we are most likely interested in guessing whether
the target is "0" or "1", with performance measured by how often we
are right.  We can get such predictions as follows:

> net-pred itm blog 51:

Number of networks used: 150

Case  Inputs         Targets  Guesses  Wrong?

1   -1.56   0.90    0.00       0       0
2    0.09  -0.13    1.00       1       0
3   -0.79   0.85    0.00       0       0

( midde lines omitted )

198    1.49   0.70    0.00       0       0
199   -2.23  -0.28    0.00       0       0
200   -0.91  -0.03    0.00       0       0

Fraction of guesses that were wrong:  0.0900+-0.0203

The "itm" options ask for output listing the inputs, the targets, and
the guesses based on the mode of the predictive distribution.  A
summary is also printed that reports the fraction of guesses that were
wrong, along with the standard error for this estimate of the error
rate.

A three-way classification problem

As a final example, I created a synthetic data set for a three-way
classification problem.  Data items were generated by first drawing
quantities x1, x2, x3, and x4 independently from distributions uniform
over (0,1).  The class of the item, represented by "0", "1", or "2",
was then selected as follows:  if the two-dimensional Euclidean
distance of (x1,x2) from the point (0.4,0.5) was less than 0.35, the
class was set to "0"; otherwise, if 0.8*x1+1.8*x2 was less than 0.6,
the class was set to "1"; and if neither of these conditions held, the
class was set to "2".  Note that x3 and x4 have no effect on the
class.  The class selected by this procedure was the target value for
the network; the inputs available to the network were not x1, x2, x3,
and x4 themselves, however, but rather these four values plus Gaussian
noise of standard deviation 0.1.  I generated 1000 cases in this way,
of which 400 were used for training and 600 for testing.

This example will illustrate the "softmax" model for target values in
a small set, and the Automatic Relevance Determination (ARD) prior for
input-to-hidden weights.  The network and model specifications used
were as follows:

> net-spec clog 4 8 3 / - 0.05:0.5:0.5 0.05:0.5 - x0.05:0.5 - 0.05:0.5
> model-spec clog class

This specifies a network with 4 input units, 8 hidden units, and 3
output units.  The output units have identity activation functions,
but their values are used in a "softmax" data model, in which the
probability for target i is exp(o_i) / SUM_j exp(o_j), where the o_j
are the values of the output units.

The prior specified for input-to-hidden weights, "0.05:0.5:0.5", has
two "alpha" values, indicating that there is both a high-level
hyperparameter controlling the overall magnitude of input-to-hidden
weights, and lower-level hyperparameters for each input unit, which
control the magnitudes of weights out of each input.  If some of the
inputs are irrelevant (as are the third and fourth inputs in this
problem), we hope that the corresponding hyperparameters will take on
small values, forcing the weights out of these inputs to be close to
zero, and thereby avoiding any damage to predictive performance that
could result from the inclusion of these irrelevant inputs.

We need to use the following data specification for this problem:

> data-spec clog 4 1 3 / cdata@1:400 . cdata@401:1000 .
Number of training cases: 400
Number of test cases: 600

The arguments "4" and "1" are the numbers of input and target values.
The "3" before the "/" says that each target must be one of the
integers "0", "1", or "2".  This "3" must match the number of softmax
output units specified in the network architecture, or an error will
result when training is started.

The initial and sampling phases of the simulation can be done using
the same approach as was used above for the binary data set:

> net-gen clog fix 0.5
> mc-spec clog repeat 10 sample-noise heatbath hybrid 100:10 0.2
> net-mc clog 1
> mc-spec clog repeat 10 sample-sigmas heatbath 0.95 \
hybrid 100:10 0.3 negate
> net-mc clog 200

This takes about 75 minutes on our SGI machine.

We can see the effect of the ARD model by looking at the values of
the low-level input-hidden hyperparameters over the course of the
simulation, with the command:

> net-plt t h1@ clog | plot

The third and fourth hyperparameters are much smaller than the first
two, indicating the ARD has done its job.

Predictions for test cases based on the predictive mode can now be
done in the same way as for the binary response problem.  If we are
interested only in the estimated classification performance, we can
use the following command:

> net-pred ma clog 51:

Number of networks used: 150

Number of test cases: 600

Fraction of guesses that were wrong:  0.1367+-0.0140

Here, the "a" option is used to suppress everything except the summary.