EXAMPLES OF BAYESIAN NEURAL NETWORK IMAGE MODELS

One prominent area of application for neural network models is
classification of images.  Here, a fairly simple artificial example of
such an image classification problem is presented.  Several Bayesian
neural network models are applied to this task, including ones that
use convolutional connections.  Fitting such models by simple gradient
descent is also demonstrated.

These examples also illustrate how the training data can be divided in
several ways into a set of cases used for fitting, and a set of
validation cases, which can be used to assess performance.  Although
Bayesian methods do not require a validation set in theory, in
practice, some check on whether good performance has been achieved is
advisable before actually using predictions from a Bayesian model, to
guard against statistical mistakes, poor convergence of MCMC methods,
or, if nothing else, out-and-out bugs in the scripts or programs used.

The computation times given below were obtained running on a CPU, with
arithmetic done in single precision.  Running the calls of net-mc on a
GPU might reduce the time considerably.

The command files for this example are in the ex-image sub-directory.


The image classification problem used in these examples.

The program in igen.c generates artificial 6x6 monochrome images in
which one of the symbols "+", "X", "O", or "H" appears in some 3x3
patch in the image.  The classification task is to predict from the 36
pixels of an image which symbol appears in some patch in the image
(whose location could be any of the sixteen possibilities).  The file
'idata' has 21536 generated cases, the first 1536 of which are used
for training, and the remaining 20000 for testing.

Generation of an image starts by randomly generating background pixel
values from a Gaussian distribution with mean zero and standard
deviation one, independently for each of the 36 pixels.  A location
for a symbol is then randomly generated from the 16 possible positions.
(Note that the centre position of the symbol cannot be in the first or
last row, or the first or last column, leaving four possible row
positions and four possible column positions.)  The identity of the
symbol is also picked randomly from "+", "X", "O", and "H" (coded as
0, 1, 2, or 3), with each symbol having probability 1/4.  The pixels
in the 3x3 patch where the symbol is located are then replaced by the
following patterns for each symbol:

       "+"         "X"         "O"         "H"

    -1 +1 -1    +1 -1 +1    +1 +1 +1    +1 -1 +1
    +1 +1 +1    -1 +1 -1    +1 -1 +1    +1 +1 +1
    -1 +1 -1    +1 -1 +1    +1 +1 +1    +1 -1 +1

Finally, Gaussian noise with mean zero and standard deviation 0.5 is
added to all pixels, and the pixel values are rounded to three decimal
places.
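
As a concrete summary, here is a minimal C sketch of this generation
procedure (a paraphrase of the description above, not the actual code
in igen.c, whose details may differ):

  #include <stdlib.h>
  #include <math.h>

  #define PI 3.141592653589793

  static double gauss (void)           /* standard normal, by Box-Muller */
  { double u1 = (rand()+1.0) / (RAND_MAX+2.0);
    double u2 = (rand()+1.0) / (RAND_MAX+2.0);
    return sqrt(-2*log(u1)) * cos(2*PI*u2);
  }

  static const double sym[4][3][3] =   /* patterns for "+", "X", "O", "H" */
  { { {-1,+1,-1}, {+1,+1,+1}, {-1,+1,-1} },
    { {+1,-1,+1}, {-1,+1,-1}, {+1,-1,+1} },
    { {+1,+1,+1}, {+1,-1,+1}, {+1,+1,+1} },
    { {+1,-1,+1}, {+1,+1,+1}, {+1,-1,+1} }
  };

  static void gen_image (double pix[6][6], int *class, int *row, int *col)
  { int i, j;
    for (i = 0; i<6; i++)              /* background pixels from N(0,1) */
    { for (j = 0; j<6; j++) pix[i][j] = gauss();
    }
    *row = 1 + rand()%4;               /* centre row, 1 to 4 */
    *col = 1 + rand()%4;               /* centre column, 1 to 4 */
    *class = rand()%4;                 /* 0 "+", 1 "X", 2 "O", 3 "H" */
    for (i = -1; i<=1; i++)            /* replace 3x3 patch by symbol */
    { for (j = -1; j<=1; j++)
      { pix[*row+i][*col+j] = sym[*class][i+1][j+1];
      }
    }
    for (i = 0; i<6; i++)              /* add N(0,0.5^2) noise and round */
    { for (j = 0; j<6; j++)
      { pix[i][j] += 0.5*gauss();
        pix[i][j] = floor (pix[i][j]*1000 + 0.5) / 1000;
      }
    }
  }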

When passed an argument, the igen program prints each image, with the
true location and identity of the symbol, as well as the probabilities
for each symbol that can be inferred from the pixels using the true
model of how these images are generated.  Here are three of the images
generated:
  

  Case 2, Class +(0), Centred at 1,3, PP: +0.999 X0.000 O0.001 H0.000
  
        0      1      2      3      4      5       0 1 2 3 4 5
  0   -0.626 +1.056 -0.845 +0.762 -0.477 -0.837    O # O #   O
  1   +0.723 +0.971 +1.134 +1.040 +1.059 +0.147    # # # # #
  2   -1.095 -2.091 -0.671 +0.971 -0.649 +0.715    O O O # O #
  3   +0.787 -1.511 +1.770 +1.620 +0.221 -0.582    # O # #   O
  4   +0.535 -1.675 -0.308 -0.815 -0.786 +0.273    # O   O O
  5   -0.215 +0.335 -1.835 -0.286 +1.338 +0.336        O   #
    
    
  Case 11, Class X(1), Centred at 2,4, PP: +0.157 X0.843 O0.000 H0.000
  
        0      1      2      3      4      5       0 1 2 3 4 5
  0   -0.818 +0.064 -0.260 +0.159 +0.272 +0.009    O
  1   +0.236 +1.215 -0.756 -0.263 -1.089 +1.030      # O   O #
  2   -1.443 -1.099 +0.870 -0.965 +1.381 -1.830    O O # O # O
  3   -1.044 +1.340 +0.445 +1.512 -1.571 +0.600    O #   # O #
  4   -0.776 +0.444 +1.656 -1.832 +0.346 -0.823    O   # O   O
  5   -0.431 +0.744 +1.261 +3.750 -1.659 +1.955      # # # O #
  
  
  Case 60, Class O(2), Centred at 4,1, PP: +0.000 X0.000 O0.418 H0.582
  
        0      1      2      3      4      5       0 1 2 3 4 5
  0   +1.576 -1.255 +1.379 +1.325 -0.615 +0.105    # O # # O
  1   +0.734 -0.416 -1.230 -0.001 -1.100 +0.948    #   O   O #
  2   +0.837 -1.332 +0.800 -2.698 +1.139 +0.417    # O # O #
  3   +1.236 +0.820 +1.372 -0.339 +0.924 +0.754    # # #   # #
  4   +0.508 -1.105 +0.611 +2.562 -1.724 -1.193    # O # # O O
  5   +0.715 +1.017 +0.755 -0.284 -1.175 +1.185    # # #   O #


The diagrams at the right show the pixels of the image thresholded at
values less than -0.5 (O), greater than +0.5 (#), and between -0.5 and
+0.5 (space).  For the second image (case 11), noise in the pixels has
made the symbol less certain (probability 0.843 for the correct
symbol) than for the first image (case 2, with probability 0.999 for
the correct symbol), and for the third image (case 60), the correct
symbol has lower probability (0.418) than one of the incorrect
symbols ("H", with probability 0.582).  All these probabilities are
computed assuming the true generation model is known.
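
These probabilities follow from Bayes' rule.  Under the true
generation process, a pixel outside the symbol patch is N(0,1)
background plus N(0,0.25) noise, hence N(0,1.25), while a pixel in
the patch is its -1/+1 pattern value plus noise, hence N(-1,0.25) or
N(+1,0.25), and the 64 combinations of symbol and position are
equally likely a priori.  Here is a C sketch of the calculation (an
illustration of the math, not the actual code in igen.c), using the
sym array from the generation sketch above:

  #include <math.h>

  #define PI 3.141592653589793

  static double log_norm (double x, double mean, double var)
  { return - (x-mean)*(x-mean)/(2*var) - 0.5*log(2*PI*var);
  }

  /* Compute pp[c] = P(class is c | pix) under the true model.  The
     background densities for pixels outside the patch are the same
     for every hypothesis, so only the 9 patch pixels, via their
     likelihood ratios, need be considered. */

  static void class_probs (double pix[6][6], double pp[4])
  { double total = 0;
    int c, r, k, i, j;
    for (c = 0; c<4; c++) pp[c] = 0;
    for (c = 0; c<4; c++)              /* sum over symbols ... */
    { for (r = 1; r<=4; r++)           /* ... and over patch positions */
      { for (k = 1; k<=4; k++)
        { double l = 0;                /* log likelihood ratio */
          for (i = -1; i<=1; i++)
          { for (j = -1; j<=1; j++)
            { l += log_norm (pix[r+i][k+j], sym[c][i+1][j+1], 0.25)
                 - log_norm (pix[r+i][k+j], 0.0, 1.25);
            }
          }
          pp[c] += exp(l);
        }
      }
    }
    for (c = 0; c<4; c++) total += pp[c];
    for (c = 0; c<4; c++) pp[c] /= total;
  }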

The igen program computes the performance on test cases when using the
true model, which is an upper limit on what can be accomplished when
learning a model from the training cases (unless one is just lucky).
Here is the summary:

  Error rate on test cases with true model: 0.069
  Average squared error for test cases with true model: 0.100
  Average log probability for test cases with true model: -0.174


A fully-connected Bayesian neural network model.

We can first try a neural network model that has no built-in knowledge
of how the 36 pixel values are arranged in a 6x6 image, or any
specific knowledge that the symbol to be detected is in a 3x3 patch of
this image.  The model simply feeds the 36 inputs into a hidden layer
with 80 "softplus" units, which in turn feed into an output layer with
4 units, which are used to define the class probabilities (using the
"softmax" scheme), as in the classification example in Ex-netgp-c.doc.

This model can be specified with the following commands:

  net-spec $log 36 80 softplus 4 \
                / ih=0.05:3::1.5 bh=0.3:2 ho=0.1:3::4 bo=0.1:2
  model-spec $log class

Here, $log represents the name of the log file.  The net-spec command
specifies that there are 36 inputs, a hidden layer of 80 units with
the "softplus" activation function, and 4 output units used to
produce class probabilities using the "softmax" scheme.

The prior specification for input-to-hidden weights, of 0.05:3::1.5,
does not allow for different inputs to have different relevance (there
is no number between the second and third colons, so there is no
variation between input groups), since we will assume that it is known
that the inputs are almost equally relevant (since the symbol to be
recognized might occur anywhere in the image).  The final 1.5 in the
prior specification gives the input-to-hidden weights a heavy-tailed
prior (a t distribution with 1.5 degrees of freedom), allowing for the
possibility that the input-to-hidden connections are effectively
sparse, with a hidden unit looking mostly at only a few of the pixel
values.  The hidden-to-output weights also have a (less extreme)
heavy-tailed prior, allowing for some of the hidden units to have
small weights to some of the output units.
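
For intuition about why such t priors behave this way, note that
their log density decays only logarithmically in the weight (compare
the quadratic decay for a Gaussian), so occasional very large weights
are not heavily penalized.  A small C sketch of the unnormalized log
density:

  #include <math.h>

  /* Log density of a t distribution with nu degrees of freedom and
     width sigma, up to an additive constant.  For nu = 1.5 the tails
     are heavy enough that a few weights can be very large while most
     stay small -- effectively sparse connectivity. */

  double t_log_density (double w, double nu, double sigma)
  { return - (nu+1)/2 * log (1 + (w*w)/(nu*sigma*sigma));
  }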

A scheme allowing cross-validation assessments of performance is used,
in which four training runs are done, each on 3/4 of the training
data, with the omitted 1/4 allowing for monitoring of performance (to
see, for example, whether it is advisable to do further MCMC
iterations).  Of course, making such assessments based on the test
data would be "cheating" (since in real life, the test targets, and
probably the test inputs as well, are not available when training).

The data specification used is therefore as follows:

  data-spec $log 36 1 4 / idata-train@-$start:$end . idata-train@$start:$end .

Here $start and $end represent the range of training cases comprising
the validation set (here specified as the "test" set), with the
remainder of the 1536 training cases used for fitting the model.  In
the four runs that are done, $start and $end will be set to 1 and 384,
385 and 768, 769 and 1152, and 1153 and 1536.  (These runs use the
log files ilog1.net1, ilog1.net2, ilog1.net3, and ilog1.net4.)
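
The pattern behind these ranges is simple, as this trivial C sketch
shows (just an illustration of how $start and $end are chosen; the
command scripts set them directly):

  /* Validation range for run number 'run' (1 to 4): each run holds
     out a contiguous quarter of the 1536 training cases. */

  void fold_range (int run, int *start, int *end)
  { *start = (run-1)*384 + 1;          /* 1, 385, 769, 1153 */
    *end = run*384;                    /* 384, 768, 1152, 1536 */
  }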

When final predictions for test cases are made, the "test" data
specified in this data-spec command will be overridden by the actual
test data, which is in idata-test.

The numbers in the data-spec command specify that there are 36 inputs,
and 1 target, which has 4 possible values (coded as 0, 1, 2, 3).

With the network model and data specified, training is done as follows:

  rand-seed $log $run

  net-gen $log fix 0.1

  mc-spec $log repeat 120 heatbath hybrid 200:20 0.05
  net-mc $log 1

  mc-spec $log repeat 60 sample-sigmas heatbath hybrid 400:40 0.07
  net-mc $log 2

  mc-spec $log repeat 24 sample-sigmas heatbath 0.9 hybrid 1000:100 0.09 negate
  net-mc $log 1500

Here, $run is 1, 2, 3, or 4, identifying which of the four runs this
is, which affects the random number seed used (and the log file name).
The hyperparameters are initialized to the fairly small value of 0.1,
and 120 HMC updates are done with the hyperparameters fixed at this
value, so that the parameters will take on reasonable values for the
data before trying to update the hyperparameters.  After this, 60 HMC
updates are done with hyperparameters updated before each HMC update,
using a moderate number (400) of leapfrog steps.  This is intended to
get the hyperparameters to reasonable values (given how well the data
has been fitted so far).  The main sampling iterations are then done,
in which updates for the hyperparameters alternate with HMC iterations
using a longer trajectory of 1000 leapfrog steps.  Each of the
1498 saved states from this main sampling phase are the result of 24
such pairs of hyperparameter updates and HMC updates for parameters.

The HMC updates all use the "windowed" method, to increase acceptance
rate, with the windows being 1/10 of the trajectory.  Relatively small
leapfrog stepsizes are used initially, to guard against rejection
rates being high when starting at an atypical location.  To further
reduce random walk behaviour, without reducing the number of
hyperparameter updates, the "persistent momentum" version of HMC is
used for the final sampling phase.
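
As a rough sketch of what this does to each momentum variable p
(following Horowitz's persistent variant of HMC; FBM's windowed
trajectories and the rest of the state are omitted):

  #include <math.h>

  extern double gauss (void);          /* standard normal generator */

  /* One persistent-momentum cycle, roughly.  The 0.9 factor is the
     "heatbath 0.9" decay, and the final negation corresponds to the
     "negate" operation: together with the momentum reversal implicit
     in rejection, it lets accepted trajectories keep moving in a
     consistent direction, suppressing random-walk behaviour. */

  double persistent_cycle (double p)
  { p = 0.9*p + sqrt(1 - 0.9*0.9) * gauss();   /* partial refresh */
    /* ... windowed HMC trajectory with accept/reject goes here ... */
    return -p;                                 /* negate momentum */
  }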

Each full iteration of this sampling procedure takes about 27 seconds
on the system used (Ex-test-system.doc), for a total of about 690
minutes (about 11.5 hours) for the 1500 iterations done.  The command
script in ex-image/icmds1.net does all four runs in parallel, so this
will be the total wall-clock time if the system used has at least four
processor cores.

Various plots can be made to check whether these runs are sampling (or
have sampled) well, or whether more iterations (or a better scheme)
are needed.  The performance of the HMC updates can be checked by
plotting the rejection rates for each iteration:

  net-plt t r ilog1.net? | plot-points

As seen in ilog1-r.png (with different colours for each run), most of
the iterations have rejection rates below 0.2, but not very close to
0, with no apparent trend, so the stepsize seems reasonable.  The
average rejection rate can be found as follows:

  net-tbl r ilog1.net? | series m

which gives a mean of 0.032225, small enough that there should not be
much problem with trajectory reversals from rejections when using
persistent momentum with factor 0.9 (which randomizes momentum on a
time scale of about 1/(1-0.9) = 10 iterations, less than 1/0.032225).

The fit (squared error) for the training data can be checked with

  net-plt t b ilog1.net? | plot-points

As seen in ilog1-b.png, all four runs seem to reach the equilibrium
distribution for this value in 100 iterations or less.  (The
equilibrium distributions differ slightly between runs because they
have different training sets.)

The squared error for each iteration on the validation cases can be
seen with

  net-plt t B ilog1.net? | plot-points

As seen in ilog1-BB.png, this quantity also seems to have reached
equilibrium in less than 100 iterations.  

For both the 'b' and 'B' quantities, one can see some long-lag
autocorrelation, indicating that longer runs would be needed to get
really good samples from the posterior distributions.  But the run
length done here seems adequate for purposes of producing predictions.

Note that although the average squared error on validation cases is
useful for seeing how well the Markov chain is sampling, the error
from a single iteration does NOT tell one how good the predictions
found by averaging many iterations would be on these validation
cases.  Poor predictions from a single iteration may be the result of
the model producing high-confidence predictions at each iteration
that vary from iteration to iteration, so that the averaged
prediction has much lower confidence (and perhaps much better
performance).  So there is not necessarily anything wrong when the
error on validation cases from individual networks is high.
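
As a toy illustration in C, two iterations that each make confident
but conflicting predictions average to a prediction that is much less
confident (and less badly wrong):

  /* Two confident single-iteration predictive distributions that
     disagree about the class average to a hedged distribution. */

  void average_example (void)
  { double p1[4] = { 0.98, 0.01, 0.005, 0.005 };   /* iteration 1 */
    double p2[4] = { 0.01, 0.98, 0.005, 0.005 };   /* iteration 2 */
    double avg[4];
    int k;
    for (k = 0; k<4; k++) avg[k] = (p1[k]+p2[k]) / 2;
    /* avg is { 0.495, 0.495, 0.005, 0.005 }: each individual
       prediction scores very badly when it picks the wrong class,
       but the average scores tolerably for either true class. */
  }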

We can look at the main hyperparameters (standard deviations for
input-hidden weights, hidden biases, and hidden-output weights) at
each iteration of the first run as follows:

  net-plt t h1h2h3 ilog1.net1 | plot-points-log-scale

This data is best plotted on a log scale, because the three
hyperparameters differ greatly in magnitude.  The plot is in
ilog1-h1h2h3.png.  Note that h1 (input-to-hidden) is red, h2 (hidden
biases) is green, and h3 (hidden-to-output) is blue.  From this plot,
equilibrium seems to have been reached after about 300 iterations.
  
Once all runs have finished, we can assess performance on the actual
test set as follows (with the first 500 iterations discarded as
"burn-in", which seems more than sufficient given the plots above):

  net-pred mnpa ilog1.net1 501: ilog1.net2 501: \
                ilog1.net3 501: ilog1.net4 501: / idata-test .

The output is as follows:

  Number of iterations used: 4000

  Number of test cases: 20000

  Average log probability of targets:    -1.058+-0.004
  Fraction of guesses that were wrong:    0.4704+-0.0035
  Average squared error guessing mean:    0.58624+-0.00240

Only slightly worse results would be obtained using runs only one
tenth as long (150 iterations, taking about an hour).  The command

  net-pred mnpa ilog1.net1 71:150 ilog1.net2 71:150 \
                ilog1.net3 71:150 ilog1.net4 71:150 / idata-test .

produces the following output:

  Number of iterations used: 320

  Number of test cases: 20000

  Average log probability of targets:    -1.061+-0.004
  Fraction of guesses that were wrong:    0.4733+-0.0035
  Average squared error guessing mean:    0.58784+-0.00240

Note that guessing by ignoring the inputs and assigning equal
probabilities to the four classes gives an average log probability of
-1.386, an error rate of 0.75, and an average squared error of 0.75.
So this model performs significantly better than random guessing.
However, the error rate of 47% is considerably more than the 6.9%
error rate that is achievable with knowledge of the true generating
process, so there is much room for improvement.
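
These baseline figures can be checked directly, assuming (as the
numbers above suggest) that the squared error is measured between the
predicted probabilities and a 0/1 coding of the target class:

  #include <stdio.h>
  #include <math.h>

  /* Baseline performance when predicting probability 1/4 for every
     class, regardless of the inputs. */

  int main (void)
  { double logprob = log(0.25);                       /* -1.386 */
    double errrate = 3.0/4;                           /* 0.75 */
    double sqerr = (1-0.25)*(1-0.25) + 3*(0.25*0.25); /* 0.75 */
    printf ("%.3f %.2f %.2f\n", logprob, errrate, sqerr);
    return 0;
  }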


A convolutional Bayesian neural network model.

Supposing that we know that the symbol to be recognized is a 3 pixel
by 3 pixel patch in a 6 by 6 image, and that this patch may appear at
any position in the image, it makes sense to use a convolutional
network, which uses several filters looking at 3x3 patches of the
image.  Each such filter is defined by the 9 weights on connections
from a 3x3 patch at some position in the image to a hidden unit, with
the same weights used at all possible positions.  We might expect
that four filters, one for each class, would be sufficient, but the
network defined here allows for five filters (computing five output
channels), which may make it easier to find the four needed, since a
"spare" filter may make it easier to escape modes that are only
locally optimal.
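
Concretely, the hidden unit that applies filter f at patch centre
(r,c) might compute the following (a C sketch with illustrative array
shapes, not FBM's internal code):

  #include <math.h>

  /* Value of the hidden unit for filter f at patch centre (r,c),
     with r and c from 1 to 4.  The same 9 weights w[f] are used at
     every one of the 16 positions; b is this unit's bias. */

  double conv_unit (double pix[6][6], double w[5][3][3], double b,
                    int f, int r, int c)
  { double a = b;
    int i, j;
    for (i = -1; i<=1; i++)
    { for (j = -1; j<=1; j++)
      { a += w[f][i+1][j+1] * pix[r+i][c+j];
      }
    }
    return log (1+exp(a));             /* softplus activation */
  }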

The simplest model of this type has a single hidden layer with 80
hidden units (5 filters/channels, each needing 16 units for all
possible symbol positions), which connect to the four output units
used to define class probabilities.  Such a network model can be
specified as follows:

  net-spec $log 36 80 softplus 4 \
                / ih=0.2:2 config:iconfig bh=0.3:2 ho=0.1:3::4 bo=0.1:2
  model-spec $log class

The "config:iconfig" option after the specification of the prior for
input-to-hidden weights says that the configuration of these weights
is specified in the file "iconfig", whose contents are as follows:

  # Configuration for convolutional layer connections for model of test images.

  D=6  # image width and height
  P=3  # patch width and height
  F=5  # number of filters

  [ 1 1 1 ]

  D-P+1{  # Loop over vertical positions of patch

    D-P+1{  # Loop over horizontal positions of patch

      P{  # Loop over rows of patch

        P{  # Loop over pixels within row of patch

            = = =   F-1{ = + + }  # Connections for all filters 

            [ + = +F ]
        }  

        [ +D = = ]   P[ = = +F ]
      }

      [ + +F 1 ]
    }

    [ +D = 1 ]   D-P+1[ = +F 1 ]
  }

See net-config.doc for an explanation of the specification language
used above.  The effect of this specification is to set up the
convolutional connections, with 45 weights defining 5 filters (each
looking at 9 pixels), which are shared by 16x9x5=720 connections
(which is less than the 36x80=2880 possible connections).

The other commands used are almost the same as for the fully-connected
network model above, except that stepsizes have been increased a bit.

Each full iteration takes about 32 seconds on the system used
(Ex-test-system.doc), for a total of 13.4 hours (with the four runs
done in parallel).  This is actually a bit slower than the time for
the fully-connected network - there are fewer input-to-hidden
connections to handle, but in this case, that advantage is more than
offset by the overhead of processing the non-contiguous configuration
of connections.

As above, we can plot the 'r' (rejection rate), 'b' (squared error on
training data), and 'B' (squared error on validation data) quantities
for the four runs using net-plt.  The results can be seen as
ilog2-r.png, ilog2-b.png, and ilog2-BB.png.  The main hyperparameter
values for the first run can also be plotted, as above, giving the
result in ilog2-h1h2h3.png.

Once all runs have finished, we can assess performance on the actual
test set as follows:

  net-pred mnpa ilog2.net1 501: ilog2.net2 501: \
                ilog2.net3 501: ilog2.net4 501: / idata-test .

Here, the first 500 iterations are discarded as "burn-in".  This seems
mostly sufficient given the plots above, except that the fourth run
may not have reached equilibrium until around iteration 700.

The output is as follows:

  Number of iterations used: 4000

  Number of test cases: 20000

  Average log probability of targets:    -0.731+-0.006
  Fraction of guesses that were wrong:    0.2661+-0.0031
  Average squared error guessing mean:    0.39534+-0.00321

Only slightly worse results would be obtained using runs only half
as long.  The command

  net-pred mnpa ilog2.net1 301:750 ilog2.net2 301:750 \
                ilog2.net3 301:750 ilog2.net4 301:750 / idata-test .

produces the following output:

  Number of iterations used: 1800

  Number of test cases: 20000

  Average log probability of targets:    -0.743+-0.005
  Fraction of guesses that were wrong:    0.2698+-0.0031
  Average squared error guessing mean:    0.40237+-0.00298

These results are considerably better than were obtained above when
using a fully-connected network, demonstrating the advantage of using
a convolutional network when, as here, it is known that it fits the
structure of the problem.

The role of the convolutional connections can be confirmed by looking
at the weights for each of the five filters as 3x3 "receptive fields".
This can be done with the "idisp" shell file (which uses R).  The
output for the last iteration of the first run is as follows:

  , , 1
  
       [,1] [,2] [,3]
  [1,]  4.0  3.4  4.5
  [2,]  5.3 -5.3  4.5
  [3,]  4.1  5.7  5.2
  
  , , 2
  
       [,1] [,2] [,3]
  [1,]  5.0 -6.0  3.4
  [2,]  3.9  3.5  4.7
  [3,]  4.4 -2.0  4.1
  
  , , 3
  
       [,1] [,2] [,3]
  [1,] -4.3  4.8 -4.2
  [2,]  3.7  5.3  3.4
  [3,] -3.2  3.7 -2.7
  
  , , 4
  
       [,1] [,2] [,3]
  [1,]  6.9 -2.4  5.1
  [2,] -5.5  4.0 -3.8
  [3,]  4.4 -4.2  2.7
  
  , , 5
  
       [,1] [,2] [,3]
  [1,]  0.8 -3.6  3.2
  [2,] -3.9  1.5 -3.6
  [3,]  3.9 -1.9  3.0

One can see that the first filter is sensitive to the "O" symbol, the
second to the "H" symbol, the third to the "+" symbol, and the fourth
and fifth to the "X" symbol.


Adding another hidden layer, with more assumptions.

The 26.6% error rate for the model with one convolutional hidden layer
is still not close to the 6.9% error rate obtained using the true
generative model.  We can try adding a second hidden layer to allow
more complex computations that could allow for closer to optimal
performance.  (Though note that performance is also limited by the
amount of training data, not just by the capabilities of the network
used.)

The architecture used has convolutional connections from the inputs to
the first hidden layer, as in the model above, and this hidden layer
is then fully connected to a second hidden layer with 16 tanh units.
The second hidden layer connects to the output units.

In this model, the biases on the first layer of hidden units will be
constrained to be the same for all positions in the image (ie, there
are only 5 bias parameters, one for each of the filters).  Similarly,
the connections from units in the first hidden layer to the second
hidden layer will be the same for all positions.  These constraints
are based on the assumption that the pattern being looked for is
equally likely to occur anywhere in the image.
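
In C-like terms, a unit in the second hidden layer then computes
something like the following (a sketch with illustrative array
shapes), with the weight v[f][u] depending on the filter f but not on
the patch position:

  #include <math.h>

  /* Value of unit u in the second hidden layer, given the first
     hidden layer values h[f][r][c] (filter f applied at patch
     position (r,c)).  The weight v[f][u] is shared across the 16
     positions; b2 is the bias for unit u. */

  double second_layer_unit (double h[5][4][4], double v[5][16],
                            double b2, int u)
  { double a = b2;
    int f, r, c;
    for (f = 0; f<5; f++)
    { for (r = 0; r<4; r++)
      { for (c = 0; c<4; c++)
        { a += v[f][u] * h[f][r][c];   /* same weight at all positions */
        }
      }
    }
    return tanh(a);
  }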

This network model is specified as follows:

  net-spec $log 36 80 softplus 16 4 \
    / ih=0.2:2 config:iconfig bh0=0.3:2 config:bconfig \
      hh=0.1:2 config:h2config bh1=0.3:2 ho=0.1:2 bo=0.1:2

The bconfig file sets up the biases to be the same for units using the
same filter, as follows:

  # Configuration for convolutional layer biases for model of test images.

  D=6  # image width and height
  P=3  # patch width and height
  F=5  # number of filters

  D-P+1(  # Loop over vertical positions of patch
    D-P+1(  # Loop over horizontal positions of patch
      + 1   F-1( + + )  # biases for all filters
    )
  )

The h2config file specifies that weights on connections from the first
to second hidden layer are shared for units in the first hidden layer
that are for the same filter:

  # Configuration for hidden-hidden connections for model of test images.

  D=6  # image width and height
  P=3  # patch width and height
  F=5  # number of filters
  U=16 # number of units in second hidden layer

  D-P+1(  # Loop over vertical positions of patch
    D-P+1(  # Loop over horizontal positions of patch
      [ = 0 0 ]
      F(          # Loop over filters
        [ + 0 = ]
        U( = + + )  # Loop over units in next hidden layer
      )
    )
  )

The mc-spec and other commands are the same as for the previous two
networks, except for stepsize adjustments.  

The 1500 iterations for the four runs take 20.1 hours on the system
used (Ex-test-system.doc).  Plots of the rejection rate, training set
error, validation set error, and hyperparameter values (for the first
run) can be seen in ilog3-r.png, ilog3-b.png, ilog3-BB.png, and
ilog3-h1h2h3h4h5.png.  

The results on the test set using iterations 501 to 1500 of the four
runs are as follows:

  Number of iterations used: 4000

  Number of test cases: 20000

  Average log probability of targets:    -0.554+-0.005
  Fraction of guesses that were wrong:    0.2198+-0.0029
  Average squared error guessing mean:    0.30820+-0.00323

The 22.0% error rate seen here is a considerable improvement over the
26.6% error rate with the simple convolutional network.

Almost identical results would be obtained using runs only a tenth as
long (about two hours).  Using iterations 71 to 150 of each run
produces these results:

  Number of iterations used: 320

  Number of test cases: 20000

  Average log probability of targets:    -0.556+-0.006
  Fraction of guesses that were wrong:    0.2178+-0.0029
  Average squared error guessing mean:    0.30891+-0.00324