A PROBLEM WITH A BINARY RESPONSE

As a second example, I generated a data set with two real-valued
inputs, x1 and x2, and a binary target.  The inputs were drawn
independently from standard Gaussian distributions.  The target was
then set to "1" with the following probability:

    exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 )

ie, the negative exponential of the fourth power of the distance of
(x1,x2) from (0.4,-0.3).  I generated 500 cases, and used the first
300 for training and the remaining 200 for testing.
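
For concreteness, the generation process can be sketched in a few
lines of Python (a minimal sketch; the file name "bdata" and the
one-case-per-line layout of inputs followed by the target match the
'data-spec' commands used below, but the seed is arbitrary, so the
exact data will differ):

    import numpy as np

    rng = np.random.default_rng(1)

    n = 500
    x1 = rng.standard_normal(n)        # inputs from standard Gaussians
    x2 = rng.standard_normal(n)

    # P(target = 1) = exp( - ((x1-0.4)^2 + (x2+0.3)^2)^2 )
    p = np.exp(-((x1 - 0.4)**2 + (x2 + 0.3)**2)**2)
    t = (rng.random(n) < p).astype(int)

    # first 300 cases for training, last 200 for testing
    np.savetxt("bdata", np.column_stack([x1, x2, t]),
               fmt="%8.3f %8.3f %d")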


A neural network model for the binary response problem.

For this problem, we can again try using a network with one layer of
hidden units (fifteen for this problem), and a single output unit.
For a binary target, a Gaussian data model would be inappropriate;
instead, we can use a logistic regression model, in which the
probability of the target being "1" is obtained by passing the value
of the output unit through the logistic function, f(x)=1/(1+exp(-x)).
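
Concretely, if o is the value of the output unit for a given case,
the likelihood for that case is Bernoulli with success probability
f(o).  A small Python sketch of this link (illustrative only, not
part of the software):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def log_likelihood(o, target):
        """Log probability of a binary target given network output o:
        log logistic(o) if the target is 1, log (1 - logistic(o)) if
        it is 0, written in a numerically stabler form."""
        return target * o - np.log1p(np.exp(o))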

The network, model, and data specification commands needed are quite
similar to those used in the regression problem above:

    > net-spec blog.net 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 
    > model-spec blog.net binary
    > data-spec blog.net 2 1 2 / bdata@1:300 . bdata@301:500 .
    Number of training cases: 300
    Number of test cases: 200

The 'net-spec' command differs only in the number of input units (2)
and the number of hidden units (15).  The 'model-spec' command now
specifies the "binary" data model, for which no noise level prior is
needed.

The 'data-spec' command again says that there are two inputs and one
target.  It also has a third argument of "2" just before the "/",
which indicates that the target must be an integer with two possible
values ("0" and "1").

The initial phase commands that were used for the simple regression
problem turn out to be adequate for this problem as well:

    > net-gen blog.net fix 0.5
    > mc-spec blog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc blog.net 1

For the sampling phase, we could also try using commands similar to
those presented above, but we might as well try something different
(the mc-spec command needed here is long, so it is continued by
putting a "\" at the end of the first line):

    > mc-spec blog.net repeat 10 sample-sigmas heatbath 0.95 \
                             hybrid 100:10 0.3 negate
    > net-mc blog.net 200

The above 'mc-spec' command specifies the variant of hybrid Monte
Carlo with "persistence" in which relatively short trajectories are
used, but random walks are suppressed by only partially replacing the
momentum in the 'heatbath' step.  Note that for this to work well, the
rejection rate must be quite small.  This alternative method has the
advantage that it allows for more frequent hyperparameter updates
(with 'sample-sigmas').
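
The heart of the persistence scheme is easy to sketch: rather than
drawing a completely fresh momentum before every trajectory, the
'heatbath 0.95' operation replaces only a small part of it, mixing
the old momentum with Gaussian noise.  A schematic Python rendering
of this update (an illustration of the idea, not the actual
implementation):

    import numpy as np

    def partial_refresh(p, alpha=0.95, rng=np.random.default_rng()):
        """Partial momentum refreshment: most of the old momentum is
        retained, so successive short trajectories keep moving in a
        consistent direction instead of doing a random walk.  A
        rejection (followed by 'negate') reverses this direction,
        which is why the rejection rate must be kept small."""
        return alpha * p + np.sqrt(1 - alpha**2) \
                         * rng.standard_normal(p.shape)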

On our SGI machine, the 200 iterations requested above take about 45
minutes.  During this time, one can monitor the simulation, for
instance with the command:

    > net-plt t h1h2h3 blog.net | plot

where again, 'plot' is some appropriate plotting program, which in
this case must be capable of plotting three superimposed graphs, for
the three hyperparameters h1, h2, and h3.  The values for these
hyperparameters exhibit quite a high degree of autocorrelation, which
is why it is advisable to run the simulation for 200 iterations, to
be more confident that it has explored all the high-probability
regions of the posterior distribution.
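
If you want to quantify the autocorrelation after the run, one simple
check is to estimate the lag-k autocorrelations of a hyperparameter's
trace (an illustrative sketch, assuming the trace has been extracted
as a list of numbers; this computation is not part of the software):

    import numpy as np

    def autocorr(trace, max_lag=50):
        """Sample autocorrelation of a scalar MCMC trace at lags
        1..max_lag; values decaying slowly toward zero indicate that
        many iterations are needed per effectively independent
        draw."""
        x = np.asarray(trace) - np.mean(trace)
        var = np.dot(x, x) / len(x)
        return np.array([np.dot(x[:-k], x[k:]) / (len(x) * var)
                         for k in range(1, max_lag + 1)])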

Once the run has finished, we can make predictions using 'net-pred',
based, say, on the networks from the final 3/4 of the run (which is a
generally reasonable portion).  For a problem such as this, where the
response is binary, we are most likely interested in guessing whether
the target is "0" or "1", with performance measured by how often we
are right.  We can get such predictions as follows:

    > net-pred itm blog.net 51: 

    Number of networks used: 150

    Case  Inputs         Targets  Guesses  Wrong?

       1   -1.56   0.90    0.00       0       0
       2    0.09  -0.13    1.00       1       0
       3   -0.79   0.85    0.00       0       0
       
               ( middle lines omitted )
    
     198    1.49   0.70    0.00       0       0
     199   -2.23  -0.28    0.00       0       0
     200   -0.91  -0.03    0.00       0       0

    Fraction of guesses that were wrong:  0.0900+-0.0203

The "itm" options ask for output listing the inputs, the targets, and
the guesses based on the mode of the predictive distribution.  A
summary is also printed that reports the fraction of guesses that were
wrong, along with the standard error for this estimate of the error
rate.
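
The standard error quoted is consistent with the usual binomial
formula: with 18 of the 200 guesses wrong, the error rate is 0.09,
and sqrt(0.09 * 0.91 / 200) is about 0.0203.  In Python:

    import numpy as np

    n, wrong = 200, 18
    p = wrong / n                      # 0.09
    se = np.sqrt(p * (1 - p) / n)      # ~0.0203, as reported above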


A Gaussian process model for the binary response problem.

This binary data can be modeled using a Gaussian process as well.  The
Gaussian process specifies a distribution over real-valued functions.
The function value for the inputs in a particular case can be passed
through the logistic function, f(x) = 1/(1+exp(-x)), to give the
probability that the target in that case will be '1' rather than '0'.

The specifications for a Gaussian process model analogous to the
network model used above are as follows:

    > gp-spec blog.gpl 2 1 1 - 0.1 / 0.05:0.5 0.05:0.5
    > model-spec blog.gpl binary
    > data-spec blog.gpl 2 1 2 / bdata@1:300 . bdata@301:500 .
    Number of training cases: 300
    Number of test cases: 200

The 'gp-spec' command used here does not quite correspond to the
'net-spec' command used above.  The following command would give a
closer correspondence:

    > gp-spec blog.gpl 2 1 100 / 0.05:0.5 0.05:0.5

The reason for modifying this, by reducing the constant part of the
covariance (from 100 to 1) and introducing a small (0.1) "jitter"
part, is to solve computational problems of poor conditioning and slow
convergence.  The "jitter" part lets the function value vary randomly
from case to case, as if a small amount of Gaussian noise were added.
This makes the function values less tightly tied to each other, easing
the computational difficulties (with the unmodified specification,
'gp-mc' would probably work only if the dataset were very small).
These changes have little effect on the statistical properties of the
model, at least for this problem.
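
The numerical effect of the jitter is easy to demonstrate: adding the
jitter variance to the diagonal of the covariance matrix bounds its
smallest eigenvalue away from zero, which can reduce its condition
number enormously.  A sketch with made-up covariances (assuming, as
is conventional, that a jitter of 0.1 contributes 0.1^2 to each
diagonal element):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((300, 2))         # 300 training inputs
    d2 = ((x[:, None, :] - x[None, :, :])**2).sum(-1)
    K = 1.0 + np.exp(-d2)     # constant part 1 plus exponential part

    print(np.linalg.cond(K))                # very poorly conditioned
    print(np.linalg.cond(K + 0.1**2 * np.eye(300)))   # far better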

Sampling might now be done using the following commands:

    > gp-gen blog.gpl fix 0.5 1
    > mc-spec blog.gpl repeat 4 scan-values 200 heatbath hybrid 6 0.5
    > gp-mc blog.gpl 50

These commands (and variations on them) are similar to those that you
might use for a regression problem, except that a "scan-values 200"
operation has been included.  This operation performs 200 Gibbs
sampling scans over the function values for each training case.  
An explicit representation of function values is essential for a
binary model, though not for a simple regression model, because the
relationship of function values to targets is not Gaussian.
Unfortunately, for the very same reason, the direct "sample-values"
operation is not available, so Gibbs sampling must be done.
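
Schematically, each scan visits the latent function value for every
training case in turn, updating it from its conditional distribution
given the other values and that case's target.  The following is a
hedged sketch of one such scan, using a simple Metropolis step as a
stand-in for the exact single-value draw that 'scan-values' performs:

    import numpy as np

    def scan(f, K_inv, t, rng, width=0.5):
        """One scan over the latent function values f, given the
        inverse prior covariance K_inv and binary targets t."""
        for i in range(len(f)):
            g = f.copy()
            g[i] += width * rng.standard_normal()
            # only the prior term and case i's likelihood change
            def log_post(v):
                return (-0.5 * v @ K_inv @ v
                        + t[i] * v[i] - np.log1p(np.exp(v[i])))
            if np.log(rng.random()) < log_post(g) - log_post(f):
                f = g
        return f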

The 50 sampling iterations take about 50 minutes on our SGI machine.
While you wait, you can check the progress of the hyperparameters
using 'gp-display':

    > gp-display blog.gpl

    GAUSSIAN PROCESS IN FILE "blog.gpl" WITH INDEX 10

    HYPERPARAMETERS

    Constant part:

          1.000

    Jitter part:

            0.1

    Exponential parts:

          6.657
          0.179 :      0.179      0.179

Note that there are only two variable hyperparameters, since the
constant and jitter parts are fixed, and the relevance hyperparameters
for each input are tied to the higher-level relevance hyperparameter.
A time plot of these two variable hyperparameters can be obtained as
follows:

    > gp-plt t S1R1 blog.gpl | plot

Once sampling has finished, predictions for test cases can be made
using 'gp-pred'.  The following command prints only the average
classification performance:

    > gp-pred ma blog.gpl 21:%2

    Number of iterations used: 15

    Number of test cases: 200

    Fraction of guesses that were wrong:  0.0900+-0.0203

This takes about one minute on our SGI machine, which is considerably
slower than prediction using a network model, since a covariance
matrix must be inverted for each iteration used (saving the inverses
from each iteration would take up lots of disk space).  The accuracy
achieved is about the same as for the network model, which is not too
surprising, given that both models use a logistic function to relate
function values to target probabilities.
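
Schematically, the predictive mean of the function value for a test
case under one saved iteration has the standard Gaussian process
form, which requires factoring the covariance matrix for the training
cases (a sketch of the textbook formula, not of what 'gp-pred'
actually does):

    import numpy as np

    def predictive_mean(K, k_star, f):
        """k_star' K^{-1} f via a Cholesky factorization -- an O(n^3)
        step that must be redone for every iteration used, since the
        factorizations are not saved."""
        L = np.linalg.cholesky(K)
        return k_star @ np.linalg.solve(L.T, np.linalg.solve(L, f))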

It is possible to define a Gaussian process model that behaves like
what is known in statistics as a "probit" model rather than a
"logistic" model, simply by setting the amount of jitter to be large
compared to the span over which the logistic function varies from
nearly 0 to nearly 1.  This can be done by changing the 'gp-spec'
command to the following, in which the jitter part is 10 (the
constant part is also changed to be of the same magnitude):

    > gp-spec blog.gpp 2 1 10 - 10 / 0.05:0.5 0.05:0.5

One can then proceed as above.  The final performance is similar, but
the scale hyperparameter takes on much larger values.
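
A quick numerical check shows why this works: with jitter of standard
deviation s, the probability of a "1" is the logistic function
averaged over Gaussian noise of that size, and for s large compared
to the logistic's natural scale this average is essentially the
Gaussian cumulative distribution function, as in a probit model
(hypothetical values, for illustration only):

    import numpy as np
    from scipy.stats import norm

    def p_one(f, s, n=100000, rng=np.random.default_rng(2)):
        """P(target = 1) when latent value f gets N(0, s^2) jitter."""
        z = rng.standard_normal(n)
        return np.mean(1.0 / (1.0 + np.exp(-(f + s * z))))

    s = 10.0
    for f in (-10.0, 0.0, 10.0):
        # 1.7 is the usual logistic-to-probit scaling constant
        print(p_one(f, s), norm.cdf(f / np.sqrt(s**2 + 1.7**2)))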