A PROBLEM WITH A BINARY RESPONSE

In this example, each case consists of two real-valued inputs, x1 and
x2, and a binary target.  Data was generated by drawing the inputs for
a case independently from standard Gaussian distributions.  The target
was then set to "1" with the following probability:

    exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 )

ie, the negative exponential of the fourth power of the distance of
(x1,x2) from (0.5,-0.3).  I generated 500 cases, and used the first
300 for training and the remaining 200 for testing.


A neural network model for the binary response problem.

We will try modeling this data using a network with one layer of 15
hidden units and a single output unit.  For a binary target, a
Gaussian data model would be inappropriate; instead, we can use a
logistic regression model, in which the probability of the target
being "1" is obtained by passing the value of the output unit through
the logistic function, f(x)=1/(1+exp(-x)).

The network, model, and data specification commands needed are quite
similar to those used in the regression problem (see Ex-netgp-r.doc):

    > net-spec blog.net 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 
    > model-spec blog.net binary
    > data-spec blog.net 2 1 2 / bdata@1:300 . bdata@301:500 .
    Number of training cases: 300
    Number of test cases: 200

The 'net-spec' command here differs only in the number of input units
(2) and the number of hidden units (15).  The data model used is
"binary", with no need for a noise level prior.

The 'data-spec' command also says that there are two inputs, and one
target.  It also has a third argument of "2" just before the "/",
which indicates that the target must be an integer with two possible
values (which are "0" and "1").

The initial phase commands that were used for the simple regression
problem turn out to be adequate for this problem as well:

    > net-gen blog.net fix 0.5
    > mc-spec blog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc blog.net 1

(The sample-noise operation above is actually unnecessary, since there
is no noise for this model.)  For the sampling phase, we could also
try using commands similar to those presented above, but we might as
well try something different (the mc-spec command needed here is long,
so it is continued by putting a "\" at the end of the first line):

    > mc-spec blog.net repeat 10 sample-sigmas heatbath 0.95 \
                             hybrid 100:10 0.3 negate
    > net-mc blog.net 200

The above 'mc-spec' command specifies the variant of hybrid Monte
Carlo with "persistence" in which relatively short trajectories are
used, but random walks are suppressed by only partially replacing the
momentum in the 'heatbath' step.  Note that for this to work well, the
rejection rate must be quite small.  This alternative method has the
advantage that it allows for more frequent hyperparameter updates
(with 'sample-sigmas').

On our 550MHz Pentium III, the 200 iterations requested above take
about eleven minutes.  During this time, one can monitor the
simulation, for instance with the command:

    > net-plt t h1h2h3 blog.net | plot

where again, 'plot' is some appropriate plotting program, which in
this case must be capable of plotting three superimposed graphs, for
the three hyperparameters h1, h2, and h3.  The values for these
hyperparameters exhibit quite a high degree of autocorrelation, which
is why it is advisable to allow the simulation to go for 200
iterations, in order to be more confident that the simulation has
explored all high-probability regions of the posterior distribution.

Once the run has finished, we can make predictions using 'net-pred',
based, say, on the networks from the final 3/4 of the run (which is a
generally reasonable portion).  For a problem such as this, where the
response is binary, we might be interested in guessing whether the
target is '0' or '1', with performance measured by how often we are
right.  We can get such predictions as follows:

    > net-pred itmp blog.net 51: 

    Number of iterations used: 150

    Case  Inputs        Targets  Log Prob  Guesses  Wrong?

       1   -1.56   0.90    0.00    -0.001      0       0
       2    0.09  -0.13    1.00    -0.054      1       0
       3   -0.79   0.85    0.00    -0.003      0       0

                     (some lines omitted )

      99    0.14   0.86    1.00    -2.590      0       1
     100    0.21   0.18    1.00    -0.097      1       0
     101   -1.32  -0.19    0.00    -0.008      0       0

                     (some lines omitted )

     198    1.49   0.70    0.00    -0.028      0       0
     199   -2.23  -0.28    0.00    -0.002      0       0
     200   -0.91  -0.03    0.00    -0.055      0       0

    Average log probability of targets:    -0.245+-0.034

    Fraction of guesses that were wrong:  0.0850+-0.0198

The "itmp" options ask for output listing the inputs, the targets, the
the guesses based on the mode of the predictive distribution, and the
log of the probability of the correct target in the predictive
distribution.  A summary is also printed showing the average log
probability of the correct target and the fraction of guesses that
were wrong, along with the standard errors for these as estimates of
performance on the population of cases.  The average log probability
is an overall indication of how well the model has done.  The
classification error rate might be more directly relevant for some
applications.


A Gaussian process model for the binary response problem.

This binary data can be modeled using a Gaussian process as well.  The
Gaussian process specifies a distribution over real-valued functions.
The function value for the inputs in a particular case can be passed
through the logistic function, f(x) = 1/(1+exp(-x)), to give the
probability that the target in that case will be '1' rather than '0'.

The specifications for a Gaussian process model analogous to the
network model used above are as follows:

    > gp-spec blog.gpl 2 1 1 - 0.1 / 0.05:0.5 0.05:0.5
    > model-spec blog.gpl binary
    > data-spec blog.gpl 2 1 2 / bdata@1:300 . bdata@301:500 .
    Number of training cases: 300
    Number of test cases: 200

The 'gp-spec' command used here does not quite correspond to the
'net-spec' command used above.  The following command would give a
closer correspondence:

    > gp-spec blog.gpl 2 1 100 / 0.05:0.5 0.05:0.5

The reason for modifying this, by reducing the constant part of the
covariance (from 100 to 1) and introducing a small (0.1) "jitter"
part, is to solve computational problems of poor conditioning and slow
convergence.  The "jitter" part lets the function value vary randomly
from case to case, as if a small amount of Gaussian noise were added.
This makes the function values less tightly tied to each other, easing
the computational difficulties (with the unmodified specification,
'gp-mc' would probably work only if the dataset was very small).
These changes have little effect on the statistical properties of the
model, at least for this problem.

Sampling might now be done using the following commands:

    > gp-gen blog.gpl fix 0.5 1
    > mc-spec blog.gpl repeat 4 scan-values 200 heatbath hybrid 8 0.3
    > gp-mc blog.gpl 50

These commands (and variations on them) are similar to those that you
might use for a regression problem, except that a "scan-values 200"
operation has been included.  This operation performs 200 Gibbs
sampling scans over the function values for each training case.  An
explicit representation of function values is essential for a binary
model, though not for a simple regression model, because the
relationship of function values to targets is not Gaussian.
Unfortunately, for the very same reason, the direct "sample-values"
operation is not available, so one must use Gibbs sampling or
Metropolis-Hastings updates (the met-values or mh-values operations).

The 50 sampling iterations take about thirteen minutes on our 550MHz
Pentium III.  While you wait, you can check the progress of the
hyperparameters using gp-display:

    > gp-display blog.gpl

    GAUSSIAN PROCESS IN FILE "blog.gpl" WITH INDEX 10

    HYPERPARAMETERS

    Constant part:

          1.000

    Jitter part:

            0.1

    Exponential parts:

         10.418
          0.486 :      0.486      0.486

Note that there are only two variable hyperparameters, since the
constant and jitter parts are fixed, and the relevance hyperparameters
for each input are tied to the higher-level relevance hyperparameter.
A time plot of two variable hyperparameters can be obtained as follows:

    > gp-plt t S1R1 blog.gpl | plot

Once sampling has finished, predictions for test cases can be made
using 'gp-pred'.  The following command prints only the average
classification performance:

    > gp-pred ma blog.gpl 21:%2

    Number of iterations used: 15

    Number of test cases: 200

    Fraction of guesses that were wrong:  0.0950+-0.0208

This takes about about five seconds on our machine.  This is slower
than prediction using a network model, since the Cholesky
decomposition of the covariance matrix must be found for each
iteration used (saving these would take up lots of disk space).  The
accuracy achieved is about the same as for the network model, which is
not too surprising, given that they both use a logistic function to
relate function values to target probabilities.

It is possible to define a Gaussian process model that will behave as
what is known is statistics as a "probit" model rather than a
"logistic" model.  This is done by simply setting the amount of jitter
to be large compared to the span over which the logistic function
varies from nearly 0 to nearly 1.  This can be done by changing the
'gp-spec' command to the following, in which the jitter part is 10
(the constant part is also changed to be of the same magnitude):

    > gp-spec blog.gpp 2 1 10 - 10 / 0.05:0.5 0.05:0.5

One can then proceed much as above (though it turns out that a larger
stepsize can be used).  The final performance is similar, but the
scale hyperparameter takes on much larger values.