```

A REGRESSION PROBLEM WITH OUTLIERS

Finally, we will go back to the simple regression problem we started
with, but now some of the cases will be "outliers", for which the
noise is much greater than for normal cases.

In this synthetic data, the input variable, x, again had a standard
Gaussian distribution and the corresponding target value came from a
distribution with mean given by

0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2)

standard deviation 0.1.  However, with probability 0.05, a case is an
"outlier", for which the standard deviation was 1.0 instead.

I generated 200 cases in total, stored in the file 'odata'.  The first
100 of these cases are meant for training, the second 100 for testing.
It is also possible to test on 'rdata', to see how well the function
learned predicts data that is never corrupted by high noise.

A neural network model for regression with outliers.

One way to model data with "outliers" is to let the noise level vary
from one case to another.  If the noise for the outlier cases is set
to be higher, they will end up having less influence on the function
learned, as is desirable.  The software allows the noise variance for
a case to vary according to an inverse gamma distribution.  This is
effectively the same as letting the noise have a t-distribution rather
than a Gaussian distribution.

The commands used to do this are as follows:

> net-spec olog.net 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100
> model-spec olog.net real 0.05:0.5::4
> data-spec olog.net 1 1 / odata@1:100 . odata@101:200 .

> net-gen olog.net fix 0.5
> mc-spec olog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
> net-mc olog.net 1

> mc-spec olog.net sample-sigmas heatbath hybrid 1000:10 0.4
> net-mc olog.net 400

The crucial difference is in the 'model-spec' command, where the noise
prior of 0.05:0.5::4 specifies that the per-case noise precision
(inverse variance) follows a gamma distribution with shape parameter
of 4.  When this is integrated over, a t-distribution with 4 degrees
of freedom results.  This t-distribution is by no means an exact model
of the way the noise was actually generated, but its fairly heavy
tails are enough to prevent the model from paying undue attention to
the outliers.

The above commands take 165 seconds on the system used (see
Ex-system.doc).  The resulting model can be tested on data from the
same source using net-pred:

> net-pred na olog.net 101:

Number of iterations used: 300

Number of test cases: 100

Average squared error guessing mean:   0.01917+-0.01057

One can also see how well the model does on the uncorrupted data that
was used originally:

> net-pred na olog.net 101: / rdata@101:200 .

Number of iterations used: 300

Number of test cases: 100

Average squared error guessing mean:   0.00927+-0.00124

This is similar to the results obtained earlier with the model trained
on uncorrupted data (slightly better, in fact, but that's just luck).

In contrast, the results are substantially worse when the data with
outliers is used to train a standard model where the noise is
Gaussian, with the same variance for each case.

A Gaussian process model for regression with outliers.

Gaussian process regression can also use a t-distribution for the
noise, specified using 'model-spec', as above.  Implementation of this
model requires sampling for function values in training cases, so a
small amount of "jitter" will almost always have to be included in the
covariance function.  A "sample-variances" operation must also be
specified in 'mc-spec', to allow the case-by-case noise variances to
be sampled.  The following commands illustrate how this is done:

> gp-spec olog.gpt 1 1 1 - 0.001 / 0.05:0.5 0.05:0.5
> model-spec olog.gpt real 0.05:0.5::4
> data-spec olog.gpt 1 1 / odata@1:100 . odata@101:200 .

> gp-gen olog.gpt fix 0.5 0.1
> mc-spec olog.gpt sample-variances heatbath hybrid 20:4 0.5
> gp-mc olog.gpt 200

This takes 30 seconds on the system used (see Ex-system.doc).  The
progress of the run can be monitored by examining the case-by-case
noise standard deviations in (say) the first 8 training cases, as
follows:

> gp-plt t v@0:7 olog.gpt | plot

Once the run has converged, a few of these standard deviations (for
cases that are outliers) should be much bigger than the others.  The
noise standard deviations can also be examined using the "-n" option
of 'gp-display'.

Predictions can be made using 'gp-pred':

> gp-pred na olog.gpt 101:%5

Number of iterations used: 20

Number of test cases: 100

Average squared error guessing mean:   0.01939+-0.01134

This performance is very similar to that of the network model.
```