NET-GD:  Train a network by gradient descent in the error.

Net-gd trains the parameters of a network by old-fashioned gradient
descent in the error (which is minus the log likelihood, plus minus
the log prior).  The hyperparameters are not updated by this program.
A scheme using differential stepsizes for weights out of different
units can be used (in combination with "early stopping") to try to get
some of the advantages of a hierarchical Bayesian model.

Computaton of the network model log likelihood and its gradient may be
done on a GPU (except for survival models), if a version of net-gd
compiled for a GPU is used.

Usage:

    net-gd log-file ["@"]iteration [ save-mod ] / stepsize { stepsize } 
           [ method [ group { group } ] ]

Gradient descent in the error is done starting with the last iteration
saved in the log file and continuing up to the indicated iteration (or
if the iteration is immediately preceded by "@", until a total of
iteration minutes of cpu-time have been used, for all iterations).
Results of the simulation are appended to the log file for iterations
that are divisible by save-mod, which defaults to one (every iteration
saved).  

If the log file does not contain a network to start with, a network is
randomly created in which the parameters are drawn from the interval
(-0.01,+0.01).  The hyperparameters are set to the centres of their
priors.

Stepsizes for the gradient descent procedure must be specified. If
just one stepsize is given, it applies to all parameters.  If more
than one is given, there must be one stepsize for each group of
parameters, where a group corresponds to one "prior" argument of
net-spec, and to one section of the output of net-display.  The
stepsizes are all scaled down by the number of training cases plus one
before being used.

The gradient descent method can be either "online" (the default) or
"batch".  For simple online gradient descent, each iteration consists
of as many updates as there are training cases, with each update being
based on the gradient due to one training case (taken in sequence),
plus the prior gradient divided by the number of training cases.  For
simple batch gradient descent, each iteration consists of one update
of the parameters, using the total gradient based on all training
cases (and the prior gradient).  For very small stepsizes, the batch
and online methods should produce the same results.  For larger
stepsizes, the online method is usually faster, and it can also often
use a larger stepsize without causing instability.  On-line gradient
descent does not find the exact minimum, however, except in the limit
as the stepsize goes to zero.

Differential stepsizes can be used for weights in specified groups
that originate in different units - for example, in order to try to
mimic the effect of the Bayesian "Automatic Relevance Determination"
model.  The group indexes (from 1) for which this should be done are
listed after the method (which must therefore not be left to default).
Stepsizes for weights in these groups are found by computing or
estimating the magnitude of the gradient for weights in this group
that originate in each of the source units.  The stepsize for the
weights out of a unit is the stepsize specified for the group, times
the magnitude of the gradient for weights out of that unit raised to
the fourth power, divided by the maximum of this fourth-power
magnitude over all units in the group.  For batch gradient descent,
the gradient magnitude for weights out of a unit is based on the total
gradient for all training cases, plus the prior.  For online gradient
descent, the same is used for the first iteration, but thereafter, the
total gradient for the previous pass (plus the prior) is used, even
though this sum is based on different parameter values for different
cases.

Differential stepsizes may not be used for groups of connections
specified with a configuration file.

Note that gradient descent learning with essentially no prior may be
done by specifying the "priors" using just "+" and "-" (as described
in net-spec.doc).  However, if no prior is used, it may be necessary
to use one of the networks found before convergence of the gradient
descent procedure has been reached ("early stopping"), selected on the
basis of the error on a validation set.  The differential stepsizes
are intended to improve the performance of early stopping, by stopping
the weights out of some units from overfitting before the weights out
of other units have had time to adjust to their proper values.

            Copyright (c) 1995-2004 by Radford M. Neal