THE MLP-BGD-3 METHOD
Regression with multilayer perceptron networks
trained using batch gradient descent
with static input-dependent learning rates
Radford M. Neal, 23 July 1997
This method is the same as mlp-bgd-1, except that the learning rates
(stepsizes) for weights on connections out of input units are chosen
based on correlations with the target, in an attempt to make weights
from the more relevant inputs learn faster, thereby improving the
performance of early stopping. This method differs from mlp-bgd-2 in
that the learning rates are chosen once and for all at the beginning,
not adapted dynamically during learning. There must be a single
real-valued target for this method.
The effective learning rates (stepsizes) are actually set indirectly, by
scaling the inputs. Scaling an input by a factor of f has the same
effect as scaling the stepsize for that input by a factor of f^2,
since the gradient is multiplied by f, and the size of weight needed
to get the same effect is multiplied by 1/f.
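
The equivalence between input scaling and stepsize scaling can be
checked numerically. The sketch below is an illustration only, not
part of the method's implementation: it assumes a single linear unit
with one weight started at zero, trained on squared error, and the
names gd_predictions, f and eta are hypothetical.

```python
import numpy as np

def gd_predictions(x, t, stepsize, steps, x_new):
    """Batch gradient descent on squared error for y = w * x, starting at w = 0."""
    w = 0.0
    for _ in range(steps):
        # Gradient of 0.5 * mean((w*x - t)^2) with respect to w.
        grad = np.mean((w * x - t) * x)
        w -= stepsize * grad
    return w * x_new

rng = np.random.default_rng(0)
x = rng.normal(size=50)
t = 0.7 * x + 0.1 * rng.normal(size=50)
f, eta = 0.3, 0.05  # illustrative scale factor and base stepsize

# (a) Scale the input by f and keep the stepsize unchanged.
pred_scaled_input = gd_predictions(f * x, t, eta, 20, f * x)

# (b) Keep the input unchanged and scale the stepsize by f^2.
pred_scaled_step = gd_predictions(x, t, eta * f**2, 20, x)

# The two runs produce the same predictions up to rounding error.
print(np.max(np.abs(pred_scaled_input - pred_scaled_step)))
```

With a nonzero starting weight, the two runs match only if the
starting weight in (a) is the one in (b) multiplied by 1/f.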
The scaling for an input, xi, is computed from the average value over
the entire training set of xi*t, where t is the target value. The
method is used with the standard DELVE encodings, which normalize the
inputs and targets, based on the median and average absolute
deviation. Since this will usually be close to normalizing based on
mean and standard deviation, the average value of xi*t will be close
to the correlation of the input with the target.
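
The computation described above might be sketched as follows. This is
a rough illustration, not the code used by the method: the function
names are hypothetical, the data are synthetic, and the normalization
is assumed to centre on the median and divide by the average absolute
deviation from the median, which may differ in detail from the
standard DELVE encoding.

```python
import numpy as np

def normalize_median_aad(v):
    """Centre each column on its median and divide by the average
    absolute deviation from that median (assumed form of the encoding)."""
    centred = v - np.median(v, axis=0)
    return centred / np.mean(np.abs(centred), axis=0)

def average_products(X, t):
    """Return c_i, the average over training cases of x_i * t, for each input i."""
    return (X * t[:, None]).mean(axis=0)

# Synthetic training set: input 0 is relevant, input 1 weakly so, input 2 not at all.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 3))
t_raw = 2.0 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(size=200)

X = normalize_median_aad(X_raw)
t = normalize_median_aad(t_raw)
c = average_products(X, t)
print(c)  # one value per input; larger magnitude suggests a more relevant input
```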
Once ci, the average value of xi*t, is computed for each input, input
i is scaled by ci^2 divided by the maximum value of ci^2 over all the
inputs, except that the scaling factor is set to 0.01 if it would otherwise
be less than this. The effect is similar to the initial adjustment of
learning rates in mlp-bgd-2, but for mlp-bgd-3, the scaling is fixed
during learning.
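
The scaling rule above can be written out in a few lines. This is a
minimal sketch, not the "corrscale" program itself: the function name
input_scale_factors and the example values of ci are made up for
illustration.

```python
import numpy as np

def input_scale_factors(c, floor=0.01):
    """Scale factor for each input: c_i^2 / max_j c_j^2, but never below `floor`.
    Assumes at least one c_i is nonzero."""
    c2 = np.asarray(c, dtype=float) ** 2
    return np.maximum(c2 / c2.max(), floor)

c = [0.8, -0.4, 0.05]            # hypothetical averages of xi*t
scales = input_scale_factors(c)
print(scales)                    # approximately [1.0, 0.25, 0.01]

# Since scaling an input by f acts like scaling its stepsize by f^2,
# the relative effective stepsizes are the squares of the scale factors.
print(scales ** 2)               # approximately [1.0, 0.0625, 0.0001]
```

In this example the third input would get a factor of about 0.0039
from the ratio alone, so the floor of 0.01 applies.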
All other aspects of the procedure are the same as for mlp-bgd-1 and
mlp-bgd-2. The "runr" shell script used to implement the method works the
same way as for mlp-bgd-1 and mlp-bgd-2. The "corrscale" program is
used to produce a set of arguments to data-spec that implement the
scaling.