Carl Edward Rasmussen
May 2, 1996
Early stopping is a technique for avoiding overfitting in models that use iterative learning procedures. One uses a model which is expected to have a too large capacity and through early stopping one ensures that the model does not overfit. Consequently this approach should be helpful in overcomming the bias/variance dilemma, since only an upper bound on model capacity is required rather than a good estimate of the optimal capacity.
A fraction of the training examples are held out for validation, and performance on this set is monitored while the iterative learning procedure is applied. Typically, the validation error will initially decrease as the model fits the data better and better, but later on, when the model begins to overfit, the validation error starts rising. The idea is to stop learning as soon as this minimum in validation error is achieved. For neural networks, one of the great advantages of early stopping is that the difficult question of how long to train the model vanishes.
A number of details have to be specified to make the method practically applicable. This section contains a discussion of the issues involved and the decisions that I have made for how to implement the method. I should stress that I am not claiming my choices to be optimal in any sense, rather I try to define a method which cannot in any obvious (and well documented) way be improved upon.
I use networks with a single hidden layer of hyperbolic tangent units and linear output units. All units have biases. The network is fully connected (including direct connections from inputs to outputs). The number of hidden units is chosen to be the smallest such that the number of weights is at least a large as the total number of training cases (after removal of the validation cases). We need to decide how large a fraction of the training cases to use for validation. There are two obvious conflicting interests; we want a large validation set to achieve a good estimate of performance and we want a large training set to be able to train a complex model, namely as complex a model as could be supported by the entire training data set. There seem to be no helpful theoretical results about this tradeoff (for non-linear models and finite amounts of data). I chose to use one third of the training examples (rounded down if necessary) for validation and the rest for training.
Another question is the exact criterion for when the training should stop, since we do not want to stop merely when small fluctuations in the validation error occur. A number of more or less complicated stop criteria have been proposed in the literature, but none has convincingly been shown to outperform others. Here I will use the very simple scheme of training until the iteration with smallest validation error lies 33% backwards in the run. In this way I will have trained for 50% more epochs than `necessary' and can thus be fairly confident that I have found a good minimum. I will do a validation after every line-search in the training run (training with conjugate gradients). Since the validation set is only half the size of the training set and validation only requires a forward pass, validation will typically slow down training by about 20% (since typically the line search involved in each iteration requires on average about 1.3 forward and backward passes), which I do not think is an alarming slowdown. More computer efficient strategies could be employed by validating only at longer intervals, at the risk of finding worse stopping times. If a simpler method of steepest descent with a fixed step-size was used as an optimisation technique, it would probably be sufficient to validate at longer intervals, but the conjugate gradient technique I use involves line-searches and can potentially take very large steps.
The idea of including the validation set and continuing training is appealing, since it is expected that better performance can be achieved with larger training sets. However, the problem of when to stop this additional training presents itself. One suggestion could be to train on until the same training error per example is achieved. From a theoretical point of view, this idea seems reasonable in an average case analysis, since it is expected that validation examples will have a slightly larger error initially than the training examples. However, there are practical problems; what if the validation error is already lower (per case) than the training error? And what if it is impossible to achieve the same training error using all the training data? Even if scenarios as bad as these do not occur frequently, it is still possible that the generalisation performance will often be decreased by using this procedure.
Instead of including all the training data and retraining, I will effectively use all the data by training an ensemble of networks, each of which is trained using a different validation set (chosen at random). This is a convenient way of combining the benefits of averaging by ensembles with the ability to use all examples for training. Using different validation sets for each net has the side effect that one cannot choose between networks and only include the ones with best validation error, since the variation in validation error might be due to the differing validation sets. Consequently I will include all trained nets in the final ensemble. Since the networks have many weights, they are probably not too prone to bad local minima, and selecting networks for inclusion in the ensemble is probably not necessary. A more complicated hybrid scheme with a few training iterations for each validation set could be contemplated, but for simplicity this will not be pursued further.
Additionally, we must take care of the cases where early stopping seems to stop too early or too late. In the initial phases of training it is conceivable that the validation error fluctuates quite wildly and this may cause the stopping criteria to be fulfilled (since the extra 50% epochs may be very few and thus not be a reliable indicator of local minima). To rule out this scenario, I require that each training run uses at least 2% of the total available training time and that at least 250 iterations have been done (unless the minimization algorithm converges). This will limit the number of possible members of the final ensemble to 50, which is probably adequate for getting most of the benefit from averaging. On the other hand, it may be that the validation error keeps decreasing and the stopping criteria is never satisfied. Recall that the expected behaviour of the validation error is that it should increase when overfitting sets in, but there is no guarantee that this will happen in any particular run. On the other hand, it may be that it simply takes a large number of epochs to train the network. To resolve this dilemma, I will terminate a training if it uses more than 33% of the total computational resources available, such that the final ensemble will contain at least 3 networks. Technically, some of the above criteria can be conflicting; in the actual implementation the conditions for deciding to stop the training of a particular net in order of precedence are: if the network has converged; OR if more than 33% of the allowed cpu time has been used on training this net; OR if more than 250 iterations of training has been completed AND the minimum validation error was achieved at least 33% backwards in the run AND we've used at least 2% of the total available cpu time on the current net.
This method can now be run automatically without any further specifications, except that a limit on the allowed amount of cpu time has to be specified. As a default I have chosen to let the allowed cpu time scale linearly with the number of training cases. The default allows 1.875 seconds per training case; this means that net net with 1024 training cases will be allowed to train for 32 minutes.