Carl Edward Rasmussen
May 1, 1996
The linear model is one of the most popular models used in data analysis. The reason for its popularity is that it is both conceptually and computationally simple to fit a linear model, and the resulting model is easily given an intuitive representation. The most prominent deficiency of the linear model is its strong assumptions about the true relationship in the data; for data which do not conform well to a linear model, predictions may be inaccurate.
I have implemented the linear model, which I will call lin-1. The strategy for fitting a linear model is well known, and the only heuristic necessary is a principled way of handling the situation where the linear system used to determine the parameters of the model is close to singular.
For simplicity, I will initially assume that the targets are scalar; later on a simple extension to handle multiple outputs will be given. The linear model is defined as
y(x) = \sum_{i=0}^{m} w_i x_i,    (1)
where w_0, ..., w_m are the m+1 model parameters, w_0 being the bias, and x_1, ..., x_m are the inputs, augmented by x_0 = 1 to take care of the bias. The model is fit to the training data by maximum likelihood, assuming zero mean Gaussian noise with variance \sigma^2. The likelihood is
p(t | x, w, \sigma^2) = \prod_{c=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\big( -(t^{(c)} - y(x^{(c)}))^2 / (2\sigma^2) \big),    (2)

where t^{(c)} and x^{(c)} are the target and input of training case c, for c = 1, ..., n.
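As a concrete sketch of the model in eq. (1) (the function and array names below are my own, not part of the report), the inputs can be augmented with a constant x_0 = 1 so that the bias is handled as an ordinary weight:

```python
import numpy as np

def augment(X):
    """Prepend a column of ones (x_0 = 1) so the bias w_0 is an ordinary weight."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def linear_model(w, X):
    """Eq. (1): y(x) = sum_{i=0}^m w_i x_i over the augmented inputs."""
    return augment(X) @ w

# With w = [bias, slope] this is y = 1 + 2x.
w = np.array([1.0, 2.0])
X = np.array([[0.0], [1.0], [3.0]])
y = linear_model(w, X)   # -> array([1., 3., 7.])
```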
The maximum likelihood estimate for the weights is independent of \sigma^2 and is found by minimizing the cost function
E(w) = \frac{1}{2} \sum_{c=1}^{n} (t^{(c)} - y(x^{(c)}))^2    (3)
with respect to the model parameters. Notice that the Gaussian noise assumption gives rise to a squared error cost function; this cost function will always be used regardless of which loss function we choose to evaluate the linear model. The solution to this optimization problem is well known from linear algebra; if the solution is unique, then

w = A^{-1} b,  where  A = X^T X  and  b = X^T t,    (4)

X being the n \times (m+1) matrix of (augmented) training inputs and t the vector of training targets.
If A is ill-conditioned, numerical evaluation of eq. (4) may be troublesome, and even if fairly accurate solutions could be obtained, these would not necessarily lead to good model predictions. In an attempt to define a method which has a high degree of reproducibility, I propose to choose w such that directions in input space with insufficient variation in the training set are ignored by the model. This can conveniently be computed using singular value decomposition (SVD), see for example [Press et al. 1992]. The decomposition is A = U diag(\lambda) V^T, where U and V are orthonormal matrices and \lambda is a vector of length m+1 containing the singular values of A. A regularised inverse of A can be computed by setting \lambda_i^{-1} = 0 for those i whose \lambda_i are too small in

A^{-1} = V diag(\lambda_i^{-1}) U^T.    (5)
The exact criterion for regularisation is: set \lambda_i^{-1} = 0 whenever \lambda_i < 10^{-10} \lambda_{max}, i.e., whenever A is close to singular. The constant of 10^{-10} is chosen as a rather conservative estimate of machine precision, which will not interfere when A is well-conditioned. The condition number of A depends on the scale of the inputs, so the procedure should always be used in conjunction with the standard normalisations provided by DELVE. The (modified) maximum likelihood weights can then be computed as

w = V diag(\lambda_i^{-1}) U^T b.    (6)
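The fitting procedure up to this point, normal equations plus SVD regularisation, can be sketched in NumPy as follows; the function name and the relative threshold argument are my own illustration of the criterion above, not code from the report:

```python
import numpy as np

def fit_lin1(X, t, tol=1e-10):
    """ML weights via eqs. (4)-(6) with SVD regularisation of A.

    X: n x m raw inputs, t: n targets.  Directions whose singular value
    falls below tol * lambda_max are ignored (their reciprocal is zeroed).
    Returns the weights and k, the number of effectively fitted parameters.
    """
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # augment with x_0 = 1
    A = Phi.T @ Phi                                  # matrix of eq. (4)
    b = Phi.T @ t
    U, lam, Vt = np.linalg.svd(A)                    # A = U diag(lam) V^T
    inv_lam = np.zeros_like(lam)
    keep = lam >= tol * lam.max()
    inv_lam[keep] = 1.0 / lam[keep]                  # eq. (5): zero small reciprocals
    w = Vt.T @ (inv_lam * (U.T @ b))                 # eq. (6)
    return w, int(keep.sum())

# Well-conditioned case: y = 1 + 2x, both parameters are determined.
w, k = fit_lin1(np.array([[0.0], [1.0], [2.0], [3.0]]),
                np.array([1.0, 3.0, 5.0, 7.0]))      # w ~ [1, 2], k = 2

# Duplicated input column: one direction has no variation and is ignored.
X_dup = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
w_dup, k_dup = fit_lin1(X_dup, np.array([1.0, 3.0, 5.0]))  # k_dup = 2, not 3
```

The duplicated-column example shows the point of the regularisation: the near-zero singular value is dropped rather than inverted, so the fit stays reproducible.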
We now derive the predictive distribution for the model. For simplicity we will derive the predictive distribution for a fixed value of the noise, and only account for uncertainty in the predictions arising from the estimated noise inherent in the data and from uncertainty in the estimate for the weights. The noise is estimated by

\hat{\sigma}^2 = \frac{1}{n-k} \sum_{c=1}^{n} (t^{(c)} - y(x^{(c)}))^2,    (7)
where k is the number of parameters in w which were fit by the data; this is m+1 minus one for every singular value whose reciprocal was set to zero in eq. (5). This estimate may break down if there are too few training cases (if n \le k), in which case we can't compute a predictive distribution. Since the prior on the weights is uniform (non-existent), the posterior for the weights is proportional to the likelihood. Thus, the predictive distribution for a test case with input x^* is
p(t^* | x^*) = \int p(t^* | x^*, w) \, p(w | t) \, dw.    (8)
This Gaussian integral can be solved exactly; we get a Gaussian predictive distribution with mean and variance

\mu^* = w^T x^*,    \sigma^{*2} = \hat{\sigma}^2 (1 + x^{*T} A^{-1} x^*).    (9)
The optimal point predictions for any symmetric loss function are given by the mean \mu^*. Consequently, these predictions can be used for both absolute error and squared error loss functions. For the negative log density loss function, we compute the log of the density of the targets under the predictive Gaussian distribution

\log p(t^*) = -\frac{1}{2} \log(2\pi\sigma^{*2}) - \frac{(t^* - \mu^*)^2}{2\sigma^{*2}}.    (10)
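A sketch of the resulting predictions (my own naming; A_inv stands for the regularised inverse of A and k for the number of fitted parameters):

```python
import numpy as np

def predictive(Phi, t, w, A_inv, k, x_star):
    """Predictive mean, variance and a log-density function, eqs. (7)-(10).

    Phi: n x (m+1) augmented training inputs, t: targets, w: fitted weights,
    A_inv: (regularised) inverse of A = Phi^T Phi, k: number of fitted params.
    """
    n = Phi.shape[0]
    resid = t - Phi @ w
    sigma2 = resid @ resid / (n - k)                 # eq. (7): noise estimate
    mean = w @ x_star                                # eq. (9): predictive mean
    var = sigma2 * (1.0 + x_star @ A_inv @ x_star)   # eq. (9): predictive variance
    def log_density(t_star):                         # eq. (10)
        return -0.5 * np.log(2.0 * np.pi * var) - (t_star - mean) ** 2 / (2.0 * var)
    return mean, var, log_density

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])
A_inv = np.linalg.pinv(Phi.T @ Phi)      # stands in for the inverse of eq. (5)
w = A_inv @ Phi.T @ t
mean, var, log_density = predictive(Phi, t, w, A_inv, 2, np.array([1.0, 1.5]))
# mean = 4.0 and var ~ 0.051 for this data
```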
For tasks that have multiple outputs we can re-use the decomposition of A from eq. (5) for every set of weights. The maximum likelihood weights and inherent noise estimates can be found by using eq. (6) and (7) for each target attribute, and point predictions can be obtained from the maximum likelihood weights as before. For the log density predictions we assume that the joint density for the targets is Gaussian with a diagonal covariance matrix, such that the density of the targets can be obtained by summing (in the log domain) contributions of the form in eq. (10).
This completes the definition of the linear model; following this recipe the model will always be uniquely defined from the data. Many elaborations to this basic linear model exist, but they will not be pursued further here.
The computational complexity involved in fitting lin-1 is O(nm^2 + m^3), for computing and decomposing the matrix A respectively. Even for fairly large tasks this can usually be considered trivial.