THE BASE-1 METHOD
Base-line prediction using means, medians, or class frequencies
Radford Neal, 6 May 1996
The base-1 method is intended to provide a base-line of performance
that can be obtained by completely ignoring the input attributes,
basing prediction solely on simple statistics regarding the targets in
training cases - namely, the mean and median of the training targets,
when these targets are numeric, and the frequencies of classes in the
training set, when the targets are categorical.
The method is appropriate for use with all loss functions except log
probability loss ("L"). Currently, the method cannot be used when
there is more than one target attribute.
The base-1 method is implemented using two programs, "baser" and
"basec", with "baser" being used for tasks where the targets are
numeric, and "basec" being used when the targets are categorical.
Numeric targets for use with "baser" can be encoded with or without
normalization (it should make no difference to the result).
Categorical targets for use with "basec" should be encoded in "0-up"
form - i.e., with the classes encoded as integers 0, 1, 2, etc., in
the same order as they are listed in the dataset specification.
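As an illustration of the "0-up" encoding, the following Python sketch maps
class names to integers in the order in which they are listed. The class
names and the variable names are hypothetical, chosen only for this example;
they do not come from any actual dataset specification.

```python
# Hypothetical class names, in the order a dataset specification might list them.
class_names = ["setosa", "versicolor", "virginica"]

# "0-up" encoding: the first listed class becomes 0, the second 1, and so on.
zero_up = {name: code for code, name in enumerate(class_names)}

# Encode a few hypothetical targets.
targets = ["versicolor", "setosa", "virginica", "versicolor"]
encoded = [zero_up[t] for t in targets]
print(encoded)
```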
In detail, the two programs operate as follows:
BASER - BASE-LINE PREDICTION FOR NUMERIC TARGETS USING MEAN AND MEDIAN
Usage:
baser instance
Reads cases from train.n, where n is the instance number given as the
argument, ignoring all but the last number on each line, which
should be the target in that training case (a number). Also reads a
file of inputs for test cases from test.n, which it completely ignores,
except to count how many test cases there are.
Writes two files, each having as many lines as there are test cases,
with each line being the same. The lines in the file cguess.S.n contain
the mean of the training targets, which is a reasonable guess for
squared-error loss. The lines in the file cguess.A.n contain the
median of the training targets, which is a reasonable guess for
absolute-error loss. When the number of targets is even, the median is
the average of the two middle targets.
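The computation described above can be sketched in Python as follows. This
is an illustration only, not the actual "baser" program: the function names
and the sample data are assumptions, and the sketch returns the guesses as
lists rather than writing cguess.S.n and cguess.A.n.

```python
import statistics

def read_targets(lines):
    """Take the last whitespace-separated number on each non-empty line."""
    return [float(line.split()[-1]) for line in lines if line.strip()]

def baser_guesses(train_lines, n_test):
    """Return (mean guesses, median guesses), one identical entry per test case."""
    targets = read_targets(train_lines)
    mean = statistics.mean(targets)      # guess for squared-error loss
    median = statistics.median(targets)  # averages the two middle values if the count is even
    return [mean] * n_test, [median] * n_test

# Hypothetical training cases: inputs (ignored) followed by the target.
train = ["0.5 1.2 3.0", "0.1 0.7 1.0", "0.9 0.3 2.0", "0.4 0.8 10.0"]
guess_s, guess_a = baser_guesses(train, n_test=3)
print(guess_s)
print(guess_a)
```

With these four targets (3, 1, 2, 10) the mean is 4.0 and the median is 2.5,
the average of the two middle targets 2 and 3.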
This method does not produce the predictive distributions that would be
required for evaluation by log probability loss.
BASEC - PREDICTION FOR CLASS TARGETS USING BASE RATES
Usage:
basec #classes instance
Reads cases from train.n, where n is the instance number given as the
second argument, ignoring all but the last number on each line, which
should be the class of that training case (a number from 0 up to the
number of classes minus 1). Also reads a file of inputs for test cases
from test.n, which it completely ignores, except to count how many test
cases there are.
Writes two files, each having as many lines as there are test cases,
with each line being the same. The lines in the file cguess.n contain
the most frequent class from the training data, with ties resolved by
picking the lower-numbered class. This is a reasonable guess for 0-1
loss. The lines in the file prob.n contain the probabilities of the
classes, estimated by the frequencies of the classes in the training set.
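The computation described above can be sketched in Python as follows. This
is an illustration only, not the actual "basec" program: the function name
and the sample data are assumptions, and the sketch returns the guesses and
probabilities as lists rather than writing cguess.n and prob.n.

```python
from collections import Counter

def basec_guesses(train_lines, n_classes, n_test):
    """Return (class guesses, class probabilities), one identical entry per test case."""
    # The class is the last number on each non-empty training line.
    classes = [int(line.split()[-1]) for line in train_lines if line.strip()]
    counts = Counter(classes)
    # Relative frequency of each class; classes never seen get probability 0.
    freqs = [counts.get(c, 0) / len(classes) for c in range(n_classes)]
    # max() returns the first maximal item, so ties go to the lower-numbered class.
    guess = max(range(n_classes), key=lambda c: counts.get(c, 0))
    return [guess] * n_test, [freqs] * n_test

# Hypothetical training cases: inputs (ignored) followed by the class.
train = ["1.5 2", "0.3 0", "2.2 2", "0.9 0", "1.1 1"]
guesses, probs = basec_guesses(train, n_classes=4, n_test=2)
print(guesses)
print(probs[0])
```

Here classes 0 and 2 each occur twice, so the tie is resolved in favour of
class 0, and class 3, which never occurs, receives probability zero.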
The probability for a class can be zero, if the class does not occur
as the target for any case in the training set. This may make this
method unsuitable when log probability loss is being used, as the
loss can be infinite.
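The following Python sketch illustrates how a zero probability leads to an
infinite loss. It assumes that log probability loss is the negative natural
logarithm of the probability assigned to the true class; the function name
and the probabilities are hypothetical.

```python
import math

def log_prob_loss(probs, true_class):
    """Negative log of the probability assigned to the true class (assumed definition)."""
    p = probs[true_class]
    # log(0) is undefined in floating point, so report the limit explicitly.
    return math.inf if p == 0.0 else -math.log(p)

# Hypothetical class probabilities in which class 3 never occurred in training.
probs = [0.4, 0.2, 0.4, 0.0]

loss_seen = log_prob_loss(probs, 0)    # finite: the class occurred in training
loss_unseen = log_prob_loss(probs, 3)  # infinite: the class had zero frequency
print(loss_seen, loss_unseen)
```

A single test case whose class was absent from the training set is therefore
enough to make the total loss infinite.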