knn-class-1: K nearest neighbour classifier
knn-class-1 is a K-nearest neighbour (K-NN) implementation for
classification tasks (ie ones in which all the target
attributes are categorical). The present version assumes
that there is only a single categorical output.
The underlying program is fairly general and can accept
various user specified options. Here I will describe
the options used when run as knn-class-1 in the DELVE
environment.
Overview:
The K-NN algorithm is very simple. Given a set
of labelled training examples, T, and an unseen test point,
x, compute the squared distance in input space from x to each
of the examples in T. Discard all, training cases except for
the K cases which are closest to the test point. Assign the
test point to the most numerous class amongst these K-NN. A
probabilty distribution over classes is also easily constructed. The
probabilty of class i (p_i) is simply the ratio:
p_i = n_i / sum_j { n_j}
where n_j is number of nearest neighbours in class j.
Loss functions.
Both zero-one and
squared-probability loss are supported
Picking K.
A leave-one-out cross-validation scheme is
employed to pick suitable values of K for a given training
set. The zero-one loss is used to pick K for 'guesses', while
the squared-probability loss is used to pick K for producing
probability distribution predictions.
Breaking Ties:
When forced to guess a single class,
ambiguities can occur when more than 1 class
all have the largest number of Nearest Neighbours to the test
point. These ties are broken by picking the class with the
smallest summed squared distance to the winning point.
Avoiding numerical inconsistencies:
Results can become
implementation dependent in the situation where the ... k-1,
k, k+1, ... neighbours all have squared distances from the test
point that are within epsilon of each other, where epsilon
is the machine accuracy. There is then ambiguity as to which K
neighbours are chosen which in turn will affect predictions.
In attempt to avoid this undesirable situation, we detect it,
and situate all ... k-1, k, k+1, .. points at the
same distance d, where d is the average squared distance of
these points. We also down-weight the count of these examples,
so that the sum over all neighbours still yields a total count
of K.
Running the program:
To get on-line help type :
unix> knnClass -h
Class coding: The program assumes that the class is coded as an
integer. In Delve terminology this means that the class should be coded using
the 0-up encoding (rather than the default, 1-of-n, for categorical
variables).
As an aid to running under Delve, I use the script
knn-class-1_sh, which is supplied with the program
source. This script selects the options that were used to
generate the results in this hierarchy.