knn-class-1: K nearest neighbour classifier

knn-class-1 is a K-nearest neighbour (K-NN) implementation for classification tasks (ie ones in which all the target attributes are categorical). The present version assumes that there is only a single categorical output.

The underlying program is fairly general and can accept various user specified options. Here I will describe the options used when run as knn-class-1 in the DELVE environment.

Overview:

The K-NN algorithm is very simple. Given a set of labelled training examples, T, and an unseen test point, x, compute the squared distance in input space from x to each of the examples in T. Discard all, training cases except for the K cases which are closest to the test point. Assign the test point to the most numerous class amongst these K-NN. A probabilty distribution over classes is also easily constructed. The probabilty of class i (p_i) is simply the ratio:

p_i = n_i / sum_j { n_j}
where n_j is number of nearest neighbours in class j.

Loss functions.

Both zero-one and squared-probability loss are supported

Picking K.

A leave-one-out cross-validation scheme is employed to pick suitable values of K for a given training set. The zero-one loss is used to pick K for 'guesses', while the squared-probability loss is used to pick K for producing probability distribution predictions.

Breaking Ties:

When forced to guess a single class, ambiguities can occur when more than 1 class all have the largest number of Nearest Neighbours to the test point. These ties are broken by picking the class with the smallest summed squared distance to the winning point.

Avoiding numerical inconsistencies:

Results can become implementation dependent in the situation where the ... k-1, k, k+1, ... neighbours all have squared distances from the test point that are within epsilon of each other, where epsilon is the machine accuracy. There is then ambiguity as to which K neighbours are chosen which in turn will affect predictions. In attempt to avoid this undesirable situation, we detect it, and situate all ... k-1, k, k+1, .. points at the same distance d, where d is the average squared distance of these points. We also down-weight the count of these examples, so that the sum over all neighbours still yields a total count of K.

Running the program:

To get on-line help type :

unix> knnClass -h

Class coding: The program assumes that the class is coded as an integer. In Delve terminology this means that the class should be coded using the 0-up encoding (rather than the default, 1-of-n, for categorical variables).

As an aid to running under Delve, I use the script knn-class-1_sh, which is supplied with the program source. This script selects the options that were used to generate the results in this hierarchy.

Last Updated 5 November 1996
Comments and questions to: delve@cs.toronto.edu