Background of KNN

The k-nearest neighbors (KNN) algorithm stores the records of a training data set and uses them later to make predictions for new records. For classification, it uses a majority voting mechanism.

For each new record, the k closest records of the training data set are determined. Based on the values of the target attribute of these closest records, a prediction is made for the new record.

The basic nearest neighbor (NN) algorithm makes classification predictions or regression predictions for an arbitrary instance. For this purpose, the NN algorithm identifies the training instance that is closest to the arbitrary instance. Then, the NN algorithm returns the class label or target function value of that training instance as the predicted class label or target function value for the arbitrary instance.
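The NN step described above can be sketched in a few lines of Python. This is only an illustration: the function name `nearest_neighbor`, the sample data, and the use of Euclidean distance are assumptions, because the text does not name a distance measure.

```python
import math

def nearest_neighbor(train, query):
    """Return the target of the training instance closest to `query`.

    `train` is a list of (features, target) pairs, where `features` is a
    tuple of numbers. Euclidean distance is an assumption of this sketch.
    """
    _, closest_target = min(
        train, key=lambda pair: math.dist(pair[0], query)
    )
    return closest_target

# Hypothetical training data: two numeric features and a class label.
train = [((1.0, 1.0), "red"), ((5.0, 5.0), "blue"), ((1.5, 2.0), "red")]

print(nearest_neighbor(train, (4.0, 4.5)))  # blue
```

The same function also covers the regression case: if the targets are numbers instead of class labels, the returned value is the target function value of the closest training instance.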

The KNN algorithm expands this process by using a specified number k≥1 of the closest training instances instead of using only one instance. Typical values of k range from 1 to several dozen.

The output depends on whether you use the KNN algorithm for classification or regression.

  • In KNN classification, the predicted class label is determined by a vote among the nearest neighbors; that is, the majority class label among the k selected instances is returned.
  • In KNN regression, the average value of the target function values of the nearest neighbors is returned as the predicted value.
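The two output modes can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the sample data, and the Euclidean distance are assumptions. Ties in the vote are resolved in favor of the class whose member appears first in the distance ranking, which is a choice of this sketch only.

```python
import math
from collections import Counter

def k_nearest_targets(train, query, k):
    """Return the targets of the k training instances closest to `query`."""
    # Euclidean distance is an assumption; the text does not name a metric.
    ranked = sorted(train, key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train, query, k):
    """Return the majority class label among the k nearest neighbors."""
    votes = Counter(k_nearest_targets(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Return the average target value of the k nearest neighbors."""
    targets = k_nearest_targets(train, query, k)
    return sum(targets) / len(targets)

# Hypothetical labeled data for classification.
labeled = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((5.0, 5.0), "blue"), ((6.0, 5.5), "blue"),
]
print(knn_classify(labeled, (2.5, 2.0), k=3))  # red

# Hypothetical numeric data for regression.
numeric = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 14.0), ((10.0,), 50.0)]
print(knn_regress(numeric, (2.2,), k=3))  # 12.0
```

Both functions share the neighbor search and differ only in how the k target values are combined: a vote for classification, an average for regression.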

By choosing the number k≥1, you control the tradeoff between overfitting prevention and resolution. A larger k bases each prediction on more neighbors, which helps to prevent overfitting and might be important for noisy data. A smaller k preserves resolution, which might be important to get different predictions for similar instances.
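The effect of k can be illustrated with a small one-dimensional example. Everything here is hypothetical: the function name `knn_classify_1d`, the data, and the deliberately mislabeled point at 2.5 are assumptions made for this sketch.

```python
from collections import Counter

def knn_classify_1d(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Class "a" occupies the low values and class "b" the high values.
# The point at 2.5 is a noisy record with a label that does not fit.
train = [
    (1.0, "a"), (2.0, "a"), (3.0, "a"), (4.0, "a"), (5.0, "a"),
    (2.5, "b"),
    (7.0, "b"), (8.0, "b"), (9.0, "b"),
]

for k in (1, 5, 9):
    print(k, knn_classify_1d(train, 2.6, k), knn_classify_1d(train, 8.0, k))
# 1 b b
# 5 a b
# 9 a a
```

With k=1, the prediction for 2.6 follows the noisy record. With k=5, the noisy record is outvoted and both queries receive the label of their region. With k=9, every record votes, so every query receives the overall majority label and the two regions can no longer be distinguished.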