KNN is a non-parametric learning algorithm. It is mainly used for classification, but it can also be used for regression by averaging the target values of the k nearest neighbors (optionally weighted by distance).
Process
- Choose a value for k and a distance metric (Euclidean, Manhattan, cosine, etc.)
- Find the k-nearest neighbors of the data record that we want to classify
- Assign the class label by majority vote
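The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation; the function name `knn_predict` and the toy data are my own for demonstration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Step 1-2: Euclidean distances from the query to every training record,
    # then the indices of the k closest records
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote over the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # → 0
print(knn_predict(X, y, np.array([5.1, 5.0]), k=3))  # → 1
```

Note that no model is fit anywhere: all the work happens at prediction time, which is what makes KNN a "lazy" learner.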
The right choice of k is crucial to finding a good balance between overfitting and underfitting.
Effect of k
As k increases, variance decreases while bias increases. Conversely, as k decreases, variance increases while bias decreases.
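This trade-off can be seen on a tiny 1-D example with one mislabeled point. The data and the helper `knn_label` are illustrative assumptions of mine: with k=1 the model memorizes the noisy label (high variance), while a larger k smooths it away toward the local majority (higher bias, lower variance):

```python
import numpy as np
from collections import Counter

def knn_label(X, y, x, k):
    # Majority label among the k points of X closest to x
    nearest = np.argsort(np.abs(X - x))[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# 1-D training set where the label at x=0.5 is noise
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5])
y = np.array([0,   0,    1,   0,    1,   1,    1  ])

# k=1 reproduces the noisy label exactly (overfitting) ...
print(knn_label(X, y, 0.5, k=1))  # → 1
# ... while k=5 votes it down using the surrounding points
print(knn_label(X, y, 0.5, k=5))  # → 0
```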
Key Points
- It is a memory-based (lazy) approach: there is no training phase, so the model adapts immediately as new training data is collected
- Computational complexity for classifying new examples grows linearly with the number of training examples in the worst case (unless an index structure such as a k-d tree is used)
- KNN is very susceptible to overfitting due to the curse of dimensionality, and regularization cannot be applied since there is no explicit training step or loss function to penalize
- All features should be scaled, since distances are computed directly over feature values and features on larger scales would otherwise dominate
- Performance can be improved through dimensionality reduction using methods like PCA, LDA etc.
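The feature-scaling point can be made concrete with a small sketch, assuming NumPy; the income/age records and the `nearest` helper are hypothetical. Without scaling, the income axis (thousands) swamps the age axis (tens), so the nearest neighbor is chosen almost entirely by income; standardizing each feature to zero mean and unit variance lets both features contribute:

```python
import numpy as np

# Records: [annual income, age] — two features on very different scales
X = np.array([[30_000.0, 25.0],   # record 0: the query
              [31_000.0, 27.0],   # record 1: genuinely similar person
              [30_100.0, 60.0],   # record 2: close income, very different age
              [80_000.0, 40.0],
              [45_000.0, 50.0]])

def nearest(Xm, i):
    """Index of the nearest neighbor of record i under Euclidean distance."""
    d = np.linalg.norm(Xm - Xm[i], axis=1)
    d[i] = np.inf  # exclude the record itself
    return int(np.argmin(d))

# Unscaled: income dominates, so the 60-year-old with similar income "wins"
print(nearest(X, 0))  # → 2

# Standardized per feature: the genuinely similar person is now closest
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(nearest(Xs, 0))  # → 1
```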