K-Nearest Neighbour(KNN)

Published On: 29 April 2021.By Auriga User.

Digital Engineering
General

K-Nearest Neighbour(KNN)

What is KNN Algorithm ?

K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most similar instances and the data with the most similar instance is finally returned as the prediction.

How does k-Nearest Neighbors work?

K-nearest neighbors (KNN) algorithm uses the technique ‘feature similarity’ or ‘nearest neighbors’ to predict the cluster that a new data point fall into. Below are the few steps based on which we can understand the working of this algorithm better

Step 1 − For implementing any algorithm in Machine learning, we need a cleaned data set ready for modelling. Let’s assume that we already have a cleaned dataset which has been split into training and testing data set.

Step 2 − As we already have the data sets ready, we need to choose the value of K (integer) which tells us how many nearest data points we need to take into consideration to implement the algorithm.

Step 3 − This step is an iterative one and needs to be applied for each data point in the dataset.

I. Calculate the distance between test data and each row of training data using any of the distance metric

a. Euclidean distance– Euclidean distance is the square root of the sum of squared distance between two points.

b. Manhattan distance– Manhattan distance is the sum of the absolute values of the differences between two points.

c. Minkowski distance–We can calculate Minkowski distance only in a normed vector space, which means in a space where distances can be represented as a vector that has a length and the lengths cannot be negative.

d. Hamming distance.– Hamming distance is a metric for comparing two binary data strings. While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different. The Hamming distance method looks at the whole data and finds when data points are similar and dissimilar one to one. The Hamming distance gives the result of how many attributes were different.

II. We need to sort the data based on the distance metric that we have used in the above step.

III. Choose the top K rows in the transformed sorted data.

IV. Then it will assign a class to the test point based on most frequent class of these rows.

How to determine the K value?

We need to select an appropriate K value to in order to achieve the maximum accuracy of the model, but there are no pre-defined statistical methods to find the most favorable value of K. So we will use Elbow Method.

Elbow method starts with computing the Sum of Squared Error (SSE) for some values of k. The SSE is the sum of the squared distance between each member of the cluster and its centroid.

SSE=∑Ki=1∑x∈cidist(x,ci)2SSE=∑∑x∈cidist(x,ci)2

If you plot different values of k against the SSE, we can see that the error decreases as the value of k gets larger, this happens because when the number of clusters increases, the clusters will tend to become smaller, so distortion will also be smaller. The idea of the elbow method is to choose the k at which the SSE decreases suddenly signifying the shape of elbow.

In some cases, there are more than one elbow, or no elbow at all. In such cases we usually end up calculating the best k by evaluating how well k-means ML Algorithm performs in the context of the problem you are trying to solve.

Applications of KNN

Whether to sanction a loan or not? to a candidate.
Classifying given transaction is fraudulent or not.
Recommendation System (YouTube, Netflix)
Handwriting detection (like OCR).
Image recognition.
Video recognition.

Pros and Cons of KNN

Pros

Easy to use, understand and interpret.
Quick calculation time.
No assumptions about data.
High accuracy of predictions.
Versatile – Can be used for both Classification and Regression Business Problems.
Can be used for Multi Class Problems as well.
We have only one Hyper parameter to tweak at Hyperparameter Tuning step.

Cons

Computationally expensive and requires high memory as the algorithm stores all the training data.
The algorithm gets slower as the variables increase.
It is very Sensitive to irrelevant features.
Curse of Dimensionality.
Choosing the optimal value of K.
Class Imbalanced dataset will cause problem.
Missing values in the data also causes problem.