Hands on k-Nearest Neighbour Algorithm - (k-NN)


From this post onward, I will be explaining data mining and machine learning algorithms that I have applied to industrial solutions during my professional career as a data scientist. I am kicking off this initiative with a simple machine learning algorithm called “k-Nearest Neighbour”.
The k-NN algorithm is categorized as an instance-based algorithm and is typically used for classification, which makes it a supervised learning algorithm.

What is the k-NN Algorithm?


As the name describes, when the algorithm needs to predict the label of an unknown data point, it looks for the k nearest neighbours of that point and takes a majority vote among them.
It does not build a model for prediction, so no learning is required beyond storing the entire training data set. That means it makes predictions using the training data directly, assigning a class label to an unknown instance by searching through the entire set for the k most similar neighbours.
As an example, let’s consider a 2-D data space that has 5 data points labelled as triangles (green) and 8 data points labelled as circles (red). Suppose we need to find the label of an unknown data point (let’s show it as a blue square) given this distribution.





At a glance, you may be able to predict it as a triangle data point.
We can predict the unknown label in either of two ways:
  1. Consider the k nearest neighbours and take a majority vote among the candidates
  2. Consider a circular space of radius k and take a majority vote among the candidates
Consider k-nearest neighbors 

In this approach, the algorithm calculates the distance between the given data point and every data point in the training set, and considers the k nearest neighbours among them, as shown in the diagram (here k = 5).
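To make the brute-force search concrete, here is a minimal Python sketch of the idea, assuming Euclidean distance and a simple majority vote; the toy data set and the helper names (euclidean_distance, knn_predict) are illustrative assumptions, not the sample implementation linked at the end of this post.

import numpy as np
from collections import Counter

def euclidean_distance(x, xi):
    # Square root of the sum of squared differences over all attributes
    return np.sqrt(np.sum((x - xi) ** 2))

def knn_predict(X_train, y_train, x_new, k=5):
    # Distance from the unknown point to every training point,
    # then a majority vote among the k closest ones.
    distances = [euclidean_distance(x_new, xi) for xi in X_train]
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data set: 5 "triangle" points and 8 "circle" points
X_train = np.array([[1.0, 5.0], [1.5, 4.5], [2.0, 5.5], [2.5, 4.8], [1.8, 4.2],
                    [5.0, 1.0], [5.5, 1.8], [6.0, 1.2], [6.5, 2.0], [5.8, 0.8],
                    [6.2, 1.6], [5.2, 2.2], [6.8, 1.4]])
y_train = np.array(["triangle"] * 5 + ["circle"] * 8)

print(knn_predict(X_train, y_train, np.array([2.0, 4.6]), k=5))  # -> triangle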





Consider k radius circular space

Here the algorithm considers the data points that fall inside a circle of radius k around the unknown point and then takes a majority vote among those candidates.
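A minimal sketch of this radius-based variant, reusing euclidean_distance and the toy data set from the previous snippet; the radius value and the function name are illustrative assumptions.

def radius_predict(X_train, y_train, x_new, radius=1.5):
    # Collect every training point inside the circle of the given radius
    # around the unknown point, then take a majority vote among them.
    candidates = [label for xi, label in zip(X_train, y_train)
                  if euclidean_distance(x_new, xi) <= radius]
    if not candidates:
        return None  # nothing inside the circle; grow the radius or fall back to k-NN
    return Counter(candidates).most_common(1)[0][0]

print(radius_predict(X_train, y_train, np.array([2.0, 4.6])))  # -> triangle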



Efficiency & Accuracy Improvements 


Since the algorithm scans through all training examples in order to generate a result, the prediction time grows with the size of the data set. 
Suppose we have n examples, each of dimension d. Computing the distances from the query point to all examples takes O(nd) time, and selecting the k closest among them takes roughly another O(nk), so each prediction costs about O(nd + nk). For example, with n = 1,000,000 training points of dimension d = 100, a single prediction already needs on the order of 10^8 operations.
Several approaches have therefore been taken to improve the efficiency as well as the accuracy of the algorithm.

Efficiency - 
  • k-d trees
A k-d tree is a data structure for organizing points in a space with k dimensions. It is a binary search tree with additional constraints imposed on it. k-d trees are very useful for range and nearest-neighbour searches: they partition the space recursively, so the search for the nearest neighbours only visits regions close to the test point (see the sketch after this list).
  • Learning Vector Quantization (LVQ)
An artificial neural network algorithm that computes representative “codebook” vectors for a given training set; it reduces the sample size by using these codebook vectors, instead of the full data set, as the training data for k-NN.
  •  Reducing the dimension size
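As a rough illustration of the k-d tree idea, the sketch below uses SciPy’s cKDTree (assuming SciPy is available; the data and labels are randomly generated for demonstration). The tree is built once, and each query then only explores branches close to the test point instead of scanning all n training examples.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))            # 10,000 training points in 3 dimensions
y_train = rng.integers(0, 2, size=10_000)    # two class labels

tree = cKDTree(X_train)                       # built once, roughly O(n log n)
dist, idx = tree.query([0.5, 0.5, 0.5], k=5)  # distances and indices of the 5 nearest points
print(Counter(y_train[i] for i in idx).most_common(1)[0][0])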
Accuracy - 
  • Use a proper distance calculation 
For demonstration purposes I have used the most popular distance measure, Euclidean distance. It is calculated as follows:
Euclidean_Distance(x, xi) = sqrt( sum_j ( (xj - xij)^2 ) )
Where, 
x  : the new data point 
xi : an existing training point
j  : the index that runs over the input attributes

Depending on the properties of the data set, this algorithm can also be implemented with other popular distance measures such as Hamming distance, Manhattan distance, Minkowski distance, Tanimoto distance, the Jaccard index, cosine similarity and kernel-based methods. If it is difficult to choose the right distance calculation for your data, you can try different distance metrics with different k values and select the most accurate combination (see the sketch after this list). 
  • Distance-weighted voting
Weight the contribution of each of the k neighbours by its distance to the given instance x, so that a neighbour closer to x receives a greater weight w (for example, w = 1 / d(x, xi)^2).
  • Feature-weighted algorithm
Some features (dimensions) may be much more discriminative than others, but Euclidean distance treats every feature as equally important, which can reduce the accuracy of the predictions. It is therefore good to scale each feature by its importance for classification.
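Putting these accuracy ideas together, here is a short sketch using scikit-learn (an assumption; it is not the sample implementation linked below). It scales every feature, then searches over several distance metrics, values of k and uniform vs. distance-weighted voting, keeping the most accurate combination; the Iris data set is used only as a stand-in.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale every feature so that no single attribute dominates the distance,
# then search over the metric, the number of neighbours k and the voting scheme.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {
    "knn__metric": ["euclidean", "manhattan", "minkowski"],
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],  # "distance" gives closer neighbours a larger vote
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))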

The sample implementation of k-Nearest Neighbor Algorithm - Git .


Note : 
Since the k-NN algorithm stores and predicts from the entire training data set, you may want to check the consistency of your training data; noisy or mislabelled points will directly affect the predictions. 
