Classification with k-Nearest Neighbors

K-Nearest Neighbors is perhaps the most intuitive learning technique in Machine Learning. In many ways, the approach mirrors how we intuitively estimate or assess the value of products and services: we often look at comparable items to determine the value of the product or service of interest.

The K-Nearest Neighbors classification algorithm works in the same way. It uses vector-based distances to find the closest $K$ neighbors and assigns the target observation the majority class among those neighbors. The visualization below illustrates the problem statement and why KNN can be an effective solution.

KNN Sample Dataset

What is KNN?

K-Nearest Neighbors (KNN) is a non-parametric method that predicts the outcome for an input based on the k most similar observations. It is described as non-parametric because it does not learn a fixed set of model parameters the way linear regression does. This characteristic offers several advantages, such as eliminating the need to assume a specific distribution for the data. Its simplicity makes it easy to understand and implement, and it can capture complex relationships in the data, particularly with small values of k. KNN can be adapted for both classification and regression tasks with minimal modification.
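
As a minimal sketch of that last point, here is a toy function (the name knn_predict and its arguments are illustrative only, not part of this article's implementation) showing that classification and regression share the same neighbor search and differ only in how the neighbors' targets are aggregated:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)

    # Euclidean distance from the query point to every training observation
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    neighbor_idx = np.argsort(distances)[:k]   # indices of the k closest observations

    if task == "classification":
        # classification: majority vote over the neighbors' labels
        return Counter(y_train[neighbor_idx].tolist()).most_common(1)[0][0]
    # regression: average of the neighbors' numeric targets
    return float(y_train[neighbor_idx].mean())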

At the heart of the KNN method is the principle that observations with similar features are likely to yield similar target variables. Therefore, to predict the class or target variable of a query input, we simply need to identify the most similar observations. But how do we determine which observations are similar to the input?

Distance Metrics

Various distance metrics exist to measure the similarity between observations. Here we focus on two of the most prevalent: Euclidean distance and cosine similarity.

1. Euclidean Distance

Euclidean distance is the most widely used geometric distance metric, measuring the straight-line distance between two points in a vector space. Mathematically, it is defined as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x$ and $y$ are vectors.

In a 2-D vector space, the formula expands to:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

For instance, consider two points $A = (6.3, 4.1)$ and $B = (3.2, 1.7)$. The Euclidean distance between these points is calculated as follows:

$$d(A, B) = \sqrt{(6.3 - 3.2)^2 + (4.1 - 1.7)^2} = \sqrt{9.61 + 5.76} = \sqrt{15.37} \approx 3.92$$

Implementation in Python

import numpy as np

a = np.array([6.3, 4.1])
b = np.array([3.2, 1.7])

d = np.sqrt( np.sum( (a - b)**2 ) )
d
3.920459156782531
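
Equivalently, the same distance can be computed with NumPy's built-in norm function; a minimal sketch using the a and b arrays defined above:

# np.linalg.norm computes the Euclidean (L2) norm of the difference vector
np.linalg.norm(a - b)   # same result as above, approximately 3.9205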

2. Cosine Similarity

Cosine similarity measures how similar two vectors are in orientation. More precisely, it is the cosine of the angle between the two vectors.

Mathematically, it is represented as:

$$ d(a, b) = \frac{ a \cdot b }{ \|a\| \, \|b\| } $$

where:
$a \cdot b$ is the dot product of the two vectors
$\|a\|$ and $\|b\|$ are the Euclidean norms (magnitudes) of the vectors

The formula can be expanded with summation notation as follows:

$$ d(a, b) = \frac { \sum_{i=1}^n {a_i b_i} } { \sqrt { \sum_{i=1}^n {a_i^2}} \sqrt {\sum_{i=1}^n{b_i^2}} } $$

For example, using the vectors $a = (6.3, 4.1)$ and $b = (3.2, 1.7)$ from the Euclidean example above:

$$ d(a, b) = \frac{ (6.3 \times 3.2) + (4.1 \times 1.7) }{ \sqrt{6.3^2 + 4.1^2} \, \sqrt{3.2^2 + 1.7^2} } $$

$$ d(a, b) = \frac{27.13}{7.516 \times 3.623} $$

$$ d(a, b) = \frac{27.13}{27.2368} $$

$$ d(a, b) \approx 0.996 $$

np.dot(a, b)/(np.sqrt(np.dot(a, a)) * np.sqrt( np.dot(b, b) ))
0.9960776759242224
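
Note that cosine similarity is a similarity measure, not a distance: vectors pointing in the same direction score 1 and orthogonal vectors score 0. When a distance is needed, the cosine distance, defined as 1 minus the similarity, is commonly used. As a sketch (assuming SciPy is available), SciPy exposes this quantity directly:

from scipy.spatial import distance

# scipy's cosine() returns the cosine *distance*, i.e. 1 - cosine similarity
cosine_distance = distance.cosine(a, b)
print(cosine_distance, 1 - cosine_distance)   # 1 - distance recovers the similarity, approximately 0.996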

The K-Nearest Neighbors Algorithm

Let's delve into implementing the K-Nearest Neighbors (KNN) algorithm. Below, I'll outline the process step by step, bearing in mind that this implementation is object-oriented; the approach might vary slightly in a functional programming context.

Initialization

  • Initialize the features and labels of the known observations, along with the number of nearest neighbors k.
  • Prepare the unknown input vector, record the number of known observations, and create a dictionary to hold the k nearest neighbors.

Euclidean Distance Calculation

  • Define a method to calculate the Euclidean distance, which is used to determine the proximity between observations.

Core Algorithm

  • Iterate through each known observation and compute the distance between that observation and the unknown input vector.
  • If the nearest-neighbor dictionary currently holds fewer entries than k, add the observation and its distance to the dictionary.
  • Once the dictionary holds k entries:
      • Sort the dictionary by distance in ascending order.
      • If the newly calculated distance is smaller than the largest distance in the dictionary, remove the entry with the largest distance and insert the new observation and its distance.
      • Otherwise, disregard the new observation.

Result

  • Return the dictionary containing the k nearest neighbors, i.e. the observations with the smallest distances.

Implementation in Python

    Below is a sample object-oriented implementation of the KNN algorithm in Python.

    class KNearestNeighborClassifier:
       
        """ Implements the KNN Classifier Algorithm using Euclidean Distance """
    
        def __init__(self, features, labels, input_x, k_neighbors=3):
            # feature datasets
            self.features = features
            self.labels = labels 
    
            # values to be classified
            self.input_x = input_x
    
            # Number of observations
            self.n_obs = features.shape[1]
            
            # set number of neighbors
            self.k_neighbors = k_neighbors
            self.nearest_neighbors = {}
    
    
        def euclidean_distance(self, arr_x, arr_y):
            """ Computes the Euclidean Distance """
            return np.sqrt( np.sum( (arr_x - arr_y) ** 2 ) )
            
        
        def k_nearest_neighbor(self):
            for obs in range(self.n_obs):
                distance = self.euclidean_distance(self.input_x, self.features[:, obs])
                
                if len(self.nearest_neighbors) >= self.k_neighbors:
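                    # re-sort ascending by distance so the last entry is the current farthest neighbor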
                    self.nearest_neighbors = {k:v for k,v in sorted( self.nearest_neighbors.items(), key=lambda x:x[1]) }
                    if distance < list(self.nearest_neighbors.values())[-1]:
                        self.nearest_neighbors.popitem()
                        self.nearest_neighbors[obs] = distance
                    
                else:
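                    # fewer than k neighbors stored so far, so add this one unconditionally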
                    self.nearest_neighbors[obs] = distance
    
            return self.nearest_neighbors
        
    
        def predict_class(self):
            # compute nearest neighbors        
            self.k_nearest_neighbor()
           
            # retrieve predicted class based on votes/max counts of class of the neighbors
            predicted_labels = [ self.labels[index] for index in self.nearest_neighbors.keys()]
            majority_class = { label: predicted_labels.count(label) for label in set(predicted_labels) }
            for k, v in majority_class.items():
                if v == max(majority_class.values()):
                    return f"Majority Class: {k},  Class Votes: {majority_class}"
    

    Testing the Algorithm on Sample Data

    With the KNN class completed, I will test the implementation of the algorithm on some sample data.

    import pandas as pd 
    
    dataset = pd.read_csv('sample.txt')
    dataset.head()
              X         Y class
    0  1.206922  2.206818     A
    1  3.235168  1.338481     A
    2  1.778673  1.464483     A
    3 -0.725584  1.234910     A
    4  0.329946  0.033291     A

    Visualizing the dataset

    A visualization will best illustrate the dataset and the observations for which we are attempting to make predictions.

    import matplotlib.pyplot as plt
    
    %matplotlib inline
    %config InlineBackend.figure_format = 'retina'
    
    fig = plt.figure(figsize=(7,5), dpi=250)
    plt.scatter(dataset[dataset['class']=='A' ]['X'], dataset[dataset['class'] == 'A']['Y'], label='Class A ')
    plt.scatter(dataset[dataset['class']=='B' ]['X'], dataset[dataset['class'] == 'B']['Y'], label='Class B ')
    
    
    plt.scatter( np.array([0.5, -0.3, 0,]), np.array([0.5, -1, 1.5]), label='?', marker='h', linewidths=3, color='seagreen')
    plt.annotate('x', (0.5, 0.5 ), fontsize=8, ha='center' )
    plt.annotate('y', (-0.3, -1), fontsize=8, ha='center' )
    plt.annotate('z', (0, 1.5), fontsize=8, ha='center' )
    
    
    plt.legend(loc='upper left')
    plt.title('Distribution of Class A & B')

    Implementing K Nearest Neighbor Classification

    Executing kNN Classifier

    To run the kNN algorithm, initialize the three unknown data points as x_inputs. Also initialize the features and labels.

    # new points
    x_inputs = np.array([ [0.5, 0.5], [-0.3, -1], [0, 1.5] ])
    
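    # arrange the features with one column per observation (shape (2, n)), as the classifier indexes columns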
    features = np.array([ dataset.X, dataset.Y  ])
    labels = dataset['class']

    Generating Predictions

    The code below loops through the new points, initializes a KNearestNeighborClassifier for each one, computes its nearest neighbors, and prints the classification outcome.

    # Iterating through the new points for classification
    for x_input in x_inputs:
        print(x_input)
        knn_model = KNearestNeighborClassifier( features=features, labels=labels, input_x=x_input, k_neighbors=3)
        neighbors = knn_model.k_nearest_neighbor()
        print("The nearest neighbors are: ", neighbors)
        print(knn_model.predict_class(), '\n')
    
    [0.5 0.5]
    The nearest neighbors are: {18: 0.1145525494040436, 57: 0.1761308268234265, 42: 0.40349328133650497}
    Majority Class: B, Class Votes: {'B': 2, 'A': 1}

    [-0.3 -1. ]
    The nearest neighbors are: {97: 0.20728789054686103, 74: 0.3003302017567334, 71: 0.3012412820871216}
    Majority Class: B, Class Votes: {'B': 3}

    [0. 1.5]
    The nearest neighbors are: {13: 0.3388498817044854, 47: 0.3745684541413127, 26: 0.522112611211174}
    Majority Class: A, Class Votes: {'B': 1, 'A': 2}
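
    As a sanity check, the same predictions can be reproduced with scikit-learn's KNeighborsClassifier (a sketch, assuming scikit-learn is installed and reusing the features, labels, and x_inputs defined above). Note that scikit-learn expects one row per observation, so the (2, n) feature array is transposed before fitting:

    from sklearn.neighbors import KNeighborsClassifier

    # fit a 3-nearest-neighbor classifier on the same data (rows = observations)
    sk_knn = KNeighborsClassifier(n_neighbors=3)
    sk_knn.fit(features.T, labels)

    # should agree with the majority classes reported above: B, B, A
    print(sk_knn.predict(x_inputs))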