K-Nearest Neighbors (KNN)

The “Tell me who your friends are, and I’ll tell you who you are” Algorithm

1. The Analogy: The Friendly Neighborhood

Imagine you move into a new house, and you don’t know if the neighborhood is a “Party Neighborhood” or a “Quiet Neighborhood.”

How do you figure it out?

You look at your 3 nearest neighbors:

  • Neighbor 1: Throws parties every Friday. (Party)
  • Neighbor 2: Throws parties every Saturday. (Party)
  • Neighbor 3: Reads books silently at 8 PM. (Quiet)

Result: Since 2 out of 3 neighbors are “Party” people, you assume you are in a Party Neighborhood.

This is exactly how KNN works. It doesn’t learn a complex rule (like a decision tree). It simply memorizes the data and looks at “who is closest” when asked to make a decision.

2. Interactive Visualizer

Instructions: Click anywhere on the white canvas below to place a Green Query Point. The algorithm will find the nearest neighbors and classify the point as Red or Blue.

3
Click to classify…

3. How It Works (Step-by-Step)

  1. Choose K: Select the number of neighbors (e.g., K = 5).
  2. Calculate Distance: For a new data point, calculate the distance to every other point in the dataset.
  3. Sort: Sort the distances from smallest to largest.
  4. Pick Top K: Select the first K entries from the sorted list.
  5. Vote (Classification) or Average (Regression):
    • Classification: The majority class wins (e.g., 3 Red, 2 Blue → Red).
    • Regression: Take the average value of the K neighbors.

4. Mathematics: Measuring Distance

How do we define “Close”? There are two main ways:

Euclidean Distance

The “Crow flies” distance. A straight line between two points.

√((x₂ – x₁)² + (y₂ – y₁)²)

Use when: You have continuous data (like height, weight).

Manhattan Distance

The “Taxi Cab” distance. You can only move along grid lines (blocks).

|x₂ – x₁| + |y₂ – y₁|

Use when: You have high-dimensional data or grid-like structures.

5. KNN in the ML Space

Is it Supervised or Unsupervised?

KNN is primarily a Supervised Learning algorithm because it relies on labeled data (we need to know if the neighbors are Red or Blue to make a prediction).


Special Case: KNN for Anomaly Detection (Unsupervised)

While usually supervised, KNN is powerful for finding outliers/anomalies.

The Logic: If a data point’s nearest neighbors are very far away, that point is lonely. It doesn’t fit in. It is an anomaly.

How to detect anomalies:

  1. Calculate the average distance to the K nearest neighbors for every point.
  2. Points with the largest average distance are flagged as anomalies.
  3. Used in: Fraud detection, network intrusion, and manufacturing defect detection.

6. Performance & Metrics

Pros Cons
Simple to understand and implement. Slow with large datasets (Lazy Learner).
No training period (instant learning). Sensitive to Outliers.
Handles multi-class cases easily. Struggles with high dimensions (Curse of Dimensionality).

Evaluation Metrics

Since KNN is usually a classifier, we use standard confusion matrix metrics:

  • Accuracy: (Correct Predictions) / (Total Predictions).
  • Precision: “Of all the points I called Red, how many were actually Red?”
  • Recall: “Of all the actual Red points, how many did I find?”

Leave a Reply

Your email address will not be published. Required fields are marked *