K-Nearest Neighbors (KNN)
The “Tell me who your friends are, and I’ll tell you who you are” Algorithm
1. The Analogy: The Friendly Neighborhood
Imagine you move into a new house, and you don’t know if the neighborhood is a “Party Neighborhood” or a “Quiet Neighborhood.”
You look at your 3 nearest neighbors:
- Neighbor 1: Throws parties every Friday. (Party)
- Neighbor 2: Throws parties every Saturday. (Party)
- Neighbor 3: Reads books silently at 8 PM. (Quiet)
Result: Since 2 out of 3 neighbors are “Party” people, you assume you are in a Party Neighborhood.
This is exactly how KNN works. It doesn’t learn a complex rule (like a decision tree). It simply memorizes the data and looks at “who is closest” when asked to make a decision.
2. Interactive Visualizer
Instructions: Click anywhere on the white canvas below to place a Green Query Point. The algorithm will find the nearest neighbors and classify the point as Red or Blue.
3. How It Works (Step-by-Step)
- Choose K: Select the number of neighbors (e.g., K = 5).
- Calculate Distance: For a new data point, calculate the distance to every other point in the dataset.
- Sort: Sort the distances from smallest to largest.
- Pick Top K: Select the first K entries from the sorted list.
- Vote (Classification) or Average (Regression):
- Classification: The majority class wins (e.g., 3 Red, 2 Blue → Red).
- Regression: Take the average value of the K neighbors.
4. Mathematics: Measuring Distance
How do we define “Close”? There are two main ways:
Euclidean Distance
The “Crow flies” distance. A straight line between two points.
Use when: You have continuous data (like height, weight).
Manhattan Distance
The “Taxi Cab” distance. You can only move along grid lines (blocks).
Use when: You have high-dimensional data or grid-like structures.
5. KNN in the ML Space
Is it Supervised or Unsupervised?
KNN is primarily a Supervised Learning algorithm because it relies on labeled data (we need to know if the neighbors are Red or Blue to make a prediction).
Special Case: KNN for Anomaly Detection (Unsupervised)
While usually supervised, KNN is powerful for finding outliers/anomalies.
How to detect anomalies:
- Calculate the average distance to the K nearest neighbors for every point.
- Points with the largest average distance are flagged as anomalies.
- Used in: Fraud detection, network intrusion, and manufacturing defect detection.
6. Performance & Metrics
| Pros | Cons |
|---|---|
| Simple to understand and implement. | Slow with large datasets (Lazy Learner). |
| No training period (instant learning). | Sensitive to Outliers. |
| Handles multi-class cases easily. | Struggles with high dimensions (Curse of Dimensionality). |
Evaluation Metrics
Since KNN is usually a classifier, we use standard confusion matrix metrics:
- Accuracy: (Correct Predictions) / (Total Predictions).
- Precision: “Of all the points I called Red, how many were actually Red?”
- Recall: “Of all the actual Red points, how many did I find?”