
Distance Metrics in Hierarchical Clustering

Distance Metrics

Distance Metrics are essential in hierarchical clustering. They determine how the similarity between data points is measured, directly impacting the formation and quality of clusters. In this tutorial, we will explore various distance metrics, including Euclidean, Manhattan, and Cosine Distance. We will demonstrate their implementation in Python and analyze their effects on hierarchical clustering outcomes.

When working with hierarchical clustering, distance metrics play a vital role in data analysis. They tell us how close or far apart data points are from each other, and the choice of metric directly shapes how clusters form and combine.

The most common distance measures include the straightforward Euclidean distance, which measures the direct path between points; the Manhattan distance, which calculates distances along grid-like paths; and the Cosine distance, which compares the angles between data points.

In practical applications, these metrics serve different purposes. For instance, Euclidean distance works well with physical measurements, while Cosine distance is better suited to text analysis. When implementing these metrics in Python, we can observe how each method affects the clustering results differently.

Most importantly, the right distance metric depends on your data type and clustering goals, so understanding these measures is crucial for successful hierarchical clustering analysis. By carefully selecting and applying the appropriate metric, you can achieve more meaningful and accurate clustering results.

Introduction to Distance Metrics

Distance Metrics quantify the similarity or dissimilarity between data points. In hierarchical clustering, selecting the appropriate distance metric influences how clusters form and relate to each other. Common metrics include Euclidean Distance, Manhattan Distance, and Cosine Distance. Each offers unique advantages depending on the dataset and the problem at hand.

Euclidean Distance

Euclidean Distance is the straight-line distance between two points in Euclidean space. It is widely used due to its simplicity and effectiveness in measuring the actual geometric distance between points.

Mathematical Definition

The Euclidean distance between two points $ \mathbf{p} $ and $ \mathbf{q} $ is defined as:

$$
\text{distance} = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}
$$

Python Implementation

Here is how you can implement Euclidean Distance in Python:

import math

def euclidean_distance(point1, point2):
    return math.sqrt(sum((p1 - p2) ** 2 for p1, p2 in zip(point1, point2)))
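
As a quick sanity check, the classic 3-4-5 right triangle gives a straight-line distance of 5:

# Quick check: the 3-4-5 right triangle
print(euclidean_distance([0, 0], [3, 4]))  # 5.0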

Manhattan Distance

Manhattan Distance, also known as L1 distance, measures the distance between two points by summing the absolute differences of their coordinates. This metric is particularly useful in high-dimensional spaces.

Mathematical Definition

The Manhattan distance between two points $\mathbf{p}$ and $\mathbf{q}$ is defined as:

$$
\text{distance} = |p_1 - q_1| + |p_2 - q_2| + \dots + |p_n - q_n|
$$

Python Implementation

Below is the Python function to calculate Manhattan Distance:

def manhattan_distance(point1, point2):
    return sum(abs(p1 - p2) for p1, p2 in zip(point1, point2))
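
For the same pair of points, the Manhattan distance sums the absolute coordinate differences instead of taking the straight-line path:

# Grid path: |3 - 0| + |4 - 0| = 7
print(manhattan_distance([0, 0], [3, 4]))  # 7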

Cosine Distance

Cosine Distance measures the cosine of the angle between two vectors, providing a measure of orientation rather than magnitude. It is particularly useful in text analysis and high-dimensional data.

Mathematical Definition

The Cosine distance between two points $\mathbf{p}$ and $\mathbf{q}$ is calculated as:

$$
\text{distance} = 1 - \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\| \, \|\mathbf{q}\|}
$$

Python Implementation

Here is how you can compute Cosine Distance in Python:

import numpy as np

def cosine_distance(point1, point2):
    return 1 - np.dot(point1, point2) / (np.linalg.norm(point1) * np.linalg.norm(point2))
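
As a quick illustration, two vectors pointing in the same direction have a cosine distance of 0, while orthogonal vectors have a distance of 1:

# Parallel vectors -> distance 0; orthogonal vectors -> distance 1
print(cosine_distance([1, 1], [2, 2]))  # approximately 0.0 (floating-point error aside)
print(cosine_distance([1, 0], [0, 1]))  # 1.0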

Implementing Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters by either merging or splitting them successively. The choice of distance metric significantly affects how clusters form and their final structure.

Agglomerative Clustering Algorithm

Agglomerative clustering is a bottom-up approach where each data point starts in its own cluster, and pairs of clusters merge as you move up the hierarchy. The distance metric determines the criteria for merging clusters.

Python Implementation

Below is a Python implementation of agglomerative hierarchical clustering that uses average linkage, meaning clusters are merged based on the mean pairwise distance between their members:

import numpy as np

def calculate_distance_matrix(X, clusters, distance_metric):
    # Average-linkage distance: mean pairwise distance between the members of two clusters
    dist_matrix = np.zeros((len(clusters), len(clusters)))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            dists = [distance_metric(X[k], X[l]) for k in clusters[i] for l in clusters[j]]
            dist_matrix[i, j] = dist_matrix[j, i] = np.mean(dists)
    return dist_matrix

def agglomerative_clustering(X, n_clusters, distance_metric):
    # Start with every point in its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance
        dist_matrix = calculate_distance_matrix(X, clusters, distance_metric)
        min_dist = float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if dist_matrix[i, j] < min_dist:
                    min_dist = dist_matrix[i, j]
                    idx1, idx2 = i, j
        # Merge the closest pair and remove the absorbed cluster
        clusters[idx1].extend(clusters[idx2])
        clusters.pop(idx2)
    # Assign a label to every point based on its final cluster
    labels = np.empty(len(X), dtype=int)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels
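
As a minimal usage sketch, consider two well-separated groups of 2D points (the toy array below is made up purely for illustration):

# Toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels_toy = agglomerative_clustering(X_toy, n_clusters=2, distance_metric=euclidean_distance)
print(labels_toy)  # e.g. [0 0 0 1 1 1]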

Impact of Distance Metrics on Clustering

Different distance metrics can lead to varying clustering results. Understanding their impact helps in selecting the right metric for your specific dataset and analysis goals.

Dataset Preparation

We will use the Iris dataset for demonstration; it is a widely used benchmark in machine learning for classification and clustering tasks.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
dataset = load_iris().data

# Scale the dataset with StandardScaler
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
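
As an optional sanity check, you can confirm the shape of the scaled data and that each feature now has roughly zero mean and unit variance:

print(dataset.shape)                  # (150, 4)
print(dataset.mean(axis=0).round(2))  # approximately [0. 0. 0. 0.]
print(dataset.std(axis=0).round(2))   # approximately [1. 1. 1. 1.]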

Performing Agglomerative Clustering

We will apply agglomerative clustering using different distance metrics to observe their effects on the clustering outcome.

# Perform Agglomerative Clustering
n_clusters = 3

# Euclidean Distance
labels_euc = agglomerative_clustering(dataset, n_clusters, euclidean_distance)

# Manhattan Distance
labels_man = agglomerative_clustering(dataset, n_clusters, manhattan_distance)

# Cosine Distance
labels_cos = agglomerative_clustering(dataset, n_clusters, cosine_distance)

Visualizing the Results

Visualizing the clustering results helps in comparing how different distance metrics influence the formation of clusters. Note that because the features were standardized, the axes show scaled sepal length and width rather than the raw measurements.

# Plot the results in 3 subplots for each distance metric
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Euclidean Distance
axs[0].scatter(dataset[:, 0], dataset[:, 1], c=labels_euc, cmap='viridis')
axs[0].set_title('Euclidean Distance')
axs[0].set_xlabel('Sepal Length')
axs[0].set_ylabel('Sepal Width')

# Manhattan Distance
axs[1].scatter(dataset[:, 0], dataset[:, 1], c=labels_man, cmap='viridis')
axs[1].set_title('Manhattan Distance')
axs[1].set_xlabel('Sepal Length')
axs[1].set_ylabel('Sepal Width')

# Cosine Distance
axs[2].scatter(dataset[:, 0], dataset[:, 1], c=labels_cos, cmap='viridis')
axs[2].set_title('Cosine Distance')
axs[2].set_xlabel('Sepal Length')
axs[2].set_ylabel('Sepal Width')

plt.show()
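
Beyond visual inspection, one optional way to quantify the differences is to compare each labelling against the known Iris species using the adjusted Rand index (a score of 1 means perfect agreement); this snippet assumes the ground-truth labels from load_iris().target:

from sklearn.metrics import adjusted_rand_score

# Compare each clustering against the true Iris species labels
true_labels = load_iris().target
for name, labels in [('Euclidean', labels_euc), ('Manhattan', labels_man), ('Cosine', labels_cos)]:
    print(name, adjusted_rand_score(true_labels, labels))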

Using Scikit-Learn for Hierarchical Clustering

Scikit-Learn provides a convenient implementation of agglomerative clustering, allowing you to specify different distance metrics easily.

from sklearn.cluster import AgglomerativeClustering

# Agglomerative Clustering using sklearn with Euclidean distance.
# Note: in scikit-learn >= 1.2 the distance parameter is called `metric` (formerly `affinity`).
model_euc = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='average')
labels_euc = model_euc.fit_predict(dataset)

# Agglomerative Clustering using sklearn with Manhattan distance.
model_man = AgglomerativeClustering(n_clusters=n_clusters, metric='manhattan', linkage='average')
labels_man = model_man.fit_predict(dataset)

# Agglomerative Clustering using sklearn with Cosine distance.
model_cos = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine', linkage='average')
labels_cos = model_cos.fit_predict(dataset)

Conclusion

Choosing the right distance metric is fundamental in hierarchical clustering as it directly affects the formation and quality of clusters. By understanding and implementing metrics like Euclidean, Manhattan, and Cosine Distance, you can tailor your clustering approach to better suit your data and analysis objectives. For more detailed information on hierarchical clustering techniques, visit the Scikit-Learn Clustering Documentation.

Practice Exercises

To solidify your understanding, try implementing hierarchical clustering with additional distance metrics (one example, the Chebyshev distance, is sketched below) and on different datasets. Experiment with adjusting the number of clusters and observe how the results change.
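
As a starting point, here is a minimal sketch of one more metric you could plug into the same agglomerative_clustering function: the Chebyshev (L-infinity) distance, which takes the largest absolute coordinate difference. The helper name chebyshev_distance is simply our own choice:

def chebyshev_distance(point1, point2):
    # Chebyshev (L-infinity) distance: the largest coordinate-wise difference
    return max(abs(p1 - p2) for p1, p2 in zip(point1, point2))

# Plug it into the same clustering routine used above
labels_cheb = agglomerative_clustering(dataset, n_clusters, chebyshev_distance)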


