
K-means Clustering: Tutorial on Clusters and Centroids


K-means Clustering remains a vital technique when you want to group data points in a meaningful way, and this tutorial will guide you through selecting the number of clusters and initializing centroids in your projects. In particular, K-means Clustering in Python helps you solve tasks like segmenting customers or uncovering hidden patterns. Pay close attention to the number of clusters and to how you initialize each centroid, because both factors strongly affect your final clustering outcome.

K-means Methods and the Importance of Proper Cluster Selection

You will find that choosing the correct number of clusters in K-means Clustering can feel tricky. Nevertheless, taking systematic steps will help. First, you might create an “elbow plot” that shows how the within-cluster sum of squares (WCSS) changes as you increase the number of clusters. This approach, often called the “Elbow Method,” offers insight into where to draw the line for K, the number of clusters. Meanwhile, some data scientists also use metrics like the “Silhouette Score” to make more precise decisions. In every case, remember that a poor choice of K can lead to misleading groups or unexpected results.
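
To make that concrete, here is a minimal sketch of an elbow plot with scikit-learn. It assumes a 2D NumPy array named data, like the one we build later in this tutorial, and reads the WCSS from each fitted model’s inertia_ attribute.

# Minimal sketch of the Elbow Method (assumes `data` is a 2D NumPy array)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
k_values = range(1, 7)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(data)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this K

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Plot")
plt.show()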

Centroid Initialization in K-means Clustering and Its Subtleties

Centroid initialization stands at the core of how K-means Clustering operates. You typically choose initial centroid positions either randomly or with smarter algorithms. For instance, the K-means++ method selects your first centroid at random and then chooses each subsequent centroid with probability weighted toward points far from the centroids already chosen. This strategy lowers the chance of sub-optimal results caused by unlucky random starts. Centroid initialization deeply influences the final clusters, so you need to take time to set it up properly or use robust algorithms built into libraries like scikit-learn.
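
The seeding idea behind K-means++ is simple enough to sketch by hand. The toy function below is not scikit-learn’s implementation (the library’s default init="k-means++" already does this for you); it only illustrates the distance-weighted selection of new centroids.

# Toy sketch of k-means++ seeding -- illustration only, not scikit-learn's code
import numpy as np

def kmeans_pp_init(points, k, seed=42):
    rng = np.random.default_rng(seed)
    # Pick the first centroid uniformly at random.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Choose the next centroid with probability proportional to d2.
        next_idx = rng.choice(len(points), p=d2 / d2.sum())
        centroids.append(points[next_idx])
    return np.array(centroids)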

Benefits of Python Libraries for K-means Cluster Assignments

Python libraries, such as scikit-learn, provide powerful interfaces for K-means Clustering. You can adjust hyperparameters like the number of clusters (K), the initialization method, and the number of times to repeat the clustering. These libraries handle the math behind the scenes while giving you flexible control. At times, you also want to see how cluster assignment changes when you vary K or tweak centroid initialization. In those cases, scikit-learn remains helpful, because it lets you run new experiments quickly with different setups. If you want to learn more about scikit-learn’s K-means, you can visit scikit-learn K-means Documentation for detailed information.
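
As a quick, purely illustrative sketch of those knobs (the parameter values here are arbitrary choices, not recommendations):

# Illustrative hyperparameters: 3 clusters, k-means++ seeding, 10 restarts
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300, random_state=0)
# After model.fit(X), the results live in model.labels_ and model.cluster_centers_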

Step-by-Step Guide: Initial Centroids and Choosing the Number of Clusters

In this section, I will show you a concrete example of how you might handle K-means Clustering in Python. You will see how you can choose K, define initial centroids, and compare results. I will use scikit-learn for simplicity, but you can adapt the concepts to other libraries. Let’s get started by importing the necessary libraries and creating a dataset.

# We begin by importing essential libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


Here, I imported NumPy for handling arrays, Matplotlib for plotting graphs, and KMeans from scikit-learn for implementing the clustering algorithm. I find these libraries straightforward to work with; they make K-means clustering both efficient and clear to read.

Data Generation and Visualization

Next, I will form a small dataset to illustrate how cluster selection and centroid initialization look in practice. You should keep in mind that real datasets usually require more preprocessing, but the logic remains the same, regardless of size or complexity.

# Let's create a simple 2D dataset
np.random.seed(42)
data = np.array([
    [3, 1], [5, 1], [2, 3],
    [8, 2], [9, 3], [7, 1],
    [15, 15], [13, 16], [14, 14]
])

# Quick visualization
plt.scatter(data[:, 0], data[:, 1], c='blue')
plt.title("Visualization of Synthetic Dataset")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()


I kept the dataset small so that you can easily see how the groups will change. Notice that the first few points cluster around the lower-left side, while the last three points spread in the top-right space. This difference will matter for our K-means Clustering and centroid initialization steps.

Clustering with K=2 vs. K=3

I will now apply K-means Clustering with two different values of K. I will start by choosing two clusters (K=2). Then I will repeat the process with three clusters (K=3). In practice, you might run these experiments multiple times to see how your dataset behaves with different cluster counts.

# K = 2
kmeans_2 = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans_2.fit(data)
labels2 = kmeans_2.labels_
centroids2 = kmeans_2.cluster_centers_

# K = 3
kmeans_3 = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_3.fit(data)
labels3 = kmeans_3.labels_
centroids3 = kmeans_3.cluster_centers_

# Side-by-side visualization
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

axs[0].scatter(data[:, 0], data[:, 1], c=labels2, cmap='viridis')
axs[0].scatter(centroids2[:, 0], centroids2[:, 1], c='red', marker='X')
axs[0].set_title("K-means with K=2")

axs[1].scatter(data[:, 0], data[:, 1], c=labels3, cmap='viridis')
axs[1].scatter(centroids3[:, 0], centroids3[:, 1], c='red', marker='X')
axs[1].set_title("K-means with K=3")

plt.show()


On the left, you will notice that K=2 clusters the first six points together (lower-left) and the last three together (top-right). On the right, setting K=3 divides the first six points into two smaller groups while the last three remain in their own cluster. You can see how the number of clusters greatly changes the results. In real-world tasks, you might rely on a silhouette analysis or the elbow method to decide a good K.
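
For a quick numeric check on this toy dataset, you can compare silhouette scores for the two runs. This is a minimal sketch that reuses the data, labels2, and labels3 arrays defined above; higher scores (closer to 1.0) indicate better-separated clusters.

# Compare the two clusterings numerically; a higher silhouette score is better
from sklearn.metrics import silhouette_score

print("Silhouette, K=2:", silhouette_score(data, labels2))
print("Silhouette, K=3:", silhouette_score(data, labels3))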

Manual Centroid Initialization for K-means Clustering

Sometimes, you want direct control over centroid initialization. You might use domain knowledge or an informed guess to give the algorithm a better start. You can specify the exact positions of your initial centroids. However, you should note that scikit-learn’s default “k-means++” usually does a good job. Let’s demonstrate how to manually set the centroids:

# We choose three clusters and manual centroids
num_clusters = 3

# Let's pick three points from the data as our initial centroids
initial_centroids = data[np.random.choice(range(data.shape[0]), num_clusters, replace=False), :]

kmeans_manual = KMeans(n_clusters=num_clusters, init=initial_centroids, n_init=1, random_state=42)
kmeans_manual.fit(data)

manual_labels = kmeans_manual.labels_
manual_centers = kmeans_manual.cluster_centers_

plt.scatter(data[:, 0], data[:, 1], c=manual_labels, cmap='plasma')
plt.scatter(manual_centers[:, 0], manual_centers[:, 1], c='red', marker='X')
plt.title("Manual Centroid Initialization")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()


In the code above, I set n_init=1 so that scikit-learn uses my chosen centroids exactly once, rather than running multiple initializations behind the scenes. This setup helps you see how K-means Clustering depends on your initial guesses. If you pick different starting points, for example by changing the seed that feeds np.random.choice, you could see a different output.
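
One simple way to judge the quality of the manual start is to compare the final WCSS of this run with the earlier k-means++ run that used the same K. This short check reuses the kmeans_manual and kmeans_3 objects fitted above.

# Lower inertia indicates a tighter fit for the same number of clusters
print("Inertia with manual init:   ", kmeans_manual.inertia_)
print("Inertia with k-means++ init:", kmeans_3.inertia_)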

Why Initial Centroids and K Matter in Clustering

K-means Clustering is an iterative algorithm: it stops once each point belongs to the cluster with the nearest centroid and those centroids no longer shift significantly between iterations. However, it only finds a local minimum, not necessarily the best global centroid arrangement. If you place the initial centroids poorly, K-means might get stuck. Additionally, if you pick the wrong K, you might group dissimilar points together. To move toward better results, try different initial seeds, explore other values of K, or use smarter initialization methods such as K-means++.
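
You can observe the local-minimum behavior directly by running K-means several times with a single random initialization each time and comparing the objective values. This is a minimal sketch that reuses the data array from above; if the printed inertia values differ across seeds, the runs converged to different local minima.

# Each run uses one purely random initialization, so results may differ by seed
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(data)
    print(f"seed={seed}  inertia={km.inertia_:.2f}")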

Summarizing Key Insights

1. **Cluster Count is Crucial**: You must analyze your data with multiple values of K. You should rely on methods like the elbow method or silhouette analysis to confirm you are not under-clustering or over-clustering.
2. **Centroid Placement Matters**: Poor initial centroids can produce subpar results, but random restarts or K-means++ reduce this issue.
3. **Python Libraries Help**: With scikit-learn, you can quickly experiment by changing K, tweaking centroid initialization, or investigating other metrics.
4. **Iterative Refinement**: K-means keeps recalculating cluster centers until the centroids become stable; a short NumPy sketch of one such step follows this list. You can run multiple trials to ensure you pick the best outcome.
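
To make the iterative refinement in point 4 concrete, here is a minimal NumPy sketch of a single assign-and-update step (a sketch only, assuming every cluster keeps at least one point). The real algorithm simply repeats this step until the centroids stop moving.

# One assign-and-update step of K-means, written out in NumPy (sketch only)
import numpy as np

def kmeans_one_iteration(points, centroids):
    # Assignment step: each point joins the cluster of its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids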

Further Practice in K-means Clustering with Centroids

I encourage you to experiment with larger real-world datasets. Many times, you will find that data does not spread so cleanly. By applying this approach, you will see how K-means Clustering helps you detect natural groupings or anomalies in data. You might also add domain-specific steps, like normalizing or standardizing numeric values, to make clusters more consistent.
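
For example, here is a minimal sketch of standardizing features before clustering with scikit-learn’s StandardScaler; it reuses the data array from this tutorial, but the same pattern applies to your own datasets.

# Scale each feature to zero mean and unit variance before clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaled_data = StandardScaler().fit_transform(data)
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled_data)
print(kmeans_scaled.labels_)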

Conclusion: K-means Clustering in Action

K-means Clustering stands as one of the most accessible and widely used unsupervised learning methods. Indeed, it opens the door to new insights by illuminating natural groupings in data. You have seen the impact that choosing the number of clusters (K) and managing centroid initialization have on your final results. Now that you have walked through the examples and code, you can confidently apply these concepts in your own projects. Remember to run multiple tests, consider additional metrics, and follow best practices for centroid selection. This awareness keeps your K-means Clustering pipeline trustworthy and lets you continue to reap data-driven benefits.

