K-Means Clustering

k-Means Clustering is the Partitioning-based clustering method and is the most popular and widely used method of Cluster Analysis. The unsupervised learning method works on multidimensional data.

k-Means is a very fast and efficient algorithm. It is simple and easy to understand.

k-means is a centroid-based method, where the number k is defined to refer to the number of centroids (or clusters, as centroid is the center of the cluster) in the data. It then assigns the data points to the nearest cluster. The data points within a cluster should be similar to each other. Hence, we need to minimize the distance between the data points within a cluster using the distance metrics.

In K-means, the ‘means’ implies averaging of the data to find the centroid and thus define the cluster.

Algorithm

  1. Randomly select k data points (called ‘means’) and initialize them.
  2. For each data point, calculate its distance from each of the clusters’ center
  3. Assign the data point to the closest mean
  4. Update the coordinates of the mean (i.e., the average of all data points in that cluster)
  5. Repeat the process for the specified number of iterations

Halting Criteria

There are certain stopping criteria that can be used to stop the k-Means algorithm:

  • The defined number of iterations are reached
  • No data points are getting reassigned, i.e., points remain in the same cluster
  • There is no change in the value of centroids of the newly formed clusters

Example

Let’s take a look at an example of k-Means Clustering in Python. The function kMeans() is present in Python’s sklearn library.

Consider the following data:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# create the dataset
centers = [[2,1], [-2,2], [-2,-2]]
x, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.6)

# normalization of the values
x = StandardScaler().fit_transform(x)

Here, we are creating data with three clusters. The function used to create these data points is the make_blobs() function. The parameters for this function for our data are – 300 sample data points, the standard deviation of a cluster is 0.6, the centers are as defined in the code.

We can visualize these data points using matplotlib, as:

import matplotlib.pyplot as plt
plt.scatter(x[:,0], x[:,1])
plt.show()

The scatter plot is:

Now, let’s apply the k-Means algorithm to the above data points. The python code to do so is:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3).fit(x)

Here, for the k-Means algorithm, we have specified the number of clusters = 3.

Let’s visualize how the points are classified in the clusters:

y2 = kmeans.predict(x)

plt.scatter(x[:, 0], x[:, 1], c=y2)
plt.show()

The scatter plot is:

We can see that the algorithm has identified three clusters.

We can also view the centers of these clusters, as:

kcenter = kmeans.cluster_centers_
plt.scatter(x[:, 0], x[:, 1], c=y2)
plt.scatter(kcenter[:, 0], kcenter[:, 1], color=’red’)
plt.show()

The above code plots the center points in red, as:

Summary

In this article, we focused on K-means Clustering. In the next article, we will look into a Spectral Clustering.

Pallavi Pandey

Recent Posts

MapReduce Algorithm

In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…

8 months ago

Linear Programming using Pyomo

Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…

1 year ago

Networking and Professional Development for Machine Learning Careers in the USA

In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…

1 year ago

Predicting Employee Churn in Python

Analyze employee churn, Why employees are leaving the company, and How to predict, who will…

2 years ago

Airflow Operators

Airflow operators are core components of any workflow defined in airflow. The operator represents a…

2 years ago

MLOps Tutorial

Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…

2 years ago