k-Means Clustering is a partitioning-based clustering method and one of the most popular and widely used methods of cluster analysis. This unsupervised learning method works on multidimensional data.
k-Means is a very fast and efficient algorithm. It is simple and easy to understand.
k-Means is a centroid-based method, where the number k refers to the number of centroids (or clusters, since a centroid is the center of a cluster) in the data. The algorithm assigns each data point to the nearest centroid. The data points within a cluster should be similar to each other, so the algorithm minimizes the distance between the data points and their cluster's centroid using a distance metric (typically Euclidean distance).
In k-Means, the ‘means’ refers to averaging the data points assigned to a centroid; this average becomes the new centroid and thus defines the cluster.
There are certain stopping criteria that can be used to stop the k-Means algorithm: the centroids of the newly formed clusters no longer change, the data points remain in the same clusters, or a maximum number of iterations is reached (see the sketch below).
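To make the assignment and averaging steps concrete, here is a minimal NumPy sketch of the k-Means loop. The function name kmeans_sketch and parameters such as max_iters and tol are illustrative only and are not part of scikit-learn's API:

import numpy as np

def kmeans_sketch(points, k, max_iters=100, tol=1e-4):
    # pick k random points as the initial centroids
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(max_iters):
        # assignment step: each point goes to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: the "means" -- each centroid becomes the average of its points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # stopping criterion: centroids have (almost) stopped moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids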
Let’s take a look at an example of k-Means Clustering in Python. The KMeans estimator is available in Python’s scikit-learn (sklearn) library.
Consider the following data:
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# create the dataset
centers = [[2, 1], [-2, 2], [-2, -2]]
x, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.6)

# normalization of the values
x = StandardScaler().fit_transform(x)
Here, we create data with three clusters using the make_blobs() function. For our data, the parameters are: 300 sample points, a cluster standard deviation of 0.6, and the cluster centers defined in the code above.
We can visualize these data points using matplotlib, as:
import matplotlib.pyplot as plt

plt.scatter(x[:, 0], x[:, 1])
plt.show()
The scatter plot is:
Now, let’s apply the k-Means algorithm to the above data points. The Python code to do so is:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(x)
Here, for the k-Means algorithm, we have specified the number of clusters = 3.
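If the number of clusters is not known in advance, a common heuristic (not part of the original code above) is the elbow method: fit KMeans for several values of k and plot the inertia (the within-cluster sum of squared distances exposed by scikit-learn as inertia_); the "elbow" in the curve suggests a reasonable k. A minimal sketch:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# inertia_ is the within-cluster sum of squared distances for the fitted model
inertias = [KMeans(n_clusters=k).fit(x).inertia_ for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()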
Let’s visualize how the points are classified in the clusters:
y2 = kmeans.predict(x)

plt.scatter(x[:, 0], x[:, 1], c=y2)
plt.show()
The scatter plot is:
We can see that the algorithm has identified three clusters.
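As a small numerical check (not part of the original tutorial), we can count how many of the 300 points fall into each of the three clusters:

import numpy as np

# y2 holds the cluster label (0, 1 or 2) predicted for each point
print(np.bincount(y2))  # three counts that sum to 300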
We can also view the centers of these clusters, as:
kcenter = kmeans.cluster_centers_

plt.scatter(x[:, 0], x[:, 1], c=y2)
plt.scatter(kcenter[:, 0], kcenter[:, 1], color='red')
plt.show()
The above code plots the center points in red, as:
In this article, we focused on k-Means Clustering. In the next article, we will look into Spectral Clustering.