There are several methods of Cluster Analysis, of which the Hierarchical Method is one of the most commonly used. In this method, the algorithm builds a hierarchy of clusters, organizing the data in a hierarchical tree, as shown in the figure below:
Hierarchical clustering has two approaches: the top-down approach (Divisive Approach) and the bottom-up approach (Agglomerative Approach). In this article, we will look at the Agglomerative Clustering approach.
In Agglomerative Clustering, each object/data point is initially treated as its own cluster. The algorithm then merges clusters successively: it calculates the distance between every pair of clusters, merges the two clusters with the shortest distance (i.e., those which are closest) into a new cluster, and the new cluster takes part in the next round of comparisons. The algorithm keeps merging the closest clusters until the termination condition is met, for example until the desired number of clusters remains. This results in a tree-like representation of the data objects, called a dendrogram.
In the above dendrogram, we start with 14 data points, each in its own cluster. The height at which two data points or clusters are agglomerated represents the distance between those two clusters in the data space.
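To see how these merge heights arise, here is a minimal sketch using SciPy's linkage() function on a few made-up points (the points and the choice of single linkage here are illustrative assumptions, not the data used later in this article):

import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative (made-up) 2-D points; each row starts as its own cluster
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Each row of Z records one merge:
# [cluster index i, cluster index j, merge distance, size of the new cluster]
Z = linkage(points, method='single')
print(Z)

The third column of Z is the distance at which the two clusters were merged, which is exactly the height at which that merge is drawn in the dendrogram.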
The first step in agglomerative clustering is the calculation of distances between data points or clusters. Let’s look at some commonly used distance metrics:
The Euclidean distance is the shortest (straight-line) distance between two points. In n-dimensional space:
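For points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the distance is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$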
The linkage step in Agglomerative Clustering is where the distance between clusters is calculated; a linkage is therefore a measure of dissimilarity between clusters. There are several methods of linkage creation. Some of them are:
In Single Linkage, the distance between the two clusters is the minimum distance between clusters’ data points.
In Complete Linkage, the distance between two clusters is the maximum distance between clusters’ data points.
In Average Linkage, the distance between clusters is the average of the distances from each data point in one cluster to every data point in the other cluster.
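As a rough illustration of these three linkage criteria, here is a minimal sketch (the two small clusters are made-up assumptions) that computes the single, complete, and average linkage distances directly from the pairwise point-to-point distances:

import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative (made-up) clusters of 2-D points
cluster_a = np.array([[1.0, 1.0], [2.0, 1.5]])
cluster_b = np.array([[6.0, 5.0], [7.0, 6.0], [8.0, 5.5]])

# Pairwise Euclidean distances between every point in cluster_a and every point in cluster_b
pairwise = cdist(cluster_a, cluster_b)

print("Single linkage  :", pairwise.min())   # closest pair of points
print("Complete linkage:", pairwise.max())   # farthest pair of points
print("Average linkage :", pairwise.mean())  # mean over all pairs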
Let's take a look at an example of Agglomerative Clustering in Python. The AgglomerativeClustering() class is available in the sklearn.cluster module of Python's sklearn (scikit-learn) library.
Consider the following data:
import pandas as pd

data = [[200, 500], [700, 750], [250, 450], [300, 550], [600, 300],
        [500, 350], [700, 850], [650, 700], [900, 700], [400, 200]]
df = pd.DataFrame(data, columns=['A', 'B'])
print(df)
The DataFrame is:
     A    B
0  200  500
1  700  750
2  250  450
3  300  550
4  600  300
5  500  350
6  700  850
7  650  700
8  900  700
9  400  200
Let's view the dendrogram for this data. The Python code to do so is:
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(df, method='average'))
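If you run this as a standalone script rather than in a notebook, matplotlib's show() call is typically needed to display the figure:

import matplotlib.pyplot as plt

plt.show()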
In this code, Average linkage is used. The dendrogram is:
The AgglomerativeClustering class can be imported from Python's sklearn library. Looking at the three colors in the above dendrogram, we can estimate that a suitable number of clusters for the given data is 3. Let's create an Agglomerative Clustering model with the following parameters:
from sklearn.cluster import AgglomerativeClustering

# Note: in scikit-learn 1.2 and later, the 'affinity' parameter has been renamed to 'metric'
agg_model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
agg_model.fit(df)
The labels_ property of the model returns the cluster labels, as:
cluster_label = agg_model.labels_
print(cluster_label)
The labels for the above data are:
[2 0 2 2 1 1 0 0 0 1]
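As an aside, scikit-learn's fit_predict() method fits the model and returns the labels in one step; the snippet below is an optional alternative to calling fit() and then reading labels_:

# fit_predict() fits the model and returns the cluster labels in one call
labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(df)
print(labels)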
To visualize the clusters in the above data, we can plot a scatter plot as:
# inserting the labels column in the original DataFrame
df.insert(2, "Label", cluster_label, allow_duplicates=True)

# plotting the data
import seaborn as sns
sns.lmplot(data=df, x='A', y='B', hue='Label',
           fit_reg=False, legend=True, legend_out=True)
Visualization for the data and clusters is:
The figure above clearly shows the three clusters and the data points assigned to each of them.
In this article, we focused on Agglomerative Clustering. In the next article, we will look into DBSCAN Clustering.