Cluster analysis comprises many different methods, one of which is density-based clustering. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Given a set of data points, DBSCAN groups together points that lie close to each other, where closeness is defined by a distance metric (typically Euclidean) and a minimum number of neighboring points.
DBSCAN works on the idea that clusters are dense groups of points separated by low-density regions. Real-world data often contains outliers and noise. Its clusters can also have arbitrary shapes (as shown in the figures below), which causes commonly used clustering algorithms such as k-means to perform poorly. For arbitrarily shaped clusters or data containing noise, density-based algorithms such as DBSCAN are usually a better fit.
In the above figures, the clusters are arbitrarily shaped. Density-based clustering algorithms find the high-density regions and group them into clusters.
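To make the contrast with k-means concrete, here is a minimal sketch. The two-moons dataset from make_moons, the eps=0.2 and min_samples=5 settings, and the adjusted Rand index used for scoring are illustrative assumptions, not part of the example later in this article.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a simple non-convex cluster shape (assumed demo data)
x_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means assumes roughly spherical clusters around a centroid
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x_moons)

# DBSCAN follows the dense regions, whatever their shape
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(x_moons)

# Adjusted Rand index: 1.0 means the grouping matches the true moons exactly
ari_kmeans = adjusted_rand_score(y_moons, km_labels)
ari_dbscan = adjusted_rand_score(y_moons, db_labels)
print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On data like this, k-means tends to cut each moon in half, while DBSCAN can recover both moons because it only relies on local density.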
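The DBSCAN clustering algorithm requires two parameters:
- eps: the radius of the neighborhood around a point. Two points are neighbors if the distance between them is at most eps.
- minPts: the minimum number of points that must lie within the eps-neighborhood of a point for that neighborhood to count as a dense region.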
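Based on these two parameters, DBSCAN classifies every point as one of three types:
- Core point: a point with at least minPts points within distance eps.
- Border point: a point with fewer than minPts points within distance eps, but which lies in the eps-neighborhood of a core point.
- Noise point (outlier): a point that is neither a core point nor a border point.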
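Algorithm:
1. Pick an unvisited point and find all points within distance eps of it.
2. If it has fewer than minPts neighbors, mark it as noise for now (it may later turn out to be a border point of some cluster).
3. Otherwise it is a core point: start a new cluster and add its neighbors to it.
4. Expand the cluster: for every core point added, add its eps-neighbors as well, and repeat until no more points can be added.
5. Repeat from step 1 until every point has been visited.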
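One way to sketch the standard DBSCAN procedure in plain Python with NumPy is shown below. The function name dbscan_sketch, the UNVISITED marker, and the brute-force distance matrix are assumptions made for illustration. This is not how scikit-learn implements it, and it is only suitable for small datasets.

```python
import numpy as np

NOISE = -1       # label for noise points (same convention as scikit-learn)
UNVISITED = -2   # internal marker used only by this sketch

def dbscan_sketch(points, eps, min_pts):
    """Illustrative DBSCAN: returns one cluster label per point, -1 for noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Brute-force pairwise Euclidean distances: O(n^2) memory
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # eps-neighborhood of every point (each point counts as its own neighbor)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE          # may be relabelled as a border point later
            continue
        labels[i] = cluster_id         # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable but not core: border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster_id += 1
    return labels

# Two tight groups of three points and one isolated point
pts = [[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5], [10, 0]]
labels = dbscan_sketch(pts, eps=0.5, min_pts=3)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point has no neighbors within eps other than itself, so it keeps the noise label -1.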
Let’s take a look at an example of DBSCAN clustering in Python. The DBSCAN class is available in the sklearn.cluster module of the scikit-learn library. Consider the following data:
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    # create the dataset: four blobs around the given centers
    centers = [[2, 1], [0, 0], [-2, 2], [-2, -2]]
    x, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.4)

    # standardize the features (zero mean, unit variance)
    x = StandardScaler().fit_transform(x)
Here, we create a dataset with four clusters using the make_blobs() function. We pass it 300 sample points, a standard deviation of 0.4 for each cluster, and the centers defined in the code. The features are then standardized with StandardScaler so that each one has zero mean and unit variance.
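As a quick sanity check on this dataset, the sketch below confirms its shape, the number of points per blob, and the effect of the scaling. The names x_raw, y_true, and x_scaled, and the fixed random_state, are assumptions added here so the check is repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[2, 1], [0, 0], [-2, 2], [-2, -2]]
# random_state is an added assumption; the example in the article leaves it unset
x_raw, y_true = make_blobs(n_samples=300, centers=centers,
                           cluster_std=0.4, random_state=0)
x_scaled = StandardScaler().fit_transform(x_raw)

print(x_raw.shape)                 # (300, 2)
print(np.bincount(y_true))         # [75 75 75 75]

# After standardization each feature has mean ~0 and standard deviation ~1
col_means = x_scaled.mean(axis=0)
col_stds = x_scaled.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))  # True True
```

Because eps is a distance, scaling the features changes what a given eps value means, which is why the scaling step comes before clustering.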
We can visualize these data points using matplotlib, as:
    import matplotlib.pyplot as plt

    plt.scatter(x[:, 0], x[:, 1])
    plt.show()
The scatter plot is:
Now, let’s apply the DBSCAN algorithm to the above data points. The Python code to do so is:
    from sklearn.cluster import DBSCAN

    # eps is the neighborhood radius, min_samples plays the role of minPts
    db = DBSCAN(eps=0.3, min_samples=25)
    db.fit(x)
Here, we have specified eps = 0.3 and minPts = 25. In scikit-learn, minPts is passed as the min_samples argument, and the count includes the point itself.
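A fitted DBSCAN estimator exposes labels_ (the cluster label per point, with -1 for noise) and core_sample_indices_ (the indices of the core points). The sketch below uses them to count core, border, and noise points. The variable names and the fixed random_state are assumptions added for repeatability, and the exact counts depend on the generated data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[2, 1], [0, 0], [-2, 2], [-2, -2]]
x_demo, _ = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.4, random_state=0)  # random_state assumed
x_demo = StandardScaler().fit_transform(x_demo)

db_demo = DBSCAN(eps=0.3, min_samples=25).fit(x_demo)
labels = db_demo.labels_

# Core points are listed explicitly; noise points carry the label -1;
# border points belong to a cluster without being core points
is_core = np.zeros(len(labels), dtype=bool)
is_core[db_demo.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise

n_clusters = len(set(labels.tolist()) - {-1})
print("clusters:", n_clusters)
print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
```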
Let’s visualize how the points are classified in the clusters:
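    # fit_predict refits the model and returns one cluster label per point (-1 marks noise)
    y2 = db.fit_predict(x)

    # color each point by its cluster label
    plt.scatter(x[:, 0], x[:, 1], c=y2)
    plt.show()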
The scatter plot is:
We can see that the algorithm has identified four clusters, shown in yellow, green, dark blue, and light blue. The violet points outside the clusters are the noise/outlier points detected by the algorithm, which scikit-learn labels as -1. By changing the eps and minPts values, we can obtain different cluster configurations.
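To see how the two parameters change the result, a small sweep such as the one below can help. The grid of eps and min_samples values and the fixed random_state are assumptions chosen for illustration, and the counts printed will depend on the generated data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[2, 1], [0, 0], [-2, 2], [-2, -2]]
x_sweep, _ = make_blobs(n_samples=300, centers=centers,
                        cluster_std=0.4, random_state=0)  # random_state assumed
x_sweep = StandardScaler().fit_transform(x_sweep)

results = []
for min_samples in (5, 25):
    for eps in (0.1, 0.3, 0.5):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(x_sweep)
        n_clusters = len(set(labels.tolist()) - {-1})  # ignore the noise label
        n_noise = int((labels == -1).sum())
        results.append((min_samples, eps, n_clusters, n_noise))
        print(f"min_samples={min_samples:>2} eps={eps}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

For a fixed min_samples, a larger eps never increases the number of noise points. A very small eps tends to label most points as noise, while a very large eps tends to merge neighboring clusters.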
In this article, we focused on DBSCAN clustering. In the next article, we will look at a partitioning-based clustering algorithm, k-means clustering.