Top 20 frequently asked data science interview questions and answers on unsupervised learning for fresher and experienced data scientist, data analyst, statistician, and machine learning engineer job roles.
Data Science is an interdisciplinary field. It uses statistics, machine learning, databases, visualization, and programming. So in this fourth article, we are focusing on unsupervised learning questions.
Let’s see the interview questions.
Clustering is an unsupervised learning technique because it does not use a target variable or class label. It divides the given observations into several groups (clusters) based on certain similarities. Examples include segmenting customers and grouping supermarket products such as cheese, meat products, and appliances.
Dimensionality reduction is the process of reducing the number of attributes in high-dimensional data. There are many methods for reducing the dimension of the data: Principal Component Analysis (PCA), t-SNE, Wavelet Transformation, Factor Analysis, Linear Discriminant Analysis, and Attribute Subset Selection.
The K-means algorithm is an iterative algorithm that partitions the dataset into a pre-defined number of groups or clusters, where each observation belongs to only one group.
K-means algorithm works in the following steps:
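A minimal sketch of those steps, assuming the standard Lloyd-style iteration (pick initial centroids, assign points, update centroids, repeat until nothing moves). The function name `kmeans`, the random initialization, and the toy data are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (assumed): two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

Which group gets label 0 or 1 depends on the random initialization, so only the grouping itself is meaningful.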
Elbow criterion: This method is used to choose the optimal number of clusters (groups) of objects. It says that we should choose a number of clusters such that adding another cluster does not explain much more of the variance. The percentage of variance explained is the ratio of the between-group variance to the total variance. The method selects the point where the marginal gain drops.
You can also plot the within-cluster sum of squares (WCSS) against the number of clusters K. Here, WCSS is a cost function that decreases as the number of clusters increases. The plot looks like an arm, and the elbow of the arm marks the optimal value of K.
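To make the WCSS curve concrete, here is a small sketch that computes WCSS for a few values of K. The partitions are hand-picked for illustration (an assumption); in practice the labels for each K would come from running K-means.

```python
import numpy as np

def wcss(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)

# Hypothetical, hand-picked partitions for K = 1, 2, 3 (not K-means output).
partitions = {
    1: np.array([0, 0, 0, 0, 0, 0]),
    2: np.array([0, 0, 0, 1, 1, 1]),
    3: np.array([0, 0, 1, 2, 2, 2]),
}
wcss_by_k = {k: wcss(X, labels) for k, labels in partitions.items()}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

On this toy data WCSS falls sharply from K = 1 to K = 2 and only slightly from K = 2 to K = 3, so the elbow sits at K = 2.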
There are the following disadvantages:
Clusters can be evaluated using two types of measures: intrinsic and extrinsic. Intrinsic measures do not consider external class labels, while extrinsic measures do. Intrinsic cluster evaluation measures include the Davies-Bouldin index and the Silhouette coefficient. Extrinsic evaluation measures include the Jaccard and Rand indices.
Some clustering algorithms can generate clusters of arbitrary shape, for example density-based methods such as DBSCAN, OPTICS, and DENCLUE. Spectral clustering can also generate arbitrarily shaped clusters.
Euclidean distance measures the ‘as-the-crow-flies’ (straight-line) distance between two points. Manhattan distance, also known as city block distance, measures the distance in blocks between any two points in a city, that is, the sum of the absolute differences along each axis.
Spectral clustering is based on standard linear algebra and uses the connectivity approach to clustering. It is easy to implement, fast especially for sparse datasets, and can generate non-convex clusters. Spectral clustering is a kind of graph partitioning algorithm. The spectral algorithm works in the following steps.
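As an illustration of the extrinsic measures, the sketch below computes the Rand and Jaccard indices by counting pairs of observations. It assumes the usual pair-counting definitions; the function names and the toy labelings are illustrative.

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count object pairs by whether they share a cluster in each labeling."""
    both = only_true = only_pred = neither = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            both += 1
        elif same_true:
            only_true += 1
        elif same_pred:
            only_pred += 1
        else:
            neither += 1
    return both, only_true, only_pred, neither

def rand_index(true_labels, pred_labels):
    # Fraction of pairs on which the two labelings agree.
    both, only_true, only_pred, neither = pair_counts(true_labels, pred_labels)
    return (both + neither) / (both + only_true + only_pred + neither)

def jaccard_index(true_labels, pred_labels):
    # Like the Rand index, but ignores pairs separated in both labelings.
    both, only_true, only_pred, _ = pair_counts(true_labels, pred_labels)
    return both / (both + only_true + only_pred)

true_labels = [0, 0, 0, 1, 1, 1]   # external class labels (assumed)
pred_labels = [0, 0, 1, 1, 1, 1]   # cluster assignments (assumed)
print(rand_index(true_labels, pred_labels))
print(jaccard_index(true_labels, pred_labels))
```

Here one observation is placed in the wrong cluster, which gives a Rand index of 10/15 and a Jaccard index of 4/9.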
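A short sketch of both distances on the same pair of points (the helper names are illustrative):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City block distance: sum of the absolute differences per axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

Walking three blocks east and four blocks north covers 7 blocks, while the straight line between the same two corners is only 5.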
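A minimal sketch of one common recipe, assuming a Gaussian affinity graph, the unnormalized graph Laplacian, and a two-way split by the sign of the second eigenvector. For more than two clusters the usual approach is to run K-means on the first few eigenvectors instead. The function name, `sigma`, and the toy data are assumptions.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split X into two clusters using the sign of the Fiedler vector."""
    # Step 1: build a similarity graph (Gaussian affinity between all pairs).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: form the unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # Step 3: eigendecompose L; eigh returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(L)
    # Step 4: the eigenvector of the second-smallest eigenvalue (the Fiedler
    # vector) embeds the points on a line; its sign gives the two clusters.
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)

# Toy data (assumed): two 2x2 grids of points, three units apart.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [4, 0], [4, 1], [5, 0], [5, 1]], dtype=float)
labels = spectral_bipartition(X)
print(labels)
```

The sign of an eigenvector is arbitrary, so which grid ends up as cluster 0 or 1 may vary; only the split itself is meaningful.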
t-SNE stands for t-Distributed Stochastic Neighbor Embedding, which considers the nearest neighbors when reducing the data. t-SNE is a nonlinear dimensionality reduction technique. It has quadratic time and space complexity in the number of observations, so it does not scale well to large datasets.
The t-SNE algorithm computes the similarity between pairs of observations in the high-dimensional space and in the low-dimensional space. It then optimizes the embedding so that the two sets of similarities match as closely as possible, by minimizing the Kullback-Leibler divergence between them. In simple words, it maps the high-dimensional data into a lower-dimensional space. After the transformation, the input features can no longer be inferred from the reduced dimensions. It can be used in recognizing feature expressions, tumor detection, compression, information security, and bioinformatics.
PCA is the process of reducing the dimension of input data into a lower dimension while keeping the essence of all original variables. It is used to speed up the model generation process and helps in visualizing high-dimensional data.
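The two similarity measures and the quantity being optimized can be sketched as follows. This is only the cost computation, not the full algorithm: it assumes a fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes it per point from a perplexity setting) and uses a random embedding `Y` instead of an optimized one. All names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows of Z."""
    return ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)

def high_dim_similarities(X, sigma=1.0):
    """Gaussian similarities P in the original space (fixed sigma: assumption)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)   # conditional p(j | i), row-normalized
    return (P + P.T) / (2.0 * len(X))   # symmetrized joint p_ij

def low_dim_similarities(Y):
    """Student-t (one degree of freedom) similarities Q in the embedding."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q):
    """Cost that t-SNE minimizes: KL(P || Q) over all pairs with p_ij > 0."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))   # 6 observations in 10 dimensions (toy data)
Y = rng.normal(size=(6, 2))    # a random, not yet optimized, 2-D embedding
P, Q = high_dim_similarities(X), low_dim_similarities(Y)
cost = kl_divergence(P, Q)
print(cost)
```

The full algorithm would then move the rows of `Y` by gradient descent to drive this cost down.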
There are three methods for deciding the number of components:
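A minimal sketch of one common criterion: keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The 95% threshold, the function names, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X_centered = X - X.mean(axis=0)
    # Eigenvalues of the covariance matrix are the component variances.
    eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))
    eigenvalues = eigenvalues[::-1]          # sort descending
    return eigenvalues / eigenvalues.sum()

def n_components_for(X, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic data (assumed): the second feature is almost a multiple of the
# first and the third is tiny noise, so one direction dominates the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2.0 * t + 0.01 * rng.normal(size=200),
                     0.01 * rng.normal(size=200)])
ratios = explained_variance_ratio(X)
k = n_components_for(X, threshold=0.95)
print(ratios, k)
```

Plotting the same ratios against the component index gives a scree plot, where an elbow can be read off in the same way as for K-means.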
Eigenvectors are the axes of a linear transformation whose direction the transformation leaves unchanged. The eigenvalue is the scale factor by which a vector along such an axis is stretched or shrunk. Eigenvalues are also known as characteristic values or characteristic roots, and eigenvectors are also known as characteristic vectors.
SVM works better with lower-dimensional data than with high-dimensional data. When the number of features is greater than the number of observations, performing dimensionality reduction will generally improve the SVM.
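A quick numerical check of that definition on a small symmetric matrix (the matrix is an arbitrary example): multiplying an eigenvector by the matrix only rescales it by its eigenvalue.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
# eigh is for symmetric matrices; eigenvalues come back in ascending order
# and eigenvectors are the columns of the second return value.
eigenvalues, eigenvectors = np.linalg.eigh(A)
checks = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    # A only rescales v by lam; the direction of v is unchanged.
    checks.append(bool(np.allclose(A @ v, lam * v)))
print(eigenvalues, checks)
```

For this matrix the eigenvalues are 1 and 3, with eigenvectors along the two diagonals of the plane.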
t-SNE in comparison to PCA:
Benefits:
Limitations:
The main idea is to create clusters and keep adding objects as long as the density in their neighborhood exceeds some threshold. The density of an object is measured by the number of objects close to it. DBSCAN connects core objects with their neighborhoods to form dense regions as clusters. The size of the neighborhood is set by a user-specified radius ε. DBSCAN also uses another user-specified parameter, MinPts, that specifies the density threshold of dense regions.
Hierarchical methods partition data into groups at different levels, as in a hierarchy. Observations are grouped together on the basis of their mutual distance. Hierarchical clustering is of two types: agglomerative and divisive.
Agglomerative methods start with individual objects as clusters, which are iteratively merged to form larger clusters. The method starts with the leaves (individual records) and repeatedly merges the two clusters that are closest to each other according to some similarity measure. It is also known as AGNES (AGglomerative NESting).
Divisive methods start with one cluster, which they iteratively split into smaller clusters. The method divides the root cluster into several smaller sub-clusters and recursively partitions those into smaller ones. It is also known as DIANA (DIvisive ANAlysis).
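The idea can be sketched in a few lines. This is a minimal, unoptimized version that assumes Euclidean distance and counts a point as part of its own neighborhood; `eps` plays the role of the neighborhood radius and `min_pts` the role of MinPts. The names and toy data are illustrative.

```python
import numpy as np

NOISE = -1

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # eps-neighborhood of each point (the point itself is included).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n          # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if not is_core[i]:
            labels[i] = NOISE    # may later be claimed as a border point
            continue
        cluster += 1             # start a new cluster from this core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if is_core[j]:
                frontier.extend(neighbors[j])   # keep expanding the region
    return labels

# Toy data (assumed): two dense 2x2 grids and one isolated point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6],
              [10, 0]], dtype=float)
labels = dbscan(X, eps=1.5, min_pts=3)
print(labels)
```

The isolated point has no dense neighborhood and cannot be reached from any core point, so it is reported as noise rather than forced into a cluster.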
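A minimal sketch of the bottom-up merging, assuming single linkage (cluster distance is the distance between the closest pair of members) and a fixed target number of clusters as the stopping rule. Both choices, along with the function names, are assumptions for illustration.

```python
import math
from itertools import combinations

def single_linkage(points, a, b):
    """Distance between clusters a and b: the closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerative(points, n_clusters):
    # Start with every observation as its own cluster (the leaves).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(
                points, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]          # safe because a < b
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, n_clusters=3)
print(clusters)
```

Recording each merge and its distance, instead of stopping at a fixed count, would give the full hierarchy that is usually drawn as a dendrogram.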
In this article, we have focused on unsupervised learning interview questions. In the next article, we will focus on the interview questions related to data preprocessing.
Data Science Interview Questions Part-5 (Data Preprocessing)