Top 15 frequently asked data science interview questions and answers on data preprocessing for fresher and experienced Data Scientist, Data Analyst, Statistician, and Machine Learning Engineer job roles.
Data Science is an interdisciplinary field. It uses statistics, machine learning, databases, visualization, and programming. So in this fifth article, we are focusing on Data Preprocessing questions.
Let’s see the interview questions.
Features are the core characteristics of the data that influence a model's predictions. Feature engineering is the process of creating new features, transforming features, and encoding features. Sometimes we also use domain knowledge to generate new features. It prepares the data so that it can easily be input to the model, and it improves model performance.
In feature scaling, we change the scale of features to convert them into a common range such as (-1, 1) or (0, 1). This process is also known as data normalization. There are various methods for scaling or normalizing data, such as min-max normalization, z-score normalization (standard scaler), and the robust scaler.
Min-max normalization performs a linear transformation on the original data and converts it into a given minimum and maximum range.
z-score normalization (or standard scaler) normalizes the data based on the mean and standard deviation.
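Both scalers above can be written in a couple of lines. Here is a minimal sketch with numpy, using a hypothetical feature column (the values are assumptions for illustration only):

```python
import numpy as np

# Hypothetical feature values (assumed data for illustration).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: linear rescale into the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization (standard scaler): center by the mean,
# scale by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # values between 0 and 1
print(x_zscore)  # mean ~ 0, standard deviation ~ 1
```

In practice you would typically reach for `MinMaxScaler` and `StandardScaler` from scikit-learn, which also remember the fitted parameters so the same scaling can be applied to test data.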
In the data cleaning process, we often find values that are missing, not filled in, or not collected during the survey. We can handle such missing values using methods such as dropping the affected rows or columns, imputing with the mean, median, or mode, or predicting the missing values with a model.
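As a small sketch of mean imputation, here is one common approach with numpy, using a hypothetical age column (the values are assumptions for illustration):

```python
import numpy as np

# Hypothetical column with missing values (np.nan marks a gap).
age = np.array([25.0, np.nan, 30.0, 35.0, np.nan, 40.0])

# Mean imputation: replace each missing value with the mean
# of the observed (non-missing) values.
mean_age = np.nanmean(age)
age_filled = np.where(np.isnan(age), mean_age, age)

print(age_filled)  # the two NaNs become 32.5
```

Median imputation works the same way with `np.nanmedian`, and is more robust when the column contains outliers.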
Outliers are abnormal observations that deviate from the norm and do not fit the normal behavior of the data. We can detect outliers using methods such as box plots, scatter plots, the Z-score, and the interquartile range (IQR).
We can treat outliers by removing them from the data. After detecting outliers, we can filter them out using the Z-score, percentiles, or the 1.5 × IQR rule.
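The 1.5 × IQR rule mentioned above can be sketched as follows, using hypothetical data with one obvious outlier (the values are assumptions for illustration):

```python
import numpy as np

# Hypothetical data with one obvious outlier (assumed for illustration).
values = np.array([10, 12, 11, 13, 12, 11, 95])

# 1.5 * IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# are treated as outliers.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = values[(values >= lower) & (values <= upper)]
print(filtered)  # the outlier 95 is removed
```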
Feature split is an approach that generates new features from an existing one to improve model performance, for example, splitting a full name into first and last name.
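The name example above takes only a line or two of plain Python; the names here are hypothetical sample data:

```python
# Hypothetical "name" column (assumed data for illustration).
names = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]

# Split each full name into first-name and last-name features.
first_names = [n.split()[0] for n in names]
last_names = [n.split()[-1] for n in names]

print(first_names)  # ['Ada', 'Alan', 'Grace']
print(last_names)   # ['Lovelace', 'Turing', 'Hopper']
```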
We can handle skewed data using transformations such as the square, log, square root, reciprocal (1/x), and Box-Cox transforms.
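For instance, a log transform compresses the long right tail of skewed data. Here is a minimal sketch with hypothetical income values (assumed for illustration):

```python
import numpy as np

# Hypothetical right-skewed data, e.g. incomes (assumed values).
income = np.array([20_000, 25_000, 30_000, 40_000, 500_000], dtype=float)

# A log transform compresses large values and reduces right skew.
log_income = np.log(income)

# Square-root and reciprocal transforms are milder alternatives.
sqrt_income = np.sqrt(income)
recip_income = 1.0 / income

print(log_income)
```

Note that the log and square-root transforms require non-negative data; for data that can be zero, `np.log1p` is a common workaround.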
Data transformation consolidates or aggregates your data columns, which can affect your machine learning model's performance. Common strategies to transform data include smoothing, aggregation, generalization, normalization, and attribute construction.
We can select the important features using random forest feature importance, or remove redundant features using recursive feature elimination. Let's see all the categories of such methods.
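Random forest importances and recursive feature elimination are usually taken from scikit-learn. As a dependency-light sketch of the same idea, here is a simple filter-style approach (not the RFE algorithm itself) that drops one feature from each highly correlated pair; the feature matrix is hypothetical:

```python
import numpy as np

np.random.seed(0)

# Hypothetical feature matrix: x2 is almost a copy of x0 (redundant).
x0 = np.random.rand(100)
x1 = np.random.rand(100)
x2 = x0 + 0.01 * np.random.rand(100)
X = np.column_stack([x0, x1, x2])

# Filter-method sketch: drop one feature from every pair whose
# absolute correlation exceeds a threshold (here 0.9).
corr = np.abs(np.corrcoef(X, rowvar=False))
to_drop = set()
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if corr[i, j] > 0.9 and j not in to_drop:
            to_drop.add(j)

kept = [i for i in range(n) if i not in to_drop]
print(kept)  # feature 2 is dropped as redundant with feature 0
```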
You can get lots of other important features from the date such as day of the week, day of the month, day of the quarter, and day of the year. Also, you can extract the date, month, and year. These all features can impact your prediction for example sales can be impacted by month or day of the week.
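All of the date features above can be pulled out with Python's standard `datetime` module; the date here is a hypothetical order date:

```python
from datetime import date

# Hypothetical order date (assumed value for illustration).
d = date(2021, 7, 15)

features = {
    "day": d.day,
    "month": d.month,
    "year": d.year,
    "day_of_week": d.weekday(),           # Monday = 0
    "day_of_year": d.timetuple().tm_yday,
    "quarter": (d.month - 1) // 3 + 1,
}
print(features)
```

With pandas, the equivalent accessors live under a datetime column's `.dt` attribute (e.g. `df["date"].dt.dayofweek`).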
Multicollinearity is a high correlation among two or more predictor variables in a multiple regression model, i.e., high intercorrelation or inter-association among the independent variables. It can be caused by the inaccurate use of dummy variables or by repeating the same kind of variable. Multicollinearity can change the signs as well as the magnitudes of the regression coefficients. We can detect multicollinearity using the correlation coefficient, the variance inflation factor (VIF), and eigenvalues. The easiest to compute is the correlation coefficient. Let's understand this with an example:
You have a dataset with the following columns:
Suppose body surface area and weight have a high correlation: which variable should be removed to overcome multicollinearity?
In this case, weight is easier to measure than body surface area (BSA), so we would keep weight and remove BSA. Which of two highly correlated variables to remove generally depends on domain knowledge.
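The correlation-coefficient check from the example can be sketched with numpy; the weight and BSA columns below are hypothetical values chosen to be nearly proportional:

```python
import numpy as np

# Hypothetical weight (kg) and body-surface-area (m^2) columns:
# BSA grows roughly in proportion to weight, so the two predictors
# are highly correlated (assumed data for illustration).
weight = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0])
bsa = np.array([1.60, 1.68, 1.75, 1.83, 1.90, 1.98])

# Pearson correlation coefficient: values near +/-1 signal
# multicollinearity. (The VIF for a predictor is 1 / (1 - R^2),
# where R^2 comes from regressing it on the other predictors.)
r = np.corrcoef(weight, bsa)[0, 1]
print(round(r, 3))
```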
Heteroscedasticity is a situation where the variability of a variable is unequal across the range of values of a second variable that predicts it. We can detect heteroscedasticity using graphs or statistical tests such as the Breusch-Pagan test and the NCV test. We can remove heteroscedasticity using Box-Cox and log transformations. The Box-Cox transformation is a kind of power transformation that makes data more closely follow a normal distribution.
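The Box-Cox transform has a simple closed form for a given power parameter lambda; a minimal hand-rolled sketch follows (in practice `scipy.stats.boxcox` also estimates the best lambda for you; the data values here are assumptions):

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox power transform for strictly positive data.

    y = (x**lambda - 1) / lambda  when lambda != 0,
    y = log(x)                    when lambda == 0.
    """
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

# Hypothetical right-skewed, strictly positive data (assumed values).
data = np.array([1.0, 2.0, 5.0, 10.0, 100.0])
print(box_cox(data, 0.0))  # equivalent to a log transform
print(box_cox(data, 0.5))  # a shifted/scaled square-root transform
```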
In regression analysis, we need to convert all the categorical columns into binary variables. Such variables are known as dummy variables. A dummy variable is also known as an indicator variable, design variable, Boolean indicator, or qualitative variable. It takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. K categories require K − 1 dummy variables. For example, the eye color and gender columns can be converted into one-hot encoded binary values.
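The K − 1 rule can be shown in plain Python: with K = 3 eye colors, two dummy columns suffice because the dropped reference category is encoded as all zeros (the column values here are hypothetical):

```python
# Hypothetical eye-color column with K = 3 categories (assumed data).
eye_color = ["brown", "blue", "green", "blue", "brown"]

# K categories need only K - 1 dummy variables: the dropped
# reference category ("blue" here) is encoded as all zeros.
categories = ["brown", "green"]  # "blue" is the reference level
dummies = [[1 if c == cat else 0 for cat in categories] for c in eye_color]

print(dummies)  # [[1, 0], [0, 0], [0, 1], [0, 0], [1, 0]]
```

This is what `pandas.get_dummies(..., drop_first=True)` does; dropping one level avoids the "dummy variable trap" where the dummies sum to a constant column.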
Label encoding is a kind of integer encoding, where each unique category value is replaced with an integer so that the machine can work with it.
Ordinal encoding is label encoding with a meaningful order in the encoded values.
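The difference between the two is just where the integers come from; here is a minimal sketch with a hypothetical "size" column (assumed data for illustration):

```python
# Hypothetical "size" column with a natural order (assumed data).
sizes = ["small", "large", "medium", "small"]

# Label encoding: assign an arbitrary integer to each category
# (alphabetical here, as scikit-learn's LabelEncoder does).
labels = {cat: i for i, cat in enumerate(sorted(set(sizes)))}
label_encoded = [labels[s] for s in sizes]

# Ordinal encoding: the integers follow the real-world order.
order = {"small": 0, "medium": 1, "large": 2}
ordinal_encoded = [order[s] for s in sizes]

print(label_encoded)    # alphabetical: large=0, medium=1, small=2
print(ordinal_encoded)  # ordered:      small=0, medium=1, large=2
```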
One-hot encoding is used to encode a categorical column. It replaces the categorical column with one binary column per category, filled with 0s and 1s. For example, in the "color" column there are 3 categories, red, yellow, and green, so each category gets its own binary column.
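The color example can be sketched in plain Python; the column values are hypothetical:

```python
# Hypothetical "color" column with 3 categories (assumed data).
colors = ["red", "yellow", "green", "red"]

# One-hot encoding: one binary column per category.
categories = ["red", "yellow", "green"]
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In practice this is `pandas.get_dummies` or scikit-learn's `OneHotEncoder`, which also handle unseen categories at prediction time.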
In this article, we have focused on the data preprocessing interview questions. In the next article, we will focus on the interview questions related to NLP and Text Analytics.
Data Science Interview Questions Part-6 (NLP and Text Analytics)