Top frequently asked data science interview questions (Regression Analysis) and answers for freshers and experienced Data Scientist, Data Analyst, and Machine Learning Engineer job roles.
Data Science is an interdisciplinary field. It uses statistics, machine learning, databases, visualization, and programming. In the previous article (Data Science Interview Questions Part-1) in this series, we focused on basic data science interview questions about the domain in general. In this second article, we focus on basic data science questions related to Regression Analysis.
Let’s see the interview questions.
Regression analysis is a supervised statistical technique used to determine the relationship between a dependent variable and a series of independent variables.
The linear regression model assumes a linear relationship between the dependent and independent variables. It uses a linear equation, Y = a + bx, where x is the independent variable and Y is the dependent variable. Linear regression is easy to use and interpret. Non-linear regression does not follow the Y = a + bx equation. Non-linear regression is much more flexible in curve fitting; for example, it can be represented as a polynomial of degree k.
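As a minimal sketch of this contrast, the snippet below fits a straight line and a degree-2 polynomial to the same synthetic data with scikit-learn; the data and the degree are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
# Synthetic target with a curved (non-linear) relationship to x
y = 3 + 2 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, 100)

# Linear model: Y = a + b*x
linear = LinearRegression().fit(x, y)

# Polynomial model of degree k = 2: expand x into [x, x^2], then fit linearly
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)

print("linear R^2:", linear.score(x, y))
print("polynomial R^2:", poly.score(x_poly, y))
```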
Mean Squared Error (MSE) is the average of the squared errors over all values. In other words, it is the average of the squared differences between the predicted and actual values.
RMSE (Root Mean Squared Error) is the square root of the average of squared differences between predicted and actual values.
RMSE tends to increase more than MAE as the test sample size grows. In general, MAE stays steady, while RMSE increases as the variance of the error magnitudes increases.
Mean Absolute Error (MAE) is the average of the absolute (positive) errors over all values. In other words, it is the average of the absolute differences between the predicted and actual values.
MAPE (Mean Absolute Percent Error) computes the average absolute error in percentage terms. It can be defined as the percentage average of absolute or positive errors.
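The four metrics above can be computed directly from the definitions. Here is a small sketch with NumPy; y_true and y_pred are illustrative arrays, not real model output.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)                      # Mean Squared Error
rmse = np.sqrt(mse)                             # Root Mean Squared Error
mae = np.mean(np.abs(errors))                   # Mean Absolute Error
mape = np.mean(np.abs(errors / y_true)) * 100   # Mean Absolute Percent Error

print(mse, rmse, mae, mape)
```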
R-square or coefficient of determination is the measure of the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model.
The main problem with R-squared is that it always stays the same or increases when you add more variables, even if they add no real value. Adjusted R-squared helps here: it penalizes you for adding variables that do not improve your existing model.
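The following sketch shows both measures from their standard definitions; n is the number of observations and k the number of independent variables, and the function names are just for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    # Penalizes the score as the number of predictors k grows
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```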
Correlation measures the strength or degree of the relationship between two variables. It does not capture causality. It is summarized by a single value, the correlation coefficient.
Regression measures how one variable affects another. Regression is about model fitting: it captures causality and shows cause and effect. It is visualized by a fitted line.
Multicollinearity, also known as collinearity, is a phenomenon where two or more independent variables are highly correlated, i.e., one variable can be linearly predicted from the others. It measures the inter-correlation and inter-association among the independent variables.
Multicollinearity is caused by the inaccurate use of dummy variables or by including a variable that is computed from another variable in the data.
It distorts the regression coefficients and causes high standard errors. We can detect it using the correlation coefficient, the Variance Inflation Factor (VIF), and eigenvalues.
Variance inflation factors (VIF) measure how much the variance of an estimated regression coefficient is increased because of collinearity. It computes how much multicollinearity exists in a regression analysis.
It performs an ordinary least squares regression with Xi as a function of all the other explanatory (independent) variables and then calculates VIF using the formula VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared of that auxiliary regression.
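A minimal sketch of this calculation with statsmodels is shown below; the DataFrame of independent variables and its column names are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative independent variables (age and experience are strongly related)
X = pd.DataFrame({"age": [25, 32, 47, 51, 62],
                  "experience": [2, 8, 20, 25, 35],
                  "salary": [30, 45, 80, 90, 120]})
X_const = sm.add_constant(X)  # VIF is usually computed with an intercept term

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # a VIF well above 5-10 signals problematic multicollinearity
```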
Heteroscedasticity refers to the situation where the variability of a variable is unequal across the range of values of a second variable that predicts it. We can detect heteroscedasticity using graphs or statistical tests such as the Breusch-Pagan test and the NCV test.
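As a sketch of the Breusch-Pagan test with statsmodels, the snippet below generates data whose error variance grows with x and tests the residuals of a fitted OLS model; the data and thresholds are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)  # error spread grows with x (heteroscedastic)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)  # a small p-value suggests heteroscedasticity is present
```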
The Box-Cox transformation is a mathematical transformation of a variable that makes it approximately normally distributed. Box-Cox is used to transform skewed data into roughly normally distributed data.
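A minimal sketch with SciPy follows; the right-skewed sample is synthetic, and Box-Cox requires strictly positive data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data
skewed = np.random.default_rng(2).exponential(scale=2.0, size=1000)

# SciPy estimates the transformation parameter lambda from the data
transformed, fitted_lambda = stats.boxcox(skewed)
print(fitted_lambda)
```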
Linear regression has the following assumptions: a linear relationship between the dependent and independent variables, independence of the errors, homoscedasticity (constant variance of the errors), normally distributed residuals, and little or no multicollinearity among the independent variables.
The core objective of linear regression is to find the coefficients (α and β) by minimizing the error term. The model tries to minimize the sum of squared errors. This process is known as OLS. The OLS (Ordinary Least Squares) method corresponds to minimizing the sum of squared differences between the observed and predicted values.
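The sketch below solves this least-squares problem for a simple linear model Y = a + bx using NumPy's least-squares solver; the data is synthetic and only meant to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 4 + 2.5 * x + rng.normal(0, 1, 100)   # true intercept 4, true slope 2.5

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared errors
a, b = coef
print(a, b)  # estimated intercept and slope
```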
A normal distribution is a bell-shaped curve. It is a distribution that occurs naturally in many situations. For example, the bell curve is seen in tests like the SAT and GRE: the bulk of students will score around the average (C), while smaller numbers of students will score a B or a D.
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.
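As a quick illustration, the sketch below samples SAT-like scores from a normal distribution and checks how the values concentrate around the mean; the mean and standard deviation are made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
scores = rng.normal(loc=500, scale=100, size=10_000)  # illustrative test scores

# In a normal distribution roughly 68% of values fall within one standard deviation
within_one_sd = np.mean(np.abs(scores - 500) <= 100)
print(within_one_sd)
```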
A dummy variable is a numerical stand-in for a categorical independent variable. In regression analysis, such variables are known as dummy variables. It is also known as an indicator variable, categorical variable, binary variable, or qualitative variable. A column with n categories always needs n-1 dummy variables.
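A small sketch of dummy encoding with pandas is shown below; the 'city' column and its values are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Mumbai", "Delhi"]})

# drop_first=True keeps n-1 dummies for n categories, avoiding a redundant column
dummies = pd.get_dummies(df["city"], drop_first=True)
print(dummies)
```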
Random forest is a bagging algorithm that runs multiple decision trees independently in parallel. We select some samples from the data set, and a decision tree is generated for each sample. In a classification problem, it performs majority voting on the final predicted values of the multiple trees. In a regression problem, it takes the mean of the final predicted values from the multiple decision trees.
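A minimal sketch of random forest regression with scikit-learn follows; the synthetic data and the number of trees are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 1, 200)

# Each tree is trained on a bootstrap sample; regression predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```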
It is a first-order iterative optimization technique for finding the minimum of a function. It is an efficient optimization technique to find a local or global minimum.
Types of Gradient Descent: batch gradient descent, which uses the entire training set for every parameter update; stochastic gradient descent, which updates the parameters using one sample at a time; and mini-batch gradient descent, which updates using a small batch of samples.
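As a sketch, the snippet below implements batch gradient descent for simple linear regression on synthetic data; the learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 200)
y = 4 + 2.5 * x + rng.normal(0, 0.1, 200)  # true intercept 4, true slope 2.5

a, b = 0.0, 0.0   # initial coefficients
lr = 0.1          # learning rate
n = len(x)

for _ in range(2000):
    y_pred = a + b * x
    grad_a = (2 / n) * np.sum(y_pred - y)        # gradient of MSE w.r.t. a
    grad_b = (2 / n) * np.sum((y_pred - y) * x)  # gradient of MSE w.r.t. b
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # should approach the true intercept and slope
```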
The main disadvantage of linear regression is the assumption of linearity. It assumes a linear relationship between the input and output variables and fails to fit complex problems. It is sensitive to noise and outliers. It gets affected by multicollinearity.
Regularization is used to handle overfitting problems. It tries to balance bias and variance by penalizing more complex and flexible models. L1 and L2 are commonly used regularization techniques. L1 or LASSO (Least Absolute Shrinkage and Selection Operator) regression adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function.
L2 or Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss.
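The sketch below fits Lasso (L1) and Ridge (L2) models with scikit-learn on synthetic data; the alpha values are illustrative and would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can shrink some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero

print(lasso.coef_)
print(ridge.coef_)
```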
In this article, we have focused on the regression analysis interview questions. In the next article, we will focus on the interview questions related to the classification techniques.