Data Science Interview Questions Part-1

September 27, 2020 | October 8, 2020 | Avinash Navlani

Frequently asked data science interview questions and answers for fresher and experienced candidates applying for the Data Scientist role.

Data science is an interdisciplinary field that draws on statistics, machine learning, databases, visualization, and programming. This first article focuses on basic data science questions related to domain definitions. Let’s look at frequently asked interview questions for the Data Scientist and Data Analyst roles.

1. What is machine learning?

Machine learning is the science of getting computers to act without being explicitly programmed. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.

“Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell

2. What are Statistics and Data Mining?

Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data.

“Statistics is a body of methods for making wise decisions in the face of uncertainty.” — W.A. Wallis


Data mining discovers hidden patterns in data. It is a set of knowledge discovery tools that explore data and extract patterns and correlations.

3. What is classification?

Classification is a type of supervised learning. It is used when the target (dependent) variable is categorical: the algorithm learns a model that describes the classes and assigns new observations to them. For example, a bank loan officer needs to label loan applications as “safe” or “risky”; here, safe and risky are the two classes. Similarly, a sales manager wants to identify the customers who will purchase a product; here, purchase and not purchase are the two classes. Classification can also involve more than two classes, such as news article classification with the classes sports, politics, entertainment, business, and technology.
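As a minimal sketch of the loan example, the snippet below labels an applicant with the class of the most similar past applicant (a 1-nearest-neighbour rule). The features (income and debt), the numbers, and the function name `classify` are all made up for illustration; a real project would more likely use a library classifier.

```python
import math

# Hypothetical labelled examples: (annual income, existing debt) in $1000s -> class
training_data = [
    ((85.0, 5.0), "safe"),
    ((60.0, 10.0), "safe"),
    ((30.0, 25.0), "risky"),
    ((20.0, 15.0), "risky"),
]

def classify(applicant):
    """Return the class label of the closest training example (1-nearest neighbour)."""
    _, nearest_label = min(
        training_data, key=lambda item: math.dist(item[0], applicant)
    )
    return nearest_label

print(classify((75.0, 8.0)))   # close to the high-income, low-debt examples
print(classify((28.0, 22.0)))  # close to the low-income, high-debt examples
```

The target here is categorical (one of two labels), which is what makes this a classification problem.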

4. What is regression?

Regression is a type of supervised learning. It is used when the target (dependent) variable is continuous: the algorithm learns a model that describes the continuous behavior of the target. For example, a real estate agent wants to predict the price of a property; here, the property price is a continuous variable. Stock prices, temperatures, and oil prices are other examples of continuous targets.
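A small sketch of the property price example, fitting a straight line by least squares with only the standard library. The areas and prices are invented (the prices are exactly 3 times the area, so the fit is easy to check by hand), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: property area in square metres -> price in $1000s
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90.0 + intercept  # continuous prediction for a 90 m^2 property
print(slope, intercept, predicted_price)
```

The prediction is a number on a continuous scale rather than a class label, which is the difference from classification.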

5. What is the difference between Logistic and Linear Regression?

Linear regression is a regression algorithm, while logistic regression is a classification algorithm. Linear regression is used to predict continuous variables, while logistic regression is used to predict categorical variables. Examples of a continuous output are a house price and a stock price. Examples of a discrete output are predicting whether a patient has cancer and predicting whether a customer will churn. Linear regression is usually estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using Maximum Likelihood Estimation (MLE). Linear regression assumes normally distributed errors, while logistic regression assumes a binomially distributed response.
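To make the MLE side of this contrast concrete, here is a rough sketch of a one-feature logistic regression fitted by gradient ascent on the log-likelihood. The churn data, learning rate, and epoch count are assumptions chosen for illustration; libraries typically use more robust optimizers.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Maximise the Bernoulli log-likelihood by batch gradient ascent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-likelihood with respect to w and b
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w += learning_rate * grad_w
        b += learning_rate * grad_b
    return w, b

# Hypothetical data: months since last purchase -> churned (1) or stayed (0)
months = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
churned = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(months, churned)
p_short = sigmoid(w * 1.0 + b)  # estimated churn probability after 1 month
p_long = sigmoid(w * 9.0 + b)   # estimated churn probability after 9 months
print(p_short, p_long)
```

The output is a probability that is then thresholded into a class, whereas a linear regression would return the continuous value directly.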

6. What is deep learning?

Deep learning is a branch of machine learning based on neural networks with a layered architecture: multiple layers are stacked to learn complex structures in the data. It is loosely inspired by the way the human brain processes information and produces non-linear models for decision making. Deep learning is used for tasks such as speech, text, and image analytics. It attempts to model high-level abstractions in data using its layered architecture.
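The layered, non-linear idea can be illustrated with a tiny two-layer network that computes XOR, a function no single linear model can represent. The weights below are hand-picked for the illustration, not learned; a real deep learning model would learn them from data and have many more layers and units.

```python
def relu(values):
    """Non-linear activation: negative values become zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hand-picked (not trained) parameters for a 2-2-1 network
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]

def xor_network(x1, x2):
    hidden = relu(dense([x1, x2], HIDDEN_W, HIDDEN_B))  # layer 1 + non-linearity
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]         # layer 2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```

Removing the `relu` call would collapse the two layers into a single linear function, which is why the non-linearity between layers matters.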

7. What do you mean by model?

A machine learning model is the artifact created by the training process; it is the output of running a machine learning algorithm on data. The model comprises the parameters, such as coefficients and weights, learned during training.

8. What is data munging and data wrangling?

Data wrangling is sometimes referred to as data munging.

In recent years, the volume of unstructured and diversely formatted data has been increasing. Data wrangling is the process of cleaning, arranging, and transforming data into the desired structure for further analysis.

In Business Intelligence, Data Wrangling is converting raw data into a form useful for aggregation/consolidation during data analysis.

“Data wrangling is the process of transforming and mapping data from one ‘raw’ data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.” [Wikipedia]
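As a rough illustration of wrangling, the sketch below turns a few messy raw records into a consistent structure. The records, field names, and the "Unknown" placeholder are invented for the example; in practice this kind of work is often done with a library such as pandas.

```python
# Hypothetical raw records with stray whitespace, mixed casing, and missing values
raw_records = [
    {"name": "  Alice ", "age": "34", "city": "new york"},
    {"name": "BOB", "age": "", "city": "London "},
    {"name": "carol", "age": "29", "city": None},
]

def clean_record(record):
    """Trim whitespace, normalise casing, and convert types."""
    name = record["name"].strip().title()
    age = int(record["age"]) if record["age"] else None  # empty string -> missing
    city = record["city"].strip().title() if record["city"] else "Unknown"
    return {"name": name, "age": age, "city": city}

clean_records = [clean_record(r) for r in raw_records]
for row in clean_records:
    print(row)
```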

9. What is Feature Engineering?

Feature engineering is all about creating and transforming features to make better predictions. For example, sales of a product can be affected by the day of the week, so we can derive the day of the week from the date of purchase and use it as a feature.

In feature engineering, we perform operations such as handling missing values, handling outliers, transformation and encoding, feature splitting, and scaling.
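A short sketch of three of these operations on made-up purchase data: deriving the day of the week from the purchase date, filling a missing amount with the mean, and min-max scaling. The field names and the choice of mean imputation are assumptions for illustration only.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical purchases; one amount is missing
purchases = [
    {"purchase_date": date(2020, 9, 25), "amount": 120.0},
    {"purchase_date": date(2020, 9, 26), "amount": None},
    {"purchase_date": date(2020, 9, 27), "amount": 80.0},
]

# Statistics computed from the known amounts only
known = [p["amount"] for p in purchases if p["amount"] is not None]
mean_amount = sum(known) / len(known)
low, high = min(known), max(known)

features = []
for p in purchases:
    weekday = p["purchase_date"].weekday()  # Monday is 0, Sunday is 6
    amount = p["amount"] if p["amount"] is not None else mean_amount  # imputation
    features.append({
        "day_of_week": DAY_NAMES[weekday],               # new feature from the date
        "is_weekend": weekday >= 5,                      # another derived feature
        "amount_scaled": (amount - low) / (high - low),  # min-max scaling to [0, 1]
    })

for row in features:
    print(row)
```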

The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering. — Luca Massaron

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.[Wikipedia]

10. What are RDBMS and NoSQL?

RDBMS stands for Relational Database Management System. An RDBMS is a type of database management system used for tabular data: a relational database stores data in tables made up of rows and columns.

NoSQL stands for Not only SQL. Here, the data is non-tabular and is stored in non-relational formats such as key-value, columnar, document, and graph stores. NoSQL databases typically have flexible schemas, are easy to scale horizontally, and are well suited to modern web data.
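The contrast can be sketched with the standard library: `sqlite3` as a small relational database with a fixed table schema, and a plain dictionary of JSON strings standing in for a key-value store. The dictionary is only an analogy for the NoSQL data model, not a real NoSQL database, and the table, keys, and values are invented.

```python
import json
import sqlite3

# Relational: a fixed schema of rows and columns, queried with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Alice", "Pune"), (2, "Bob", "Delhi")],
)
rows = conn.execute("SELECT name FROM customers WHERE city = ?", ("Pune",)).fetchall()
conn.close()
print(rows)

# Key-value style: each value is a self-describing document, and
# different records may carry different fields (no shared schema)
kv_store = {}
kv_store["customer:1"] = json.dumps({"name": "Alice", "city": "Pune"})
kv_store["customer:2"] = json.dumps({"name": "Bob", "tags": ["premium"]})
bob = json.loads(kv_store["customer:2"])
print(bob)
```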

11. What do you mean by model performance?

Model performance is a very important aspect of machine learning, where we assess the quality of the model we have built. It helps us compare regressors or classifiers. Regressors are compared using R-squared, RMSE, MAE, and MAPE. Classifiers are compared using accuracy, error rate, precision, recall, and F1-score.

Apart from these metrics, other factors such as speed, robustness, scalability, and interpretability are also important.
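The metrics named above can be computed by hand on small made-up predictions, as in the sketch below. The helper names and the data are hypothetical, and the code assumes binary labels 0 and 1 with at least one predicted and one actual positive (otherwise precision or recall would divide by zero).

```python
import math

def classification_metrics(actual, predicted):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def regression_metrics(actual, predicted):
    """MAE, RMSE, MAPE (in percent), and R-squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "r2": 1 - ss_res / ss_tot,
    }

clf = classification_metrics([1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([100.0, 200.0, 300.0, 400.0], [110.0, 190.0, 310.0, 390.0])
print(clf)
print(reg)
```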

12. What are continuous and categorical variables?

Continuous variables are numeric variables that have an infinite number of values between any two values.

Categorical variables have a finite number of distinct groups. For example, gender, marital status, and payment mode.

13. What do you mean by Natural Language Processing?

NLP, or Natural Language Processing, is broadly used to automate the processing of natural language data such as speech and text. It helps in extracting meaning from human languages. NLP is applicable to many problems, from speech recognition, language translation, and document classification to information extraction. Analyzing movie reviews to find the sentiment of each review is one of the classic examples of NLP.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.[Wikipedia]
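The movie review example can be sketched with a toy lexicon-based approach: tokenize the review, count words from small positive and negative word lists, and compare the counts. The word lists are invented and far too small for real use; practical sentiment analysis usually relies on trained models.

```python
import re
from collections import Counter

# Tiny hypothetical sentiment lexicons
POSITIVE_WORDS = {"great", "good", "love", "excellent", "enjoyable"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "awful", "hate"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(review):
    """Label a review by comparing counts of positive and negative words."""
    counts = Counter(tokenize(review))
    score = sum(counts[w] for w in POSITIVE_WORDS) - sum(counts[w] for w in NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("A great cast and an excellent, enjoyable story!"))
print(sentiment("Boring plot and terrible acting."))
```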

14. What are Recommender Systems?

Recommender systems suggest items that are likely to be of interest to the user, such as movies, books, news articles, and other products or services. Recommender systems can be of three types: content-based, collaborative, and hybrid.

The content-based approach recommends items that are similar to items the user preferred or queried in the past. It relies on product features and textual item descriptions.

The collaborative method is based on the user’s social environment. It recommends items based on the opinions of other customers who have similar tastes or preferences as the user.
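A minimal sketch of the collaborative idea: find the user whose ratings are most similar (here by cosine similarity over commonly rated items) and recommend what that user rated but the target user has not seen. The users, movies, and ratings are invented, and real systems use many neighbours and weighting schemes rather than a single most similar user.

```python
import math

# Hypothetical user -> {item: rating} data
ratings = {
    "ann": {"Inception": 5, "Titanic": 1, "Interstellar": 5},
    "raj": {"Inception": 4, "Titanic": 2, "Interstellar": 5, "Gravity": 4},
    "mia": {"Inception": 1, "Titanic": 5, "Notebook": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Recommend unseen items rated by the single most similar user."""
    others = [
        (cosine_similarity(ratings[user], other_ratings), name)
        for name, other_ratings in ratings.items()
        if name != user
    ]
    _, most_similar = max(others)
    return sorted(item for item in ratings[most_similar] if item not in ratings[user])

print(recommend("ann"))
```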

15. What do you mean by a data warehouse and a data mart?

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process”

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. For example, a marketing data mart may confine its subjects to customers, items, and sales.

16. What do you mean by clustering?

Clustering is a form of unsupervised learning because it does not have a target variable or class label. Clustering divides the given observations into several groups (clusters) based on their similarity. Examples include segmenting customers and grouping supermarket products into categories such as cheese, meat products, and appliances.
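As a sketch of customer segmentation, the snippet below runs a simplified one-dimensional k-means on made-up monthly spending figures. The data, the starting centroids, and the fixed iteration count are assumptions; k-means is only one of several clustering algorithms and is named here as the illustration, not as the method the article prescribes.

```python
def k_means_1d(values, centroids, iterations=10):
    """Simplified k-means on one-dimensional data with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spending per customer; no labels are provided
spending = [12, 15, 18, 95, 100, 110]
centroids, clusters = k_means_1d(spending, centroids=[10.0, 50.0])
print(centroids)
print(clusters)
```

No target variable is used anywhere: the two groups emerge only from how close the values are to each other.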

Summary

In this article, we have focused on basic data science questions related to domain definitions. In the next article, we will focus on interview questions related to regression analysis.

Data Science Interview Questions Part-2 (Regression Analysis)