Top 25 frequently asked data science interview questions and answers on NLP and Text Mining for fresher and experienced Data Scientist, Data Analyst, Statistician, and Machine Learning Engineer job roles.
Data Science is an interdisciplinary field. It uses statistics, machine learning, databases, visualization, and programming. In this sixth article, we focus on NLP and Text Mining questions.
Let’s see the interview questions.
Natural language is a human language such as Hindi, English, and German. Designing computers that can understand natural languages is a challenging problem.
A computer language is a set of instructions used to produce a desired output, such as C, C++, Python, Julia, and Scala.
Natural language processing is one of the components of text mining. NLP helps in identifying sentiment, finding entities in a sentence, and categorizing a blog or article. Text mining is about exploring large amounts of textual data and finding patterns. It includes finding frequent words, measuring the length of sentences, and detecting the presence or absence of specific words.
NLP is a combination of NLU (Natural Language Understanding) and NLG (Natural Language Generation). NLU is used to understand the meaning of a given input text, based on its grammar and context. NLU focuses on sentiment, semantics, context, and intent. For example, the questions “what’s the weather like outside?” and “how’s the weather?” are both asking the same thing. NLG generates text based on structured data; for example, it can take data from a search result and render it in understandable language.
Business organizations want to understand the opinions of customers and the public. For example, what went wrong with their latest product? What do users and the general public think about its latest feature? Quantifying users’ content, ideas, beliefs, and opinions is known as sentiment analysis. It is not limited to marketing; it can also be utilized in politics, research, and security. Sentiment is more than words: it is a combination of words, tone, and writing style.
Tokenization is the process of splitting text into small pieces, called tokens, such as words or sentences. It typically ignores characters like punctuation marks (,. “ ‘) and spaces. Word tokenization breaks the text into individual words, and sentence tokenization breaks the text into individual sentences.
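For instance, here is a minimal sketch using NLTK, assuming the library is installed and the punkt tokenizer models have been downloaded (note that NLTK's word tokenizer actually keeps punctuation as separate tokens):

```python
# Sentence and word tokenization with NLTK.
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is fun. Tokenization splits text into tokens!"

print(sent_tokenize(text))  # ['NLP is fun.', 'Tokenization splits text into tokens!']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'Tokenization', ...]
```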
The Bag-of-Words (BoW) model is the simplest way of extracting features from text. BoW converts text into a matrix of the occurrence/frequency of words within a document. This model is only concerned with whether given words occur in the document or not. Tokens can also be combinations of two or more words, called the bigram or trigram model; the general approach is called the n-gram model. The n-gram model creates a matrix known as the Document-Term Matrix (DTM) or Term-Document Matrix (TDM).
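A minimal sketch with scikit-learn's CountVectorizer on a made-up two-document corpus; ngram_range=(1, 2) adds bigrams on top of unigrams:

```python
# Bag-of-words with unigrams and bigrams via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # unigram and bigram vocabulary
print(dtm.toarray())                       # counts per document
```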
TF-IDF (Term Frequency-Inverse Document Frequency) normalizes the document-term matrix. Term Frequency (TF) simply counts how often each word occurs in each document. Inverse Document Frequency (IDF) measures how much information a given word provides across the documents: it is the logarithm of the ratio of the total number of documents to the number of documents that contain the word. TF-IDF is the product of TF and IDF, so words common to every document get a low weight while words distinctive to a document get a high weight.
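A minimal sketch with scikit-learn's TfidfVectorizer, which performs the counting and the IDF weighting in one step (the two documents are made up for illustration):

```python
# TF-IDF weighting via scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# Words that appear in every document (e.g. "the", "sat", "on") receive a
# lower weight than words unique to one document (e.g. "cat", "dog").
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))
```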
Stemming involves simply lopping off easily-identified prefixes and suffixes to produce what’s often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect.
Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect.
Lemmatization looks at words and their roots (called lemmas) as described in a dictionary, so it is more precise than stemming. Stemming reduces word forms to (pseudo)stems, whereas lemmatization reduces word forms to linguistically valid lemmas. For example, the word “better” has “good” as its lemma; this link is missed by stemming, because finding it requires a dictionary look-up.
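A minimal sketch contrasting the two with NLTK, assuming the wordnet corpus has been downloaded:

```python
# Stemming vs. lemmatization with NLTK.
import nltk
nltk.download("wordnet")

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("connection"))               # 'connect' -- suffix lopped off
print(stemmer.stem("better"))                   # 'better' -- stemming misses the link
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' -- dictionary look-up finds it
```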
A word’s part of speech defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes the action. Identifying and tagging each word’s part of speech in the context of a sentence is called Part-of-Speech Tagging, or POS Tagging.
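A minimal sketch using NLTK's averaged-perceptron tagger, assuming the tagger models have been downloaded:

```python
# Part-of-speech tagging with NLTK.
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

from nltk import word_tokenize, pos_tag

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ...]
```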
Named Entity Recognition (NER), also called entity detection, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within an input string of text. This is really helpful for extracting information from text, since you can quickly pick out important topics or identify key sections of text.
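A minimal sketch with spaCy, assuming the small English model has been installed via python -m spacy download en_core_web_sm:

```python
# Named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded by Larry Page and Sergey Brin in California.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Google ORG, Larry Page PERSON, Sergey Brin PERSON, California GPE
```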
Topic modeling is a text mining technique that identifies co-occurring keywords to summarize large collections of textual information. It helps in discovering hidden topics in documents, annotating the documents with these topics, and organizing a large amount of unstructured data.
A single document can be associated with multiple themes. For example, a group of words such as ‘patient’, ‘doctor’, ‘disease’, ‘cancer’, and ‘health’ can represent the topic ‘healthcare’. Topic modeling is a different game from rule-based text searching that uses regular expressions.
LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), uses the bag-of-words (BoW) model, which results in a term-document matrix (the occurrence of terms in documents) whose rows represent terms and columns represent documents. LSA learns latent topics by performing a matrix decomposition on this matrix using Singular Value Decomposition (SVD), which factorizes the term-document matrix into three matrices: a left singular matrix, a diagonal matrix of singular values, and a right singular matrix. Truncating this decomposition is what makes LSA useful as a dimension-reduction or noise-reduction technique.
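A minimal LSA sketch: build a TF-IDF matrix and apply truncated SVD with scikit-learn (the four toy documents are made up for illustration):

```python
# LSA = TF-IDF matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the doctor treated the patient",
    "the patient has a disease",
    "the stock market fell today",
    "investors sold stock shares",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)          # document-term matrix

lsa = TruncatedSVD(n_components=2)     # keep 2 latent topics
doc_topics = lsa.fit_transform(X)      # document-topic weights

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    top = comp.argsort()[-3:][::-1]    # 3 strongest terms per latent topic
    print(f"Topic {i}:", [terms[j] for j in top])
```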
Latent Dirichlet Allocation (LDA) is an unsupervised algorithm for topic modeling. It helps users detect relationships between words and discover groups among them. The core idea behind topic modeling is that each document is represented by a distribution over topics, and each topic is represented by a distribution over words. In other words, words are first connected to topics, and topics are then connected to documents.
LDA is a kind of “generative probabilistic model”. It consists of two tables or matrices: the first describes the probability of selecting a particular word (part) when sampling a particular topic, and the second describes the probability of selecting a particular topic when sampling a particular document (composite).
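A minimal sketch using scikit-learn's LatentDirichletAllocation on raw word counts (LDA expects counts rather than TF-IDF; the toy corpus is made up):

```python
# Topic modeling with LDA in scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the doctor treated the patient for cancer",
    "the patient asked the doctor about the disease",
    "the team won the football match",
    "the player scored in the final match",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # per-document topic distribution

print(doc_topic.round(2))          # each row sums to ~1 across the 2 topics
```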
Masked language modeling is an example of autoencoding language modeling. It is a kind of fill-in-the-blank task: the model uses the context words surrounding a mask token to predict what word should be in place of the mask. Masked language modeling is useful for learning deep representations. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained model uses masked language modeling as a training objective.
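A minimal sketch using the Hugging Face transformers fill-mask pipeline (assuming transformers is installed; the model weights are downloaded on first use):

```python
# Fill-in-the-blank prediction with a pre-trained masked language model.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("Paris is the [MASK] of France.")

for p in predictions:
    print(p["token_str"], round(p["score"], 3))  # e.g. 'capital' with a high score
```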
In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts [Wikipedia]. Perplexity expresses the degree of confusion a model has when making predictions: it is the exponentiation of the entropy, so low perplexity is good and high perplexity is bad.
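A toy sketch of the computation: perplexity is the exponential of the average negative log-probability the model assigns to each word (the per-word probabilities below are made up):

```python
# Perplexity as the exponentiation of the average cross-entropy per word.
import math

# Hypothetical per-word probabilities assigned by some language model.
word_probs = [0.2, 0.1, 0.25, 0.05]

cross_entropy = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(cross_entropy)

print(round(perplexity, 2))  # lower perplexity = less "confused" model
```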
Text preprocessing involves cleaning, normalizing, and removing noise from the given input text. It typically includes the following steps: lowercasing the text, removing punctuation and special characters, removing stop words, tokenization, and stemming or lemmatization.
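A minimal sketch of such a pipeline with NLTK, assuming the punkt, stopwords, and wordnet resources have been downloaded:

```python
# A simple text-preprocessing pipeline.
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                          # normalize case
    tokens = word_tokenize(text)                                 # tokenize
    tokens = [t for t in tokens if t not in string.punctuation]  # drop punctuation
    tokens = [t for t in tokens if t not in stop_words]          # drop stop words
    return [lemmatizer.lemmatize(t) for t in tokens]             # lemmatize

print(preprocess("The cats are sitting on the mats!"))
# ['cat', 'sitting', 'mat']
```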
Word embedding converts text data (e.g., words, sentences, and paragraphs) into a vector representation, such that words with similar meanings have similar representations. A word embedding model uses an embedding layer to learn the vector representation of the given text via backpropagation. Some examples of word embedding techniques are Word2Vec, Sent2Vec, and Doc2Vec.
There are two models for learning word2vec embeddings: Continuous Bag-of-Words (CBOW), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.
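A minimal sketch training a tiny Word2Vec model with gensim (assuming gensim 4.x); the sg flag switches between the two variants, and the toy sentences are made up:

```python
# Training a small Word2Vec model with gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"].shape)                 # (50,) dense vector
print(model.wv.most_similar("king", topn=2))  # nearest words in vector space
```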
Classic word embeddings use very shallow language models and do not take the context of a word into account: a word gets the same vector regardless of the sentence it appears in.
LSTM stands for Long Short-Term Memory. It is an extension of RNNs (Recurrent Neural Networks) that mitigates the vanishing gradient problem of RNNs. In an RNN, the gradients become very small, or close to zero, as we move backward through the network, so neurons in the earlier layers learn very slowly. This is also the reason we avoid the sigmoid and tanh activation functions in deep hidden layers; nowadays, we mostly use ReLU (Rectified Linear Unit) in hidden layers and sigmoid at the output layer.
LSTM uses gates to overcome this problem. It uses three gates: an input gate, an output gate, and a forget gate. These gates control whether to store or delete (remember or forget) information.
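A minimal sketch of an LSTM text classifier in Keras, assuming TensorFlow/Keras 2.x; the vocabulary size, sequence length, and classification task are hypothetical:

```python
# A small LSTM binary text classifier in Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 10000, 100  # hypothetical values

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),  # learn word vectors
    LSTM(64),                                         # gated recurrent layer
    Dense(1, activation="sigmoid"),                   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```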
In a text classification problem, we have a set of texts and their respective labels, but we cannot feed raw text to a model directly; the text must first be converted into numbers or vectors of numbers. Common ways to vectorize text include bag-of-words counts, TF-IDF, and word embeddings.
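A minimal end-to-end sketch: TF-IDF features feeding a Naive Bayes classifier via a scikit-learn pipeline, on a tiny made-up dataset:

```python
# Text classification: TF-IDF features + Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great movie, loved it", "terrible film",
         "what a wonderful story", "awful acting, boring plot"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["a wonderful movie"]))  # likely ['pos']
```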
NLP offers many real-life applications, such as machine translation, chatbots and virtual assistants, sentiment analysis, speech recognition, spam detection, and text summarization.
Parsing helps us understand the structure of text data. A parser can analyze a sentence using a parse tree and verify the grammar of the sentence. The spaCy library implements dependency parsing.
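A minimal dependency-parsing sketch with spaCy (again assuming en_core_web_sm is installed):

```python
# Dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")

for token in doc:
    # Each token, its dependency relation, and the head it attaches to.
    print(token.text, token.dep_, token.head.text)
# e.g. The det cat / cat nsubj chased / chased ROOT chased / ...
```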
We can measure the similarity between two texts using cosine similarity (the angle between their vector representations) or Jaccard similarity (the overlap between their token sets).
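A minimal sketch computing both measures, with Jaccard on raw token sets and cosine on TF-IDF vectors via scikit-learn:

```python
# Jaccard and cosine similarity between two short texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "the cat sat on the mat"
b = "the cat lay on the mat"

# Jaccard: size of the intersection over size of the union of token sets.
set_a, set_b = set(a.split()), set(b.split())
jaccard = len(set_a & set_b) / len(set_a | set_b)

# Cosine: angle between the two TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform([a, b])
cosine = cosine_similarity(vectors[0], vectors[1])[0, 0]

print(round(jaccard, 2), round(cosine, 2))
```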
Regular expressions, or regex, are search patterns used to match text in textual data. They are helpful for extracting email addresses, hashtags, and phone numbers.
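A minimal sketch with Python's re module; the patterns are simplified for illustration, not fully RFC-compliant:

```python
# Extracting email addresses and hashtags with regular expressions.
import re

text = "Contact us at support@example.com or hr@example.org #hiring #NLP"

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
hashtags = re.findall(r"#\w+", text)

print(emails)    # ['support@example.com', 'hr@example.org']
print(hashtags)  # ['#hiring', '#NLP']
```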
A BiLSTM, or bidirectional LSTM, processes the input sequence in both forward and backward directions. It is a sequential processing model that consists of two LSTMs. This bidirectional structure gives the algorithm more contextual information: because it sees both past and future context, it often produces better results on tasks where understanding the surrounding context matters.
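A minimal sketch: the LSTM layer from the earlier example wrapped in Keras' Bidirectional wrapper (again assuming TensorFlow/Keras 2.x and hypothetical sizes):

```python
# Turning the earlier LSTM classifier into a BiLSTM.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

model = Sequential([
    Embedding(10000, 64, input_length=100),
    Bidirectional(LSTM(64)),         # forward + backward pass over the sequence
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```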
Syntactic analysis analyzes a sentence to understand its grammar rules and word order. It includes tasks such as parsing, word segmentation, morphological segmentation, stemming, and lemmatization.
Semantic analysis analyzes the meaning and interpretation of a text. It includes tasks such as named entity recognition, word sense disambiguation, and natural language generation.
In this article, we have focused on NLP and Text Analytics interview questions. In the next article, we will focus on the interview questions related to Basic Statistics.
Data Science Interview Questions Part-7 (Statistics)