In this tutorial, we will focus on Latent Dirichlet Allocation (LDA) and perform topic modeling using Scikit-learn. LDA is an unsupervised learning algorithm that discovers a blend of different themes or topics in a set of documents.
Latent Dirichlet Allocation is one of the most popular techniques for topic modeling. It can be viewed as a probabilistic matrix factorization approach: LDA decomposes the high-dimensional document-term matrix (DTM) into two lower-dimensional matrices, M1 and M2, where M1 is the document-topic matrix and M2 is the topic-term matrix.
In vector space, we can represent a collection of text documents as a document-term matrix. This m × n matrix has m documents D1, D2, D3, …, Dm and a vocabulary of n words W1, W2, W3, …, Wn. Each cell (i, j) holds the frequency count of word Wj in document Di.
LDA iterates for each word and tries to assign it to the best topic. The main idea behind LDA is that a document is a combination of topics and each topic is a combination of words.
LDA uses two probabilities, P1 and P2, for each word w in document d and each topic t:
P1: p(topic t | document d) = proportion of words in document d that are currently assigned to topic t.
P2: p(word w | topic t) = proportion of assignments to topic t, over all documents, that come from this word w.
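The two proportions can be computed by simple counting. The sketch below does so on a tiny, hypothetical set of word-to-topic assignments. The data and function names are assumptions for illustration, and the Dirichlet smoothing terms used by real LDA implementations are left out for clarity.

```python
from collections import Counter

# Hypothetical current assignments: assignments[d] is a list of
# (word, topic) pairs for document d
assignments = [
    [("police", 0), ("court", 0), ("election", 1), ("police", 0)],
    [("election", 1), ("labor", 1), ("court", 0)],
]

def p_topic_given_doc(doc_index, topic):
    """P1: share of words in the document currently assigned to the topic."""
    pairs = assignments[doc_index]
    return sum(1 for _, t in pairs if t == topic) / len(pairs)

def p_word_given_topic(word, topic):
    """P2: share of the topic's assignments, over all documents, from this word."""
    topic_words = Counter(
        w for doc in assignments for w, t in doc if t == topic
    )
    total = sum(topic_words.values())
    return topic_words[word] / total if total else 0.0

p1 = p_topic_given_doc(0, 0)          # 3 of 4 words in document 0
p2 = p_word_given_topic("police", 0)  # 2 of 4 assignments to topic 0
print(p1, p2)
```

In the usual sampling-style description of LDA, a word is more likely to be reassigned to a topic when both proportions are high for that topic. Note that Scikit-learn's `LatentDirichletAllocation` fits the model with variational inference rather than with this counting loop, so the sketch explains the intuition only.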
In this step, you will load the dataset. You can download data from the following link:
import os
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Load Dataset
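# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
documents_list = data['headline_text'].tolist()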
In this step, you will generate the TF-IDF matrix for given documents. Here, you will also perform preprocessing operations such as tokenization, and removing stopwords.
# Initialize regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')
# Vectorize document using TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range=(1, 1),
                        tokenizer=tokenizer.tokenize)
# Fit and Transform the documents
train_data = tfidf.fit_transform(documents_list)
Scikit-learn offers LatentDirichletAllocation for performing LDA on any document-term matrix (DTM). Let’s see the example below (it takes approximately 25 minutes on a local machine with 8 GB of RAM):
# Define the number of topics or components
num_components = 5
# Create LDA object
model = LatentDirichletAllocation(n_components=num_components)
# Fit the LDA model and transform the data into document-topic weights
lda_matrix = model.fit_transform(train_data)
# Get the topic-term components
lda_components = model.components_
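The two outputs correspond to the two lower-dimensional matrices described earlier. The sketch below checks their shapes on a tiny made-up corpus. The corpus, the `toy_` variable names, and the use of `CountVectorizer` with `random_state=0` are assumptions chosen so the example runs in well under a second.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus, used only to inspect the output shapes
toy_docs = [
    "police charge man over crash",
    "man in court on murder charge",
    "police search for missing man",
    "council backs new water plan",
    "govt health plan gets boost",
    "council seeks govt water funding",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_model = LatentDirichletAllocation(n_components=2, random_state=0)

# Document-topic matrix: one row per document, one column per topic
toy_doc_topic = toy_model.fit_transform(toy_counts)

# components_ holds unnormalized topic-term weights; dividing each row
# by its sum gives a word distribution per topic
toy_topic_term = toy_model.components_ / toy_model.components_.sum(
    axis=1, keepdims=True
)

print(toy_doc_topic.shape)
print(toy_topic_term.shape)
```

Each row of the document-topic matrix sums to 1, so it can be read as the topic mixture of that document. Each normalized row of the topic-term matrix can be read as the word mixture of that topic.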
After performing LDA, we need to extract the topics from the component matrix. Let’s see the example below:
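# Print the topics with their terms
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
terms = tfidf.get_feature_names_out()
# Number topics from 1 so the labels match the output below
for index, component in enumerate(lda_components, start=1):
    zipped = zip(terms, component)
    # Keep the 7 terms with the highest weight in this topic
    top_terms_key = sorted(zipped, key=lambda t: t[1], reverse=True)[:7]
    top_terms_list = list(dict(top_terms_key).keys())
    print(f"Topic {index}: {top_terms_list}")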
Output:
Topic 1: ['election', 'new', 'rural', 'national', 'labor', 'pm', 'says']
Topic 2: ['man', 'police', 'crash', 'charged', 'court', 'missing', 'murder']
Topic 3: ['govt', 'council', 'health', 'plan', 'new', 'water', 'boost']
Topic 4: ['country', 'interview', 'hour', 'drum', 'killed', 'police', 'death']
Topic 5: ['australia', 'day', 'world', 'interview', 'cup', 'win', 'weather']
In the above example, you can see the 5 topics. The keywords of Topic 1 (['election', 'new', 'rural', 'national', 'labor', 'pm', 'says']) suggest Election and Rural Issues. Similarly, Topic 2 (['man', 'police', 'crash', 'charged', 'court', 'missing', 'murder']) is about Crime, and Topic 3 is about Health and Water Planning. This is how you can identify topics from the list of terms. Here we have used 5 topics; you can try different numbers of topics and check whether the resulting topics make sense. For choosing the number of topics, you can also use topic coherence, which is explained in the Discovering Hidden Themes of Documents article, although that article uses LSI.
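One way to try different numbers of topics is to compare held-out perplexity, which Scikit-learn exposes through the model's `perplexity` method. This is a different measure from topic coherence, and it does not always agree with human judgment of topic quality. The sketch below is a minimal illustration under stated assumptions: the corpus is made up, raw counts from `CountVectorizer` are used, and the candidate values 2, 3, and 4 are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Made-up corpus; replace with your own documents
toy_corpus = [
    "police charge man over fatal crash",
    "man faces court on murder charge",
    "police search for missing woman",
    "court jails man over assault",
    "council backs new water plan",
    "govt health plan gets funding boost",
    "council seeks govt water funding",
    "health minister announces hospital plan",
]

corpus_counts = CountVectorizer(stop_words="english").fit_transform(toy_corpus)
train_counts, held_out_counts = train_test_split(
    corpus_counts, test_size=0.25, random_state=0
)

# Fit one model per candidate topic count and score it on held-out documents
perplexities = {}
for k in (2, 3, 4):
    candidate = LatentDirichletAllocation(n_components=k, random_state=0)
    candidate.fit(train_counts)
    perplexities[k] = candidate.perplexity(held_out_counts)

# Lower perplexity is better
best_k = min(perplexities, key=perplexities.get)
print(perplexities)
print("Lowest held-out perplexity at k =", best_k)
```

On a corpus this small the numbers are not meaningful; the point is the loop structure. Treat the result as a rough guide and still read the top terms of each topic before settling on a value.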
In this tutorial, you covered Latent Dirichlet Allocation using Scikit-learn. Compared with LDA, LSA is faster and easier to implement, but it is unable to capture the multiple meanings of a word, and its accuracy is lower than that of LDA (Latent Dirichlet Allocation). Topic modeling offers various use cases in Resume Summarization, Search Engine Optimization, Recommender System Optimization, Improving Customer Support, and the healthcare industry.