In this tutorial, we will focus on Latent Dirichlet Allocation (LDA) and perform topic modeling using Scikit-learn. LDA is an unsupervised learning algorithm that discovers a blend of different themes or topics in a set of documents.
What is Latent Dirichlet Allocation?
Latent Dirichlet Allocation is one of the most widely used techniques for topic modeling. It can be viewed as a probabilistic matrix factorization approach: LDA decomposes the high-dimensional document-term matrix (DTM) into two lower-dimensional matrices, M1 and M2. M1 is a document-topic matrix and M2 is a topic-term matrix.
In vector space, a collection of text documents can be represented as a document-term matrix. Here, an m × n matrix has m documents D1, D2, D3 … Dm and a vocabulary of n words W1, W2, W3 … Wn. Each cell holds the frequency count of word Wj in document Di.
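To make the document-term matrix concrete, here is a minimal sketch that builds one by hand for three short documents. The headlines are made up for illustration and are not taken from the dataset used later; the names `docs`, `vocab`, and `dtm` are likewise only illustrative.

```python
from collections import Counter

# Toy corpus (hypothetical headlines, not taken from the dataset)
docs = [
    "police charge man",
    "council water plan",
    "police search for missing man",
]

# Vocabulary: the sorted set of distinct words (the n columns)
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document (the m rows); each cell counts word Wj in document Di
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocab])

print(vocab)
for row in dtm:
    print(row)
```

With 3 documents and 9 distinct words, this gives a 3 × 9 matrix in which most cells are zero. Real document-term matrices are similarly sparse, which is why vectorizers usually return sparse matrices.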
How does LDA work?
LDA iterates over every word in every document and tries to assign it to the best topic. The main idea behind LDA is that a document is a combination of topics and each topic is a combination of words.
LDA uses two probabilities. The first is the probability of topic t given document d, estimated from the words in d that are currently assigned to t. The second is the probability of word w given topic t, estimated from the assignments to t across all documents.
P1: p(topic t | document d) = Proportion of words in document d that are currently assigned to topic t.
P2: p(word w | topic t) = Proportion of assignments to topic t over all documents that come from this word w.
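The two proportions can be sketched on a toy set of topic assignments. Everything here is an assumption made for illustration: the words, the assignments, and the helper names `p_topic_given_doc` and `p_word_given_topic`. Multiplying P1 by P2 to score a candidate topic follows the common Gibbs-sampling description of LDA; the sketch leaves out the smoothing priors a real sampler would add, and scikit-learn itself fits LDA with variational Bayes rather than this procedure.

```python
# Toy state: each document is a list of (word, currently assigned topic) pairs.
# The words and assignments are made up for illustration.
assignments = [
    [("police", 0), ("court", 0), ("election", 1)],
    [("election", 1), ("labor", 1), ("police", 0)],
]

def p_topic_given_doc(doc, topic):
    """P1: share of the words in one document assigned to `topic`."""
    return sum(1 for _, t in doc if t == topic) / len(doc)

def p_word_given_topic(all_docs, word, topic):
    """P2: share of all assignments to `topic` that come from `word`."""
    topic_words = [w for doc in all_docs for w, t in doc if t == topic]
    return topic_words.count(word) / len(topic_words)

# Score topic 0 for the word "police" in the first document
p1 = p_topic_given_doc(assignments[0], 0)
p2 = p_word_given_topic(assignments, "police", 0)
score = p1 * p2
print(p1, p2, score)
```

A higher score for a topic means the word is more likely to be reassigned to that topic on the next pass.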
Implement LDA using Scikit-learn
In this step, you will load the dataset. You can download data from the following link:
    import pandas as pd
    from nltk.tokenize import RegexpTokenizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Load Dataset
    # Skip malformed rows (on_bad_lines requires pandas >= 1.3;
    # older versions used error_bad_lines=False)
    data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
    documents_list = data['headline_text'].tolist()
Generate TF-IDF Features
In this step, you will generate the TF-IDF matrix for the given documents. Here, you will also perform preprocessing operations such as tokenization and stop-word removal.
    # Initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

    # Vectorize document using TF-IDF
    tfidf = TfidfVectorizer(lowercase=True,
                            stop_words='english',
                            ngram_range=(1, 1),
                            tokenizer=tokenizer.tokenize)

    # Fit and Transform the documents
    train_data = tfidf.fit_transform(documents_list)
Scikit-learn offers LatentDirichletAllocation for performing LDA on any document-term matrix (DTM). Let’s see the example below (this example takes approximately 25 minutes on a local machine with 8 GB of RAM):
    # Define the number of topics or components
    num_components = 5

    # Create LDA object
    model = LatentDirichletAllocation(n_components=num_components)

    # Fit and Transform LDA model on data (rows: documents, columns: topics)
    lda_matrix = model.fit_transform(train_data)

    # Get Components (rows: topics, columns: terms)
    lda_components = model.components_
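To see what the two outputs look like without waiting for the full run, here is a small sketch on a toy corpus. The headlines, the choice of 3 topics, the fixed `random_state`, and all variable names are assumptions made for illustration. It shows that `fit_transform` returns a documents × topics matrix, that `components_` is a topics × terms matrix, and that dividing each row of `components_` by its sum gives a word distribution per topic.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (hypothetical headlines) so the example runs in seconds
toy_docs = [
    "police charge man over crash",
    "man charged with murder in court",
    "council approves new water plan",
    "govt to boost health funding plan",
    "australia win world cup match",
    "weather delays world cup final",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_dtm = toy_tfidf.fit_transform(toy_docs)

# random_state is fixed only to make the run repeatable
toy_model = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = toy_model.fit_transform(toy_dtm)  # documents x topics
topic_term = toy_model.components_            # topics x terms

print(toy_dtm.shape, doc_topic.shape, topic_term.shape)

# Each row of doc_topic is a topic distribution for one document
print(doc_topic.sum(axis=1))

# components_ holds unnormalized weights; dividing each row by its sum
# gives a word distribution per topic
topic_word_dist = topic_term / topic_term.sum(axis=1, keepdims=True)

# Most likely topic for each document
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

On a corpus this small the topics themselves are not meaningful; the point is only the shapes and the normalization.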
Extract topics and terms
After performing LDA, we need to extract the topics from the component matrix. Let’s see the example below:
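    # Print the topics with their terms
    # get_feature_names_out() requires scikit-learn >= 1.0
    # (older versions used get_feature_names())
    terms = tfidf.get_feature_names_out().tolist()

    for index, component in enumerate(lda_components, start=1):
        # Pair each term with its weight and sort by weight, highest first
        zipped = zip(terms, component)
        top_terms_key = sorted(zipped, key=lambda t: t[1], reverse=True)[:7]
        top_terms_list = list(dict(top_terms_key).keys())
        print("Topic " + str(index) + ": ", top_terms_list)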
    Topic 1: ['election', 'new', 'rural', 'national', 'labor', 'pm', 'says']
    Topic 2: ['man', 'police', 'crash', 'charged', 'court', 'missing', 'murder']
    Topic 3: ['govt', 'council', 'health', 'plan', 'new', 'water', 'boost']
    Topic 4: ['country', 'interview', 'hour', 'drum', 'killed', 'police', 'death']
    Topic 5: ['australia', 'day', 'world', 'interview', 'cup', 'win', 'weather']
In the above example, you can see the 5 topics. The keywords of Topic 1 (['election', 'new', 'rural', 'national', 'labor', 'pm', 'says']) suggest election and rural issues. Similarly, Topic 2 (['man', 'police', 'crash', 'charged', 'court', 'missing', 'murder']) is about crime, and Topic 3 is about health and water planning. This is how you can identify topics from their lists of top terms. Here we used 5 topics; you can try different numbers of topics and check whether the resulting topics make sense. To choose the number of topics, you can also use topic coherence, explained in the Discovering Hidden Themes of Documents article, although that article uses LSI.
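One rough way to compare different numbers of topics within scikit-learn is the model's `perplexity` method (lower is better). This is a swapped-in measure, not the topic coherence mentioned above, and the sketch makes several assumptions: a made-up toy corpus, raw word counts from `CountVectorizer` instead of the TF-IDF features used earlier, and a tiny held-out set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy training and held-out headlines (hypothetical, for illustration only)
train_docs = [
    "police charge man over crash",
    "man charged with murder in court",
    "council approves new water plan",
    "govt to boost health funding plan",
    "australia win world cup match",
    "weather delays world cup final",
]
heldout_docs = [
    "police charge man with murder",
    "council water plan boost",
]

vectorizer = CountVectorizer(stop_words="english")
train_counts = vectorizer.fit_transform(train_docs)
heldout_counts = vectorizer.transform(heldout_docs)

# Fit one model per candidate topic count and score it on held-out data
perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_counts)
    perplexities[k] = lda.perplexity(heldout_counts)
    print(k, perplexities[k])
```

Perplexity does not always agree with how interpretable the topics look, so it is best used alongside a manual check of the top terms.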
In this tutorial, you covered Latent Dirichlet Allocation using Scikit-learn. By comparison, LSA is faster and easier to implement, but it cannot capture multiple meanings of a word, and its accuracy is lower than that of LDA (Latent Dirichlet Allocation). Topic modeling has use cases in resume summarization, search engine optimization, recommender system optimization, customer support improvement, and the healthcare industry.