Latent Semantic Indexing using Scikit-Learn

In this tutorial, we will focus on Latent Semantic Indexing or Latent Semantic Analysis and perform topic modeling using Scikit-learn. If you want to implement topic modeling using Gensim then you can refer to this Discovering Hidden Themes of Documents article.

What is Topic Modelling?

Topic Modelling is an unsupervised technique for discovering the themes of given documents. It extracts the set of co-occurring keywords. These co-occurring keywords represent a topic. For example, stock, market, equity, mutual funds will represent the ‘stock investment’ topic.

What is Latent Semantic Indexing?

Latent Semantic Indexing(LSI) or Latent Semantic Analysis (LSA) is a technique for extracting topics from given text documents. It discovers the relationship between terms and documents. LSI concept is utilized in grouping documents, information retrieval, and recommendation engines. LSI discovers latent topics using Singular Value Decomposition. 

Image for post
https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

Implement LSI using Scikit learn

Load Data

In this step, you will load the dataset. You can download data from the following link:

import os
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Load Dataset
documents_list = []
with open( os.path.join("articles.txt") ,"r") as fin:
    for line in fin.readlines():
        text = line.strip()
        documents_list.append(text)
Generate TF-IDF Features

In this step, you will generate the TF-IDF matrix for given documents. Here, you will also perform preprocessing operations such as tokenization, and removing stopwords.

# Initialize regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Vectorize document using TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1),
                        tokenizer = tokenizer.tokenize)

# Fit and Transform the documents
train_data = tfidf.fit_transform(documents_list)   
Perform SVD

SVD is a matrix decomposition technique that factorizes matrix in the product of matrices. Scikit-learn offers TruncatedSVD for performing SVD. Let’s see the example below:

# Define the number of topics or components
num_components=10

# Create SVD object
lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42)

# Fit SVD model on data
lsa.fit_transform(train_data)

# Get Singular values and Components 
Sigma = lsa.singular_values_ 
V_transpose = lsa.components_.T
Extract topics and terms

After performing SVD, we need to extract the topics from the component matrix. Let’s see the example below:

# Print the topics with their terms
terms = tfidf.get_feature_names()

for index, component in enumerate(lsa.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:5]
    top_terms_list=list(dict(top_terms_key).keys())
    print("Topic "+str(index)+": ",top_terms_list)

Output:

Topic 0:  ['s', 'trump', 'said', 'eu', 't']
Topic 1:  ['trump', 'clinton', 'republican', 'donald', 'cruz']
Topic 2:  ['s', 'league', 'season', 'min', 'leicester']
Topic 3:  ['eu', 'league', 'min', 'season', 'brexit']
Topic 4:  ['bank', 'banks', 'banking', 'rbs', 'financial']
Topic 5:  ['health', 'nhs', 'care', 'mental', 'patients']
Topic 6:  ['min', 'ball', 'corner', 'yards', 'goal']
Topic 7:  ['facebook', 'internet', 'online', 'users', 'twitter']
Topic 8:  ['film', 'films', 'movie', 'women', 'director']
Topic 9:  ['labour', 'party', 'bank', 'corbyn', 'film']

In the above example, you can see the 10 topics. If you see keywords of Topic 0([‘s’, ‘trump’, ‘said’, ‘EU’, ‘t’]) represents US Politics and Europe. Similarly, Topic 1 is about US Elections and Topic 2 is about Football League. This is how you can identify topics from the list of tags. Here we have taken 10 topics you can try with different topics and check the performance How it is making sense. For choosing a number of topics you can also use topic coherence explained in Discovering Hidden Themes of Documents article.

Summary

In this tutorial, you covered Latent Semantic Analysis using Scikit learn. LSA is faster and easy to implement. LSA unable to capture the multiple semantic of words. Its accuracy is lower than LDA( latent Dirichlet allocation). Topic modeling offers various usecases in Resume Summarization, Search Engine Optimization, Recommender System Optimization, Improving Customer Support, and the healthcare industry.

Leave a Reply

Your email address will not be published.