Latent Semantic Indexing using Scikit-Learn
In this tutorial, we will focus on Latent Semantic Indexing or Latent Semantic Analysis and perform topic modeling using Scikit-learn. If you want to implement topic modeling using Gensim then you can refer to this Discovering Hidden Themes of Documents article.
What is Topic Modelling?
Topic Modelling is an unsupervised technique for discovering the themes of given documents. It extracts the set of co-occurring keywords. These co-occurring keywords represent a topic. For example, stock, market, equity, mutual funds will represent the ‘stock investment’ topic.
What is Latent Semantic Indexing?
Latent Semantic Indexing(LSI) or Latent Semantic Analysis (LSA) is a technique for extracting topics from given text documents. It discovers the relationship between terms and documents. LSI concept is utilized in grouping documents, information retrieval, and recommendation engines. LSI discovers latent topics using Singular Value Decomposition.
Implement LSI using Scikit learn
In this step, you will load the dataset. You can download data from the following link:
import os import pandas as pd from nltk.tokenize import RegexpTokenizer from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import TruncatedSVD # Load Dataset documents_list =  with open( os.path.join("articles.txt") ,"r") as fin: for line in fin.readlines(): text = line.strip() documents_list.append(text)
Generate TF-IDF Features
In this step, you will generate the TF-IDF matrix for given documents. Here, you will also perform preprocessing operations such as tokenization, and removing stopwords.
# Initialize regex tokenizer tokenizer = RegexpTokenizer(r'\w+') # Vectorize document using TF-IDF tfidf = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range = (1,1), tokenizer = tokenizer.tokenize) # Fit and Transform the documents train_data = tfidf.fit_transform(documents_list)
SVD is a matrix decomposition technique that factorizes matrix in the product of matrices. Scikit-learn offers TruncatedSVD for performing SVD. Let’s see the example below:
# Define the number of topics or components num_components=10 # Create SVD object lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42) # Fit SVD model on data lsa.fit_transform(train_data) # Get Singular values and Components Sigma = lsa.singular_values_ V_transpose = lsa.components_.T
Extract topics and terms
After performing SVD, we need to extract the topics from the component matrix. Let’s see the example below:
# Print the topics with their terms terms = tfidf.get_feature_names() for index, component in enumerate(lsa.components_): zipped = zip(terms, component) top_terms_key=sorted(zipped, key = lambda t: t, reverse=True)[:5] top_terms_list=list(dict(top_terms_key).keys()) print("Topic "+str(index)+": ",top_terms_list)
Topic 0: ['s', 'trump', 'said', 'eu', 't'] Topic 1: ['trump', 'clinton', 'republican', 'donald', 'cruz'] Topic 2: ['s', 'league', 'season', 'min', 'leicester'] Topic 3: ['eu', 'league', 'min', 'season', 'brexit'] Topic 4: ['bank', 'banks', 'banking', 'rbs', 'financial'] Topic 5: ['health', 'nhs', 'care', 'mental', 'patients'] Topic 6: ['min', 'ball', 'corner', 'yards', 'goal'] Topic 7: ['facebook', 'internet', 'online', 'users', 'twitter'] Topic 8: ['film', 'films', 'movie', 'women', 'director'] Topic 9: ['labour', 'party', 'bank', 'corbyn', 'film']
In the above example, you can see the 10 topics. If you see keywords of Topic 0([‘s’, ‘trump’, ‘said’, ‘EU’, ‘t’]) represents US Politics and Europe. Similarly, Topic 1 is about US Elections and Topic 2 is about Football League. This is how you can identify topics from the list of tags. Here we have taken 10 topics you can try with different topics and check the performance How it is making sense. For choosing a number of topics you can also use topic coherence explained in Discovering Hidden Themes of Documents article.
In this tutorial, you covered Latent Semantic Analysis using Scikit learn. LSA is faster and easy to implement. LSA unable to capture the multiple semantic of words. Its accuracy is lower than LDA( latent Dirichlet allocation). Topic modeling offers various usecases in Resume Summarization, Search Engine Optimization, Recommender System Optimization, Improving Customer Support, and the healthcare industry.