In this tutorial, we will focus on the Gensim Python library for text analysis.
Gensim is an acronym for Generate Similar. It is a free, open-source Python library for natural language processing, written by Radim Rehurek, that is used for word embeddings, topic modeling, and text similarity. It was developed for generating word and document vectors and can also extract topics from text documents. It is scalable, robust, fast, platform-independent, and provides an efficient multicore implementation.
In this tutorial, we are going to cover the following topics: the Gensim dictionary, Bag of Words, TF-IDF, Word2Vec (including Google's pre-trained model), GloVe, FastText, and Doc2Vec.
Gensim is one of the most powerful libraries for natural language processing. It supports Bag of Words, TF-IDF, Word2Vec, Doc2Vec, and topic modeling. Let's install the library using the pip command:
pip install gensim
In this section, we will get started with Gensim by creating a dictionary object. First, we load the text data file. You can download it from the following link.
# open the text file as an object
file = open('hamlet.txt', encoding ='utf-8')
# read the file
text=file.read()
Now, we tokenize and preprocess the data using the string split() function and the simple_preprocess() function available in the gensim module.
# Tokenize data: handle punctuation and lowercase the text
from gensim.utils import simple_preprocess

# preprocess the file to get a list of tokens
token_list = []
for sentence in text.split('.'):
    # simple_preprocess returns a list of tokens for each sentence
    token_list.append(simple_preprocess(sentence, deacc=True))

print(token_list[:2])
Output:
[['the', 'tragedy', 'of', 'hamlet', 'prince', 'of', 'denmark', 'by', 'william', 'shakespeare', 'dramatis', 'personae', 'claudius', 'king', 'of', 'denmark'], ['marcellus', 'officer']]
In the above code block, we have tokenized and preprocessed the Hamlet text data.
After tokenization and preprocessing, we will create a gensim dictionary object from the tokenized text.
# Import gensim corpora
from gensim import corpora
# storing the extracted tokens into the dictionary
my_dictionary = corpora.Dictionary(token_list)
# print the dictionary
print(my_dictionary)
Output:
Dictionary(4593 unique tokens: ['by', 'claudius', 'denmark', 'dramatis', 'hamlet']...)
Here, the gensim dictionary stores all the unique tokens found in the corpus.
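Each token in the dictionary is mapped to a unique integer id. As a quick sketch (assuming the my_dictionary object created above, and that the illustrative tokens occur in the corpus), you can inspect this mapping through the token2id attribute:
# look up the integer ids assigned to a few tokens
print(my_dictionary.token2id['hamlet'])
print(my_dictionary.token2id['denmark'])
# number of unique tokens and number of processed documents
print(len(my_dictionary))
print(my_dictionary.num_docs)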
Now, we will see how to save and load the dictionary object.
# save your dictionary to disk
my_dictionary.save('dictionary.dict')
# load back
load_dict = corpora.Dictionary.load('dictionary.dict')
print(load_dict)
Output:
Dictionary(4593 unique tokens: ['by', 'claudius', 'denmark', 'dramatis', 'hamlet']...)
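Besides the binary format used above, the dictionary can also be stored in a human-readable text format. A minimal sketch (the file name dictionary.txt is arbitrary):
# save the dictionary as a plain-text file
my_dictionary.save_as_text('dictionary.txt')
# load it back from the text file
text_dict = corpora.Dictionary.load_from_text('dictionary.txt')
print(text_dict)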
The Bag-of-Words model (BoW) is the simplest way of extracting features from text. BoW converts text into a matrix of word occurrences within a document. This model is only concerned with whether given words occur in the document, not with their order.
Let's create a bag of words using the doc2bow() function for each tokenized sentence. As a result, we get a list of token ids with their frequencies.
# Converting to a bag of word corpus
BoW_corpus =[my_dictionary.doc2bow(sent, allow_update = True) for sent in token_list]
print(BoW_corpus[:2])
[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 3), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(13, 1), (14, 1)]]
In the above code, we have generated the bag of words. In the output, you can see the index and frequency of each token. If you want to replace each index with its token, you can try the following script:
# Word weight in Bag of Words corpus
word_weight = []
for doc in BoW_corpus:
    for id, freq in doc:
        word_weight.append([my_dictionary[id], freq])
print(word_weight[:10])
[['by', 1], ['claudius', 1], ['denmark', 2], ['dramatis', 1], ['hamlet', 1], ['king', 1], ['of', 3], ['personae', 1], ['prince', 1], ['shakespeare', 1]]
Here, you can see the list of tokens with their frequency.
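Once the dictionary is built, a new, unseen sentence can be converted into the same bag-of-words representation. Here is a minimal sketch (the sample sentence is made up for illustration; without allow_update, tokens that are not already in the dictionary are simply ignored):
# convert a new sentence into the existing BoW space
new_sentence = simple_preprocess("the prince of denmark speaks", deacc=True)
new_bow = my_dictionary.doc2bow(new_sentence)
print(new_bow)  # list of (token_id, frequency) pairs for known tokens only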
Let’s generate the TF-IDF features for the given BoW corpus.
from gensim.models import TfidfModel
import numpy as np

# create TF-IDF model
tfIdf = TfidfModel(BoW_corpus, smartirs='ntc')

# TF-IDF word weights
weight_tfidf = []
for doc in tfIdf[BoW_corpus]:
    for id, tf_idf in doc:
        weight_tfidf.append([my_dictionary[id], np.around(tf_idf, decimals=3)])
print(weight_tfidf[:10])
Output: [['by', 0.146], ['claudius', 0.31], ['denmark', 0.407], ['dramatis', 0.339], ['hamlet', 0.142], ['king', 0.117], ['of', 0.241], ['personae', 0.339], ['prince', 0.272], ['shakespeare', 0.339]]
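The trained TF-IDF model can also score a new document after it has been converted with doc2bow(). A small sketch (the sample sentence is illustrative):
# score a new document with the trained TF-IDF model
new_doc = simple_preprocess("hamlet speaks to the king of denmark", deacc=True)
new_doc_bow = my_dictionary.doc2bow(new_doc)
for id, score in tfIdf[new_doc_bow]:
    print(my_dictionary[id], round(score, 3))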
Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.
There are two main training methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram.
Continuous Bag of Words (CBOW) predicts the current word from the surrounding context words (for example, a few words before and after it). Skip-Gram takes the current word as input and predicts the words before and after it. Both methods train a shallow neural network to learn the word vectors. Skip-Gram works well with a small amount of data and represents rare words well. On the other hand, CBOW is faster and has better representations for more frequent words.
Let’s implement gensim Word2Vec in python:
# import Word2Vec model
from gensim.models import Word2Vec

# Create Word2Vec object
model = Word2Vec(sentences=token_list,
                 vector_size=100,
                 window=5,
                 min_count=1,
                 workers=4,
                 sg=0)  # CBOW

# Save model
model.save("word2vec.model")

# Load trained Word2Vec model
model = Word2Vec.load("word2vec.model")

# Generate vector
vector = model.wv['think']  # returns a numpy array
print(vector)
In the above code, we have built the Word2Vec model using Gensim. Here is a brief description of the parameters:
sentences: the tokenized training corpus.
vector_size: the dimensionality of the word vectors.
window: the maximum distance between the current and the predicted word within a sentence.
min_count: ignores all words with a total frequency lower than this value.
workers: the number of worker threads used to train the model.
sg: the training algorithm; 0 for CBOW and 1 for Skip-Gram.
Output:
[-0.27096474 -0.02201273 0.04375215 0.16169178 0.385864 -0.00830234
0.06216158 -0.14317605 0.17866768 0.13853565 -0.05782828 -0.24181016
-0.21526945 -0.34448552 -0.03946546 0.25111085 0.03826794 -0.31459117
0.05657561 -0.10587984 0.0904238 -0.1054946 -0.30354315 -0.12670684
-0.07937846 -0.09390186 0.01288407 -0.14465155 0.00734721 0.21977565
0.09089493 0.27880424 -0.12895903 0.03735492 -0.36632115 0.07415111
0.10245194 -0.25479802 0.04779665 -0.06959599 0.05201627 -0.08305986
-0.00901385 0.01109841 0.03884205 0.2771041 -0.17801927 -0.17918047
0.1551789 -0.04730623 -0.15239601 0.09148847 -0.16169599 0.07088429
-0.07817879 0.19048482 0.2557149 -0.2415944 0.17011274 0.11839501
0.1798175 0.05671703 0.03197689 0.27572715 -0.02063731 -0.04384637
-0.08028547 0.08083986 -0.3160063 -0.01283481 0.24992462 -0.04269576
-0.03815364 0.08519065 0.02496272 -0.07471556 0.17814435 0.1060199
-0.00525795 -0.08447327 0.09727245 0.01954588 0.055328 0.04693184
-0.04976451 -0.15165417 -0.19015886 0.16772328 0.02999189 -0.05189768
-0.0589773 0.07928728 -0.29813886 0.05149718 -0.14381753 -0.15011951
0.1745079 -0.14101334 -0.20089763 -0.13244842]
Let's find the most similar words for a given word:
# Finding most similar words
model.wv.most_similar('flash')
Output:
[('natural', 0.9266888499259949),
('visage', 0.9253427982330322),
('yea', 0.9249608516693115),
('pyrrhus', 0.9225320816040039),
('osric', 0.9224336743354797),
('honest', 0.9221916198730469),
('fly', 0.9220262169837952),
('work', 0.9218302369117737),
('woman', 0.9209260940551758),
('once', 0.9208680391311646)]
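Above, sg=0 selected the CBOW algorithm. To try the Skip-Gram method described earlier, you can simply retrain with sg=1. A short sketch reusing the same hyperparameters (results will differ from the CBOW model):
# Train a Skip-Gram variant of the model for comparison
sg_model = Word2Vec(sentences=token_list,
                    vector_size=100,
                    window=5,
                    min_count=1,
                    workers=4,
                    sg=1)  # Skip-Gram
# Most similar words according to the Skip-Gram model
print(sg_model.wv.most_similar('flash', topn=5))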
In this section, we will see how Google's pre-trained Word2Vec model can be used in Python. Here we use the gensim package as an interface to Word2Vec. This model is trained on a vocabulary of 3 million words and phrases drawn from about 100 billion words of the Google News dataset. The vector length for each word is 300. You can download Google's pre-trained model here.
Let’s load Google’s pre-trained model and print the shape of the vector:
from gensim.models import KeyedVectors

# load the pre-trained vectors in binary word2vec format
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model['reforms'].shape)
Output: (300,)
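Once the pre-trained vectors are loaded (this assumes the GoogleNews-vectors-negative300.bin file has already been downloaded locally), they can be queried just like a model trained from scratch. A minimal sketch:
# most similar words according to the pre-trained Google News vectors
print(model.most_similar('reforms', topn=5))
# cosine similarity between two words
print(model.similarity('king', 'queen'))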
GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm for generating vector representations of words. You can read more about GloVe in this research paper. It is a global log-bilinear regression model for the unsupervised learning of word representations. Here, we use a GloVe model trained on a Twitter dataset and load it through the gensim downloader API:
import gensim.downloader as api
# Download the model and return as object ready for use
model_glove_twitter = api.load("glove-twitter-25")
# Print shape of the vector
print(model_glove_twitter['reforms'].shape)
# Print vector for word 'reform'
print(model_glove_twitter['reforms'])
Output: (25,) [ 0.37207 0.91542 -1.6257 -0.15803 0.38455 -1.3252 -0.74057 -2.095 1.0401 -0.0027519 0.33633 -0.085222 -2.1703 0.91529 0.77599 -0.87018 -0.97346 0.68114 0.71777 -0.99392 0.028837 0.24823 -0.50573 -0.44954 -0.52987 ]
# get similar items
model_glove_twitter.most_similar("policies",topn=10)
Output: [('policy', 0.9484813213348389), ('reforms', 0.9403933882713318), ('laws', 0.94012051820755), ('government', 0.9230710864067078), ('regulations', 0.9168934226036072), ('economy', 0.9110006093978882), ('immigration', 0.9105909466743469), ('legislation', 0.9089651107788086), ('govt', 0.9054746627807617), ('regulation', 0.9050778746604919)]
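GloVe vectors loaded as KeyedVectors also support simple vector arithmetic for word analogies. A small sketch using the classic king - man + woman example (results from the 25-dimensional Twitter model may be noisy):
# word analogy: king - man + woman ~ queen
print(model_glove_twitter.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))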
FastText is an extension of the Word2Vec model proposed by Facebook in 2016. FastText splits words into character n-grams instead of using each word as a single unit. It uses a neural network to train the model. The core advantage of this technique is that it can easily represent rare words, because some of their n-grams may also appear in other trained words. Let's see how to use FastText with Gensim in the following section.
# Import FastText
from gensim.models import FastText

# Create FastText model object
model = FastText(vector_size=50, window=3, min_count=1)  # instantiate

# Build vocabulary
model.build_vocab(corpus_iterable=token_list)

# Train FastText model
model.train(corpus_iterable=token_list, total_examples=len(token_list), epochs=10)  # train

model.wv['policy']
Output: array([-0.328225 , 0.2092654 , 0.09407859, -0.08436475, -0.18087168, -0.19953477, -0.3864786 , 0.08250062, 0.08613443, -0.14634965, 0.18207662, 0.20164935, 0.32687476, 0.05913997, -0.04142053, 0.01215196, 0.07229924, -0.3253025 , -0.15895212, 0.07037129, -0.02852136, 0.01954574, -0.04170248, -0.08522341, 0.06419735, -0.16668107, 0.11975338, -0.00493952, 0.0261423 , -0.07769344, -0.20510232, -0.05951802, -0.3080587 , -0.13712431, 0.18453395, 0.06305533, -0.14400929, -0.07675331, 0.03025392, 0.34340212, -0.10817952, 0.25738955, 0.00591787, -0.04097764, 0.11635819, -0.634932 , -0.367688 , -0.19727138, -0.1194628 , 0.00743668], dtype=float32)
# Finding most similar words
model.wv.most_similar('present')
Output: [('presentment', 0.999993622303009), ('presently', 0.9999920725822449), ('moment', 0.9999914169311523), ('presence', 0.9999902248382568), ('sent', 0.999988317489624), ('whose', 0.9999880194664001), ('bent', 0.9999875426292419), ('element', 0.9999874234199524), ('precedent', 0.9999873042106628), ('gent', 0.9999872446060181)]
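The key property mentioned above, representing rare or unseen words through character n-grams, can be checked directly: FastText can return a vector even for a word that never appeared in the training corpus. A quick sketch (the misspelled word below is just an illustration):
# vector for an out-of-vocabulary word, built from its character n-grams
oov_vector = model.wv['presentt']  # misspelled / unseen word
print(oov_vector.shape)
# check whether the word was actually in the training vocabulary
print('presentt' in model.wv.key_to_index)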
Doc2Vec is used to represent whole documents as vectors. It is a generalization of the Word2Vec approach, so to dive deeper into Doc2Vec you should first understand how word vectors (Word2Vec) are generated. Doc2Vec learns document vectors by predicting words from numerous sample contexts of the original paragraph, and it captures the semantics of the text.
documents = text.split(".")
documents[:5]

from collections import namedtuple

# Transform data (you can add more data preprocessing steps)
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, doc_text in enumerate(documents):
    words = doc_text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

print(docs[:2])
Output:
[AnalyzedDocument(words=['the', 'tragedy', 'of', 'hamlet,', 'prince', 'of', 'denmark', 'by', 'william', 'shakespeare', 'dramatis', 'personae', 'claudius,', 'king', 'of', 'denmark'], tags=[0]), AnalyzedDocument(words=['marcellus,', 'officer'], tags=[1])]
In the above code block, we first create the documents using the namedtuple collection. A namedtuple is a lightweight data structure, similar to a class without its full details; you can also think of it as a dictionary-like object with fixed keys and values. After this, let's create a model for Doc2Vec:
from gensim.models import doc2vec

# Create and train the Doc2Vec model
model = doc2vec.Doc2Vec(docs,
                        vector_size=100,
                        window=5,
                        min_count=1,
                        workers=4,
                        dm=0)  # PV-DBOW

# Infer a vector for a new (tokenized) document
vector = model.infer_vector(['the', 'tragedy', 'of', 'hamlet,', 'prince', 'of', 'denmark', 'by', 'william',
                             'shakespeare', 'dramatis', 'personae', 'claudius,', 'king', 'of', 'denmark'])
print(vector)
Output: [-1.5818793e-02 1.3085594e-02 -1.1896869e-02 -3.0695410e-03 1.5006907e-03 -1.3316960e-02 -5.6281965e-03 3.1253812e-03 -4.0207659e-03 -9.0181744e-03 1.2115648e-02 -1.2316694e-02 9.3884282e-03 -1.2136344e-02 9.3199247e-03 6.0257949e-03 -1.1087678e-02 -1.6263386e-02 3.0145817e-03 9.2168162e-03 -3.1892660e-03 2.5632046e-03 4.1057081e-03 -1.1103139e-02 -4.4368235e-03 9.3003511e-03 -1.9984354e-05 4.6007405e-03 4.5250896e-03 1.4299035e-02 6.4971978e-03 1.3330076e-02 1.6638277e-02 -8.3673699e-03 1.4617097e-03 -8.7684026e-04 -5.3776056e-04 1.2898060e-02 5.5408065e-04 6.9614425e-03 2.9868495e-03 -1.3385005e-03 -3.4805303e-03 1.0777158e-02 -1.1053825e-02 -8.0987150e-03 3.1651056e-03 -3.6159047e-04 -3.0776947e-03 4.9342304e-03 -1.1290920e-03 -4.8262491e-03 -9.2841331e-03 -1.4540913e-03 -1.0785381e-02 -1.7799810e-02 3.4300602e-04 2.4301475e-03 6.0869306e-03 -4.3078070e-03 2.9106432e-04 1.3333942e-03 -7.1321065e-03 4.3218113e-03 7.5919051e-03 1.7675487e-03 1.9759729e-03 -1.6749580e-03 2.5316922e-03 -7.4808724e-04 -7.0081712e-03 -7.2277770e-03 2.1022926e-03 -7.2621077e-04 1.6523260e-03 7.7043297e-03 4.9248277e-03 9.8303892e-03 4.2252508e-03 3.9137071e-03 -6.4144642e-03 -1.5699258e-03 1.5538614e-02 -1.8792158e-03 -2.2203794e-03 6.2514015e-04 9.6203719e-04 -1.5944529e-02 -1.8801112e-03 -2.8503922e-04 -4.4923062e-03 8.4128296e-03 -2.0803667e-03 1.6383808e-02 -1.6173380e-04 3.9917473e-03 1.2395959e-02 9.2958640e-03 -1.7370760e-03 -4.5007761e-04]
In the above code, we have built the Doc2Vec model using Gensim. Here is a brief description of the parameters:
docs: the tagged training documents, each with a list of words and a tag.
vector_size: the dimensionality of the document vectors.
window: the maximum distance between the current and the predicted word within a document.
min_count: ignores all words with a total frequency lower than this value.
workers: the number of worker threads used to train the model.
dm: the training algorithm; 1 for distributed memory (PV-DM) and 0 for distributed bag of words (PV-DBOW).
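The inferred vector can also be used to look up the most similar documents in the training set through the model's document vectors. A minimal sketch, assuming gensim 4.x (where the document vectors are exposed as model.dv) and the integer tags created above:
# find training documents most similar to the inferred vector
similar_docs = model.dv.most_similar([vector], topn=3)
for doc_id, score in similar_docs:
    print(doc_id, round(score, 3), ' '.join(docs[doc_id].words[:10]))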
Congratulations, you have made it to the end of this tutorial!
In this article, we have learned about the Gensim dictionary, Bag of Words, TF-IDF, Word2Vec, and Doc2Vec. We have also covered Google's Word2Vec, Stanford's GloVe, and Facebook's FastText. We have performed all the experiments using the gensim library. Of course, this is just the beginning, and there is a lot more that we can do with Gensim in natural language processing. You can check out this article on topic modeling.