Text Analytics for Beginners using Python TextBlob
TextBlob is a Python library for text analytics and natural language processing operations such as PoS tagging, noun phrase extraction, sentiment analysis, parsing, and text classification.
TextBlob is easy for beginners to learn and use. It is built on top of NLTK and Pattern and adds a few extra features behind a simpler interface. NLP operations such as parsing, noun phrase extraction, sentiment analysis, and spelling correction are easier to carry out with TextBlob than with NLTK alone.
In this tutorial, we will focus on the TextBlob library. We will perform tokenization, noun phrase extraction, sentiment analysis, spelling correction, translation, and text classification using TextBlob. If you want to learn spaCy and NLTK, see the separate spaCy and NLTK articles.
We need to install TextBlob before proceeding further. We can do this with the following command:
pip install textblob
You can also install TextBlob from a Jupyter Notebook. Put ! in front of the command to let the notebook know that it should be run as a command-line command:
!pip install textblob
Several TextBlob features (tokenization, PoS tagging, noun phrase extraction) rely on NLTK corpora. If they are not installed yet, download them with:
python -m textblob.download_corpora
Tokenization is the process of splitting a text document into small pieces, known as tokens. TextBlob's word tokenizer drops punctuation and whitespace from the text. Let's see a word tokenization example in the code below:
# Import TextBlob
from textblob import TextBlob

# Create TextBlob object
text = TextBlob("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could. I tried to be the best role model that I possibly could.")

# Print the tokens
print(text.words)
Output: ['I', 'want', 'to', 'be', 'remembered', 'not', 'only', 'as', 'an', 'entertainer', 'but', 'as', 'a', 'person', 'who', 'cared', 'a', 'lot', 'and', 'I', 'gave', 'the', 'best', 'that', 'I', 'could', 'I', 'tried', 'to', 'be', 'the', 'best', 'role', 'model', 'that', 'I', 'possibly', 'could']
Let's try a sentence tokenization example in the code cell below:
# Print the tokenized sentences
print(text.sentences)
Output: [Sentence("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could."), Sentence("I tried to be the best role model that I possibly could.")]
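To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's standard library. It is not how TextBlob tokenizes (TextBlob delegates to NLTK tokenizers); the regular expressions and the function names word_tokens and sentence_tokens are assumptions chosen for illustration.

```python
import re

def word_tokens(text):
    """Return runs of letters, digits and apostrophes, dropping punctuation and spaces."""
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokens(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "I gave the best that I could. I tried to be the best role model."
print(word_tokens(sample))
print(sentence_tokens(sample))
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer like the one TextBlob uses.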
A noun phrase is a group of words built around a noun. It can act as the subject or the object of a sentence. Let's see an example in the code below:
# Print noun phrases
print(text.noun_phrases)
Output: ['role model']
Part of Speech (POS) Tagging
A part of speech (PoS) describes the function of a word in a sentence. For example, a verb identifies an action, a noun identifies an object, and an adjective describes it. Assigning such labels to each word in the text is called PoS tagging. Let's see an example in the code below:
# Print PoS tags
print(text.tags)
Output: [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('be', 'VB'), ('remembered', 'VBN'), ('not', 'RB'), ('only', 'RB'), ('as', 'IN'), ('an', 'DT'), ('entertainer', 'NN'), ('but', 'CC'), ('as', 'IN'), ('a', 'DT'), ('person', 'NN'), ('who', 'WP'), ('cared', 'VBD'), ('a', 'DT'), ('lot', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('gave', 'VBD'), ('the', 'DT'), ('best', 'JJS'), ('that', 'IN'), ('I', 'PRP'), ('could', 'MD'), ('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('best', 'JJS'), ('role', 'NN'), ('model', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('possibly', 'RB'), ('could', 'MD')]
Lemmatization is the process of normalizing text in a linguistic manner. Unlike stemming, which simply chops off word endings, it returns the dictionary form (lemma) of a given word with the help of a vocabulary and morphological analysis. Let's see an example in the code below:
# Import Word
from textblob import Word

# Create Word object
w = Word("remembered")

# Print the lemmatized word, treating it as a verb ("v")
print(w.lemmatize("v"))
Finding a word and counting its occurrence
TextBlob has a find() function for locating a string in the text, and its word list has a count() function for counting the occurrences of a word. Let's see an example in the code below:
# Find a string: returns the start index of that string in the original text
print(text.find("care"))
n-gram and bag-of-words models represent a text document by the frequency of its words or word sequences. Let's count how often a word occurs in the code below:
# Count the number of times "I" appears
print(text.words.count('I'))
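The difference between word counts and n-grams is easier to see side by side. The sketch below uses only the standard library; the helper name ngrams and the whitespace tokenization are assumptions for illustration. TextBlob objects also provide an ngrams(n) method for the same purpose.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I tried to be the best role model that I possibly could".split()

# Bag of words: how often each token occurs (word order is ignored).
word_counts = Counter(tokens)
print(word_counts["I"])

# Bigrams keep local word order.
bigrams = ngrams(tokens, 2)
print(bigrams[:3])
```

A bag of words is the special case n = 1; larger n preserves more context at the cost of sparser counts.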
In TextBlob, the sentiment property returns a namedtuple with two scores: polarity and subjectivity. The polarity score lies between -1 and +1. Negative values indicate a negative sentiment or opinion, while positive values indicate a positive one. Subjectivity ranges from 0 to 1, where 0 means objective and 1 means subjective.
TextBlob offers two implementations of sentiment analysis: the default PatternAnalyzer, based on the Pattern library, and NaiveBayesAnalyzer, an NLTK classifier trained on a movie reviews corpus. Let's see an example with the default analyzer in the code below:
# Print the polarity and subjectivity
print(text.sentiment)
Output: Sentiment(polarity=0.5, subjectivity=0.65)
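To get a feeling for where such scores can come from, here is a rough sketch of a lexicon-based scorer that averages per-word values. It only illustrates the general idea and is not TextBlob's or Pattern's actual algorithm; the lexicon entries, their scores, and the function name sentiment are all made up for this example.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Values are invented.
LEXICON = {
    "best": (1.0, 0.3),
    "good": (0.7, 0.6),
    "horrible": (-1.0, 1.0),
    "tired": (-0.4, 0.7),
}

def sentiment(text):
    """Average the scores of the words that appear in the lexicon."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return (0.0, 0.0)  # no known words: treated as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return (polarity, subjectivity)

print(sentiment("my boss is horrible"))
print(sentiment("the best and good work"))
```

Because every lexicon value stays within the allowed range, the averages do too: polarity in [-1, +1] and subjectivity in [0, 1].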
TextBlob offers spell correction using the correct() function. Let’s see an example in the below code:
# Create TextBlob object
b = TextBlob("I havv goood speling!")

# Print the corrected text
print(b.correct())
Output: I have good spelling!
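According to the TextBlob documentation, correct() is based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows the core of that idea under simplifying assumptions: it only considers candidates one edit away, and the word frequencies in WORD_FREQ are invented for this example, whereas a real corrector learns them from a large corpus.

```python
from collections import Counter

# Hypothetical word frequencies, made up for illustration.
WORD_FREQ = Counter({"i": 50, "have": 30, "good": 20, "spelling": 5, "spewing": 1})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(" ".join(correct(w) for w in "i havv goood speling".split()))
```

Note that "speling" is one edit away from both "spelling" and "spewing"; the more frequent word wins, which is why the quality of the frequency table matters.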
Language Detection and Translation
TextBlob offers a detect_language() function for detecting the language of a text and a translate() function for translating text from one language to another. Both use the Google Translate API, so they require an internet connection. Note that both functions were deprecated in TextBlob 0.16.0 and are not available in recent releases, so the code below needs an older version of the library.
# Create TextBlob object
text = TextBlob("नमस्ते, आप कैसे हैं?")

# Detect language
print(text.detect_language())

# Translate into English
print(text.translate(to='en'))
Hello. How are you?
Text Classification using TextBlob
In this section, we will focus on text classification which is one of the most important NLP techniques. Text classification will help us in various applications such as document classification, sentiment classification, predicting review rating, spam filtering, support tickets classification, and fake news classification.
In this section, our main objective is to prepare a dataset. Let's prepare the data by writing each sentence and its sentiment label as a (text, label) tuple:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
In this section, we create a Naive Bayes classifier using TextBlob and train it on the training data.
# Import NaiveBayes Classifier
from textblob.classifiers import NaiveBayesClassifier

# Perform model training
cl = NaiveBayesClassifier(train)
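Before a classifier can learn from sentences, each sentence has to be turned into features. TextBlob's default feature extractor produces boolean "contains(word)" features over the words seen in the training set; the sketch below approximates that idea with the standard library only. The function names, the whitespace tokenization, and the toy data are assumptions for illustration, not TextBlob's actual code.

```python
def build_vocabulary(train_set):
    """Collect every distinct lower-cased word seen in the training sentences."""
    return sorted({w for sentence, _ in train_set for w in sentence.lower().split()})

def contains_features(document, vocabulary):
    """Map a document to boolean 'contains(word)' features over the vocabulary."""
    tokens = set(document.lower().split())
    return {f"contains({w})": (w in tokens) for w in vocabulary}

toy_train = [("I love this sandwich", "pos"), ("I am tired of this stuff", "neg")]
vocab = build_vocabulary(toy_train)
features = contains_features("I love this library", vocab)

# Show only the features that are switched on for this document.
print({name: value for name, value in features.items() if value})
```

Words that never occurred in the training data, such as "library" here, produce no feature at all, so they cannot influence the prediction.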
Let's make predictions on some input sentences in the code below:
# Make prediction
print(cl.classify("This is an amazing library!"))
print(cl.classify("Gary is a friend of mine."))
Let’s evaluate the model performance using the accuracy method:
# Evaluate the model
print(cl.accuracy(test))
In the above code, we assessed the performance using the accuracy measure and obtained 83.33% accuracy, which means 5 of the 6 test sentences were classified correctly.
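Accuracy is simply the share of test items whose predicted label matches the expected label. Here is a small sketch of that calculation; the function name accuracy and the predicted labels are hypothetical and are not the classifier's real output, they only reproduce the case of one mistake in six sentences.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must have the same length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical predictions for six test sentences: exactly one of them is wrong.
expected = ["pos", "neg", "neg", "pos", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg", "neg"]
print(f"{accuracy(predicted, expected):.2%}")
```

With such a small test set, a single sentence changes the score by almost 17 percentage points, so the number should be read with caution.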
Let’s retrain the model using the update method. First, we will prepare the new dataset and then update the previously trained model.
# Prepare new data
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

# Update (retrain) the model with the new data
cl.update(new_data)

# Test the model
print(cl.classify("Gary is a friend of mine."))
Calculate Class Probabilities
We can also calculate the probabilities for predicted classes using the prob_classify(text) function. Let’s see the example below for detailed understanding:
cl = NaiveBayesClassifier(train)
prob_dist = cl.prob_classify("I feel happy this morning.")
print("Positive and Negative Probabilities:", prob_dist.prob("pos"), prob_dist.prob("neg"))
print("Largest Probability:", prob_dist.max())
Output:
Positive and Negative Probabilities: 0.9256990307165033 0.07430096928349576
Largest Probability: pos
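To see how a Naive Bayes model arrives at class probabilities, the sketch below implements a simplified word-count variant with add-one (Laplace) smoothing using only the standard library. It is an assumption-laden illustration: TextBlob relies on NLTK's implementation, which uses boolean features and a different smoothing scheme, so the numbers will not match TextBlob's output. The names train_nb and prob_classify and the toy data are chosen for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(train_set):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for sentence, label in train_set:
        label_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def prob_classify(text, model):
    """Return {label: P(label | text)} using add-one (Laplace) smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way so the probabilities sum to 1.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

toy_train = [
    ("I love this sandwich", "pos"),
    ("this is an amazing place", "pos"),
    ("I do not like this restaurant", "neg"),
    ("my boss is horrible", "neg"),
]
probs = prob_classify("an amazing sandwich", train_nb(toy_train))
print(probs, max(probs, key=probs.get))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that was never seen with one label from forcing that label's probability to zero.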
Decision Tree Classifier
Let's train a model with TextBlob's Decision Tree classifier and evaluate its performance using the accuracy method.
# Import Decision Tree
from textblob.classifiers import DecisionTreeClassifier

# Create Decision Tree Classifier
dt = DecisionTreeClassifier(train)

# Test the model
print(dt.accuracy(test))
Maximum Entropy Classifier
Let's train a model with TextBlob's Maximum Entropy classifier and evaluate its performance using the accuracy method.
# Import MaxEntClassifier
from textblob.classifiers import MaxEntClassifier

# Create Maximum Entropy Classifier
me = MaxEntClassifier(train)

# Test the model
print(me.accuracy(test))
Pros and Cons
TextBlob is built on top of the NLTK and Pattern libraries. It provides a simple, intuitive interface for beginners. It also offers language detection, language translation (powered by Google Translate), sentiment analysis, and easy-to-use text classification functionality.
TextBlob is slower than spaCy but faster than NLTK. It does not support some NLP tasks, such as word vectorization and dependency parsing.
Congratulations, you have made it to the end of this tutorial!
In this article, we have learned the basics of the TextBlob library. We have performed various NLP operations such as PoS tagging, noun phrase extraction, sentiment analysis, spelling correction, language detection, language translation, and text classification using Naive Bayes, Decision Tree, and Maximum Entropy classifiers. Of course, this is just the beginning, and there is a lot more that TextBlob has to offer for Python data scientists.