Text Analytics for Beginners using Python TextBlob

TextBlob is a Python library for text analytics and natural language processing (NLP) operations such as PoS tagging, noun phrase extraction, sentiment analysis, parsing, and text classification.

TextBlob is easy for beginners to learn and code with. It is built on top of NLTK and Pattern, and it wraps them in a simpler interface with a few extra conveniences. NLP operations such as noun phrase extraction, sentiment analysis, and spell correction are easier to perform with TextBlob than with plain NLTK.

In this tutorial, we will focus on the TextBlob library. We will perform tokenization, noun phrase extraction, sentiment analysis, spell correction, translation, and text classification using TextBlob. If you want to learn spaCy and NLTK, you can check our spaCy and NLTK articles.

In this tutorial, we are going to cover the following topics:

Installing TextBlob
Tokenization
Noun Phrases
Part of Speech (PoS) Tagging
Lemmatization
n-grams
Sentiment Analysis
Spell Correction
Language Detection and Translation
Text Classification using TextBlob

Installing TextBlob

We need to install TextBlob before proceeding further. We can do this from the command line:

pip install textblob

You can also install TextBlob in a Jupyter Notebook by putting ! in front of the command, which tells the notebook to run it as a command-line command.

!pip install textblob

TextBlob also depends on NLTK corpora (used by its tokenizers, taggers, and noun phrase extractors), which can be downloaded with:

python -m textblob.download_corpora

Tokenization

Tokenization is the process of splitting a text document into small pieces, known as tokens. Word tokenization drops punctuation and whitespace from the text document. Let's see a word tokenization example in the below code:

# Import TextBlob
from textblob import TextBlob

# Create TextBlob object
text = TextBlob("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could. I tried to be the best role model that I possibly could.")

# Print the tokens
print(text.words)

Output:

['I', 'want', 'to', 'be', 'remembered', 'not', 'only', 'as', 'an', 'entertainer', 'but', 'as', 'a', 'person', 'who', 'cared', 'a', 'lot', 'and', 'I', 'gave', 'the', 'best', 'that', 'I', 'could', 'I', 'tried', 'to', 'be', 'the', 'best', 'role', 'model', 'that', 'I', 'possibly', 'could']
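Under the hood, TextBlob delegates word tokenization to NLTK. To build intuition, here is a rough pure-Python approximation of what the words property returns; the regex below is a deliberate simplification, not TextBlob's actual tokenizer:

```python
import re

def simple_word_tokenize(text):
    # Keep runs of letters and apostrophes; drop punctuation and spaces
    return re.findall(r"[A-Za-z']+", text)

print(simple_word_tokenize("I tried to be the best role model that I possibly could."))
# ['I', 'tried', 'to', 'be', 'the', 'best', 'role', 'model', 'that', 'I', 'possibly', 'could']
```

Notice that the trailing period is dropped, just as in the TextBlob output above.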

Let's try a sentence tokenization example in the below code cell:

# Print the tokenized sentences
print(text.sentences)

Output: [Sentence("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could."), Sentence("I tried to be the best role model that I possibly could.")]

Noun Phrases

A noun phrase is a group of words built around a noun. It can act as the subject or object of a sentence. Let's see an example in the below code:

# Print noun phrases
print(text.noun_phrases)

Output: ['role model']

Part of Speech (POS) Tagging

Part of speech (PoS) describes the grammatical function of each word in a sentence. For example, a verb expresses an action, while a noun or an adjective identifies an object. Assigning such labels to the words in a text is called PoS tagging. Let's see an example in the below code:

#Print PoS tags
print(text.tags)

Output: [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('be', 'VB'), ('remembered', 'VBN'), ('not', 'RB'), ('only', 'RB'), ('as', 'IN'), ('an', 'DT'), ('entertainer', 'NN'), ('but', 'CC'), ('as', 'IN'), ('a', 'DT'), ('person', 'NN'), ('who', 'WP'), ('cared', 'VBD'), ('a', 'DT'), ('lot', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('gave', 'VBD'), ('the', 'DT'), ('best', 'JJS'), ('that', 'IN'), ('I', 'PRP'), ('could', 'MD'), ('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('best', 'JJS'), ('role', 'NN'), ('model', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('possibly', 'RB'), ('could', 'MD')]
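Since tags is a plain list of (word, tag) tuples, you can post-process it with standard Python. For example, counting tag frequencies with collections.Counter; the short list below is hand-copied from the output above rather than computed by TextBlob:

```python
from collections import Counter

# A few (word, tag) pairs, as returned by text.tags
tags = [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('be', 'VB'),
        ('remembered', 'VBN'), ('not', 'RB'), ('only', 'RB')]

# Frequency of each PoS tag
tag_counts = Counter(tag for word, tag in tags)
print(tag_counts.most_common(1))
# [('RB', 2)]
```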

Lemmatization

Lemmatization is the process of normalizing a word linguistically: it maps a given word to its root form (lemma) using a vocabulary and morphological analysis. Let's see an example in the below code:

# Lemmatize 'cared' (word 15) as a verb
print(text.words[15].lemmatize("v"))

Output: care

# Import word
from textblob import Word

# Create Word object
w = Word("remembered")

# Print lemmatized word
print(w.lemmatize("v")) 

Output: remember
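For contrast, a naive suffix-stripping stemmer written in plain Python shows why lemmatization needs a vocabulary: blind suffix removal mangles words like "cared", while a lemmatizer correctly returns "care". This stemmer is an illustration only, not anything TextBlob uses:

```python
def naive_stem(word):
    # Crude suffix stripping -- this is stemming, NOT lemmatization
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(naive_stem('cared'))       # 'car'  -- wrong; a lemmatizer gives 'care'
print(naive_stem('remembered'))  # 'remember'
```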

Finding a word and counting its occurrence

TextBlob has a find() function for locating a substring within the text and a count() function for counting the occurrences of a word. Let's see a count() example in the below code:

# Count the number of times 'I' appears
print(text.words.count('I'))

Output: 5

n-grams

An n-gram is a contiguous sequence of n tokens drawn from a text. n-grams are commonly used as features in language models and text classifiers, and TextBlob exposes them through the ngrams() method.
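Conceptually, a list of n-grams is just a sliding window over the token list. A minimal pure-Python sketch of the idea (TextBlob itself provides this via text.ngrams(n=...)):

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

words = ['the', 'best', 'role', 'model']
print(ngrams(words, 3))
# [['the', 'best', 'role'], ['best', 'role', 'model']]
```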

Sentiment Analysis

In TextBlob, the sentiment property returns two scores (polarity and subjectivity) as a namedtuple. The polarity score lies between -1 and +1: negative values indicate negative sentiment or opinion, while positive values indicate positive sentiment. Subjectivity ranges between 0 and 1, where 0 means an objective statement and 1 means a subjective opinion.

TextBlob offers two implementations of sentiment analysis: the default, based on the Pattern library, and another based on an NLTK classifier trained on a movie reviews corpus. Let's see an example in the below code:

# Print the polarity and subjectivity
print(text.sentiment)

Output: Sentiment(polarity=0.5, subjectivity=0.65)
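The default Pattern-based implementation works, roughly, by averaging per-word polarity scores from a built-in lexicon (with extra handling for negations and intensifiers). A toy illustration of the averaging idea; the mini lexicon and its scores below are made up for this sketch, not Pattern's actual values:

```python
# Hypothetical mini polarity lexicon (Pattern's real lexicon is far larger)
LEXICON = {'best': 1.0, 'good': 0.7, 'horrible': -1.0}

def toy_polarity(words):
    # Average the polarity of the words found in the lexicon
    scores = [LEXICON[w.lower()] for w in words if w.lower() in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity(['the', 'best', 'role', 'model']))  # 1.0
```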

Spell Correction

TextBlob offers spell correction using the correct() function. Let’s see an example in the below code:

# Create TextBlob object
b = TextBlob("I havv goood speling!")
print(b.correct())

Output: I have good spelling!
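TextBlob's spelling correction is based on Peter Norvig's classic approach (as implemented in the Pattern library): generate all candidate strings within a small edit distance of the misspelled word, then pick the candidate that is most frequent in a reference corpus. A minimal sketch of the candidate-generation step, covering deletions, replacements, and insertions only (the full version also includes transpositions):

```python
import string

def edits1(word):
    # All strings exactly one edit away from `word`
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + replaces + inserts)

print('good' in edits1('goood'))  # True: one deletion away
print('have' in edits1('havv'))   # True: one replacement away
```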

Language Detection and Translation

TextBlob offers the detect_language() function for detecting a language and translate() for translating text from one language to another. Both use the Google Translate API, so they require an internet connection. Note that these features were deprecated in newer TextBlob releases (0.16+), so the examples below may require an older TextBlob version or the official Google Translate API directly.

# Create TextBlob object
text = TextBlob("नमस्ते, आप कैसे हैं?")

# Detect Language
print(text.detect_language())

# Translate into English
print(text.translate(to='en'))

Output: hi
Hello. How are you?

Text Classification using TextBlob

In this section, we will focus on text classification, one of the most important NLP techniques. Text classification powers applications such as document classification, sentiment classification, review rating prediction, spam filtering, support ticket routing, news classification, and fake news detection.

Prepare Dataset

In this section, our main objective is to prepare the dataset. Let's prepare the data as a list of (sentence, label) tuples:

train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]

test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

Train Model

In this section, we are going to create a Naive Bayes classifier using TextBlob. Let's create the NaiveBayesClassifier and train the model:

# Import NaiveBayes Classifier
from textblob.classifiers import NaiveBayesClassifier

# Perform model training
cl = NaiveBayesClassifier(train) 
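TextBlob's NaiveBayesClassifier is a thin wrapper around NLTK's Naive Bayes classifier, which by default extracts word-presence features from each sentence. To make the underlying idea concrete, here is a stripped-down, pure-Python Naive Bayes over word counts with add-one (Laplace) smoothing; it illustrates the principle only and is not TextBlob's actual implementation:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def train_nb(data):
    # Per-label word counts, per-label document counts, and overall vocabulary
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify_nb(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label in label_counts:
        # Log prior plus log likelihood with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_train = [('I love this sandwich.', 'pos'),
             ('this is an amazing place!', 'pos'),
             ('I do not like this restaurant', 'neg'),
             ('my boss is horrible.', 'neg')]
model = train_nb(toy_train)
print(classify_nb(model, 'an amazing sandwich'))  # pos
```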

Make Prediction

Let's make a prediction on a given input sentence in the below code:

# Make prediction
print(cl.classify("This is an amazing library!"))

Output: pos

print(cl.classify("Gary is a friend of mine."))

Output: neg

Evaluate Model

Let’s evaluate the model performance using the accuracy method:

# Evaluate the model
cl.accuracy(test) 

Output: 0.8333333333333334

In the above code, we have assessed the performance using the accuracy measure, and the model achieves 83.33% accuracy (5 of the 6 test sentences classified correctly).

Retraining Model

Let's retrain the model using the update() method. First, we will prepare the new dataset and then update the previously trained model.

# Prepare new data
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

# Update model with new data
cl.update(new_data)

# Test the model 
cl.classify("Gary is a friend of mine.")

Output: pos

Decision Tree Classifier

Let's train a model using TextBlob's DecisionTreeClassifier and evaluate its performance using the accuracy method.

# Import Decision Tree
from textblob.classifiers import DecisionTreeClassifier

# Create the Decision Tree Classifier and train it
dt = DecisionTreeClassifier(train)

# Evaluate the model
dt.accuracy(test)

Output: 0.8333333333333334

Pros and Cons

TextBlob is built on top of the NLTK and Pattern libraries. It provides a simple, intuitive interface for beginners, and it offers language detection, language translation (powered by Google Translate), sentiment analysis, and easy-to-use text classification functionality.

TextBlob is slower than spaCy but faster than NLTK. It does not offer some NLP capabilities, such as word vectors and dependency parsing.

Summary

Congratulations, you have made it to the end of this tutorial!

In this article, we have learned the basics of the TextBlob library. We have performed various NLP operations such as PoS tagging, noun phrase extraction, sentiment analysis, spell correction, language detection, language translation, and text classification using Naive Bayes and Decision Tree classifiers. Of course, this is just the beginning, and there's a lot more that TextBlob has to offer Python data scientists.
