Text Analytics for Beginners using Python spaCy Part-1

September 24, 2020 Avinash Navlani

Learn the basics of spaCy, one of the most powerful NLP libraries.

Text is an extremely rich source of information. Each minute, people send hundreds of millions of new emails and text messages. There’s a veritable mountain of text data waiting to be mined for insights. But data scientists who want to glean meaning from all of that text data face a challenge: it is difficult to analyze and process because it exists in unstructured form.

In this tutorial, we’ll take a look at how we can transform all of that unstructured text data into something more useful for analysis and natural language processing, using the helpful Python package spaCy (documentation).

Specifically, we’re going to take a high-level look at natural language processing (NLP). Then we’ll work through some of the important basic operations for cleaning and analyzing text data with spaCy.

In this tutorial, we are going to cover the following topics:


1 What is Natural Language Processing?

2 Analyzing and Processing Text With spaCy

2.1 Installing spaCy

3 Tokenizing the Text

4 Cleaning Text Data: Removing Stopwords

4.1 Removing Stopwords from Our Data

5 Lexicon Normalization

5.1 Lemmatization

6 Part of Speech (POS) Tagging

7 Resources and Next Steps

What is Natural Language Processing?

Natural language processing (NLP) is a branch of machine learning that deals with processing, analyzing, and sometimes generating human speech (“natural language”). There’s no doubt that humans are still much better than machines at determining the meaning of a string of text. But in data science, we’ll often encounter data sets that are far too large to be analyzed by a human in a reasonable amount of time. We may also encounter situations where no human is available to analyze and respond to a piece of text input. In these situations, we can use natural language processing techniques to help machines get some understanding of the text’s meaning (and if necessary, respond accordingly).

For example, natural language processing is widely used in sentiment analysis, since analysts are often trying to determine the overall sentiment from huge volumes of text data that would be time-consuming for humans to comb through. It’s also used in advertisement matching — determining the subject of a body of text and assigning a relevant advertisement automatically. And it’s used in chatbots, voice assistants, and other applications where machines need to understand and quickly respond to input that comes in the form of natural human language.

Analyzing and Processing Text With spaCy

spaCy is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently. First, let’s take a look at some of the basic analytical tasks spaCy can handle.

Installing spaCy

We’ll need to install spaCy and its English-language model before proceeding further. We can do this by running the following commands from the command line:

pip install spacy

python -m spacy download en_core_web_sm

We can also use spaCy in a Jupyter Notebook. It’s not one of the libraries that Jupyter includes by default, though, so we’ll need to run these commands from the notebook to install spaCy into the environment the notebook is running in. Note that we use ! in front of each command to let the Jupyter notebook know that it should be run as a command-line command.

!pip install spacy
!python -m spacy download en_core_web_sm
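Before moving on, it can be handy to confirm that both packages are visible to the Python interpreter we are using. The snippet below is a minimal sketch that relies only on the standard library; is_installed is a hypothetical helper name introduced here for illustration, and it assumes the model was installed as the importable package en_core_web_sm.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report whether spaCy and the small English model are available.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerunning the install commands from the same environment (or the same notebook) is usually the first thing to try.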

Tokenizing the Text

Tokenization is the process of breaking text into pieces called tokens, such as words, punctuation marks (, . “ ‘), and other symbols. Single spaces between words are not emitted as separate tokens. spaCy’s tokenizer takes Unicode text as input and outputs a sequence of Token objects.

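Let’s take a look at a simple example. Imagine we have the following text, and we’d like to tokenize it:

When learning data science, you shouldn't get discouraged! Challenges and setbacks aren't failures, they're just part of the journey. You've got this!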

There are a couple of different ways we can approach this. The first is called word tokenization, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require input in the form of individual words rather than long strings of text.

In the code below, we’ll import spaCy’s English language class and create a blank English pipeline from it, which we’ll use for our natural language processing. Then we’ll assign our text string to text. Using nlp(text), we’ll process that text in spaCy and assign the result to a variable called my_doc.

At this point, our text has already been tokenized, but spaCy stores tokenized text as a Doc object, and we’d like to look at it in list form. So we’ll create a for loop that iterates through our doc, adding each token it finds to a list called token_list so that we can take a better look at how the text is tokenized.

# Word tokenization
from spacy.lang.en import English

# Create a blank English pipeline (tokenizer only, no trained components)
nlp = English()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []

for token in my_doc:
    token_list.append(token.text)

print(token_list)

Output:
['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']

As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as shouldn’t actually represent two distinct words, and it has thus broken them down into two distinct tokens.

To recap the example above: we first create a language object for English using the English() class and assign it to nlp. The nlp object is used to create documents with linguistic annotations and various NLP properties. After creating a document, we build a list of its tokens.
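Since punctuation marks come back as tokens of their own, we may sometimes want only the word-like tokens. The following is a small sketch of one way to do that with the token attributes is_punct and is_space; the sample sentence and the variable name words are our own choices for illustration.

```python
from spacy.lang.en import English

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = English()

doc = nlp("Challenges aren't failures, they're part of the journey!")

# Keep only tokens that are neither punctuation nor whitespace.
words = [token.text for token in doc if not token.is_punct and not token.is_space]

print(words)
```

Contraction pieces such as n't are kept, because they contain letters and are therefore not treated as punctuation.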

If we want, we can also break the text into sentences rather than words. This is called sentence tokenization. When performing sentence tokenization, spaCy looks for specific characters that mark the end of a sentence, like periods, exclamation points, and question marks. spaCy’s trained pipelines can find sentence boundaries with the dependency parser, but a blank English pipeline contains only a tokenizer, so here we will add a lightweight rule-based component called the sentencizer.

In the code below, spaCy tokenizes the text and creates a Doc object. The sentencizer component we add to the pipeline then marks the sentence boundaries, and we can read the resulting sentences from doc.sents.

# Sentence tokenization
from spacy.lang.en import English

# Create a blank English pipeline (tokenizer only)
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
nlp.add_pipe('sentencizer')

text = """When learning data science, you shouldn't get discouraged! Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

# Create list of sentence tokens
sents_list = []

for sent in doc.sents:
    sents_list.append(sent.text)

print(sents_list)

Output:
["When learning data science, you shouldn't get discouraged!", "Challenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]

Again, spaCy has correctly parsed the text into the format we want, this time outputting a list of sentences found in our source text.
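To build some intuition for rule-based sentence splitting, here is a toy sketch in plain Python that splits after sentence-ending punctuation followed by whitespace. It is not how spaCy implements the sentencizer; naive_sentence_split is a hypothetical helper written only to illustrate the idea and its limits.

```python
import re
from typing import List


def naive_sentence_split(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = (
    "When learning data science, you shouldn't get discouraged! "
    "Challenges and setbacks aren't failures, they're just part of the journey. "
    "You've got this!"
)

sentences = naive_sentence_split(text)
for sentence in sentences:
    print(sentence)
```

A rule this simple breaks on abbreviations: naive_sentence_split("Dr. Smith arrived.") returns two pieces instead of one sentence, which is one reason to rely on a tested library component rather than a hand-written pattern.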

Cleaning Text Data: Removing Stopwords

Most text data that we work with is going to contain a lot of words that aren’t actually useful to us. These words, called stopwords, are useful in human speech, but they don’t have much to contribute to data analysis. Removing stopwords helps us eliminate noise and distraction from our text data, and also reduces the time analysis takes (since there are fewer words to process).

Let’s take a look at the stopwords spaCy includes by default. We’ll import the STOP_WORDS set from spaCy’s English language data so that we can take a look.

#Stop words

#importing stop words from English language.
from spacy.lang.en.stop_words import STOP_WORDS

#Printing the total number of stop words:
print('Number of stop words: %d' % len(STOP_WORDS))

#Printing first twenty stop words:
print('First 20 stop words: %s' % list(STOP_WORDS)[:20])

Output:
Number of stop words: 312

First 20 stop words: ['was', 'various', 'fifty', "'s", 'used', 'once', 'because', 'himself', 'can', 'name', 'many', 'seems', 'others', 'something', 'anyhow', 'nowhere', 'serious', 'forty', 'he', 'now']

As we can see, the default list of stopwords in the spaCy version used here includes 312 entries (the exact number varies between releases), and each entry is a single word. Because STOP_WORDS is a set, the 20 words printed are an arbitrary sample rather than the start of an ordered list. We can also see why many of these words wouldn’t be useful for data analysis. Transition words like nevertheless, for example, aren’t necessary for understanding the basic meaning of a sentence. And other words like somebody are too vague to be of much use for NLP tasks.

If we wanted to, we could also create our own customized list of stopwords. But for our purposes in this tutorial, the default list that spaCy provides will be fine.
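As a rough illustration of what a customized list could look like, the sketch below copies the default set and adjusts it with ordinary set operations. The choices here are assumptions made for the example only: we add journey as a domain-specific noise word and make sure not is kept, since negation can matter for tasks such as sentiment analysis.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Copy the default set, add one hypothetical noise word, and make sure "not" is kept.
custom_stopwords = (set(STOP_WORDS) | {"journey"}) - {"not"}

words = ["this", "is", "not", "the", "end", "of", "the", "journey"]

# Filter against our own set instead of the default list.
kept = [word for word in words if word.lower() not in custom_stopwords]

print(kept)
```

Working on a copy leaves the default STOP_WORDS set untouched, so the rest of the tutorial behaves the same way.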

Removing Stopwords from Our Data

Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section. Our text is already stored in the variable text, so we don’t need to define that again.

Instead, we’ll create an empty list called filtered_tokens and then iterate through our doc variable to look at each token from our source text. spaCy includes a bunch of helpful token attributes, and we’ll use two of them, is_stop and is_punct, to identify tokens that are neither stopwords nor punctuation and append them to our filtered_tokens list.

from spacy.lang.en import English

# Create a blank English pipeline (tokenizer only)
nlp = English()

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

filtered_tokens = []

# Keep only tokens that are neither stop words nor punctuation
for word in doc:
    if not word.is_stop and not word.is_punct:
        filtered_tokens.append(word)

print("Filtered Sentence:", filtered_tokens)

Output:
Filtered Sentence: [learning, data, science, discouraged, Challenges, setbacks, failures, journey, got]

It’s not too difficult to see why removing stopwords can be helpful. Doing so has boiled our original text down to just a few words that give us a good idea of what the sentences are discussing: learning data science, and discouraging challenges and setbacks along that journey.

Lexicon Normalization

Lexicon normalization is another step in the text data cleaning process. In the big picture, normalization maps different surface forms of the same word to a single form, which shrinks the vocabulary and so reduces the number of features a machine learning model has to handle. For our purposes here, we’re only going to look at lemmatization, a way of processing words that reduces them to their roots.

Lemmatization

Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

One method for doing this is called stemming. Stemming involves simply lopping off easily-identified prefixes and suffixes to produce what’s often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which actually looks up words and their roots (called lemmas) as described in the dictionary, is more precise (as long as the words exist in the dictionary).
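To make the contrast concrete, here is a deliberately crude suffix-stripping stemmer in plain Python. It is a toy written for this explanation, not a real stemming algorithm such as Porter’s, and the suffix list and the naive_stem name are assumptions of the sketch.

```python
# Suffixes to strip, checked in this order (a toy list, not a real stemmer's rules).
SUFFIXES = ("ing", "ion", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ("connect", "connection", "connecting", "connected", "running"):
    print(word, "==>", naive_stem(word))
```

All four forms of connect collapse to connect, but running becomes runn, which is not a word. That kind of error is what a dictionary-based lemmatizer is meant to avoid.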

Since spaCy includes a built-in way to break a word down into its lemma, we can simply use that for lemmatization. In the following very simple example, we’ll use .lemma_ to produce the lemma for each word we’re analyzing.

# importing the en_core_web_sm English model for vocabulary, syntax & entities
import en_core_web_sm

# load the en_core_web_sm English model
nlp = en_core_web_sm.load()

# Implementing lemmatization
lem = nlp("run runs running runner")

# finding lemma for each word
for word in lem:
    print(word.text,"==>" ,word.lemma_)

Output:
run ==> run 
runs ==> run 
running ==> run 
runner ==> runner

We can also combine lemmatization with the stopword filtering from the previous section, so that each remaining token is reduced to its lemma:

# importing the en_core_web_sm English model for vocabulary, syntax & entities
import en_core_web_sm

# load the en_core_web_sm English model
nlp = en_core_web_sm.load()

text = """When learning data science, you shouldn't get discouraged! Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

filtered_tokens = []

# Keep only tokens that are neither stop words nor punctuation
for word in doc:
    if not word.is_stop and not word.is_punct:
        filtered_tokens.append(word)

print("Filtered Tokens:", filtered_tokens)

# Replace each remaining token with its lemma
normalized_tokens = []
for token in filtered_tokens:
    normalized_tokens.append(token.lemma_)

print("Lemmatized Tokens:", normalized_tokens)

Output: 
Filtered Tokens: [learning, data, science, discouraged, Challenges, setbacks, failures, journey, got]
Lemmatized Tokens: ['learn', 'data', 'science', 'discourage', 'challenge', 'setback', 'failure', 'journey', 'get']

Part of Speech (POS) Tagging

A word’s part of speech defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes the action. Identifying and tagging each word’s part of speech in the context of a sentence is called Part-of-Speech Tagging, or POS Tagging.

Let’s try some POS tagging with spaCy! We’ll need to import its en_core_web_sm model, because that contains the dictionary and grammatical information required to do this analysis. Then all we need to do is load this model with .load() and loop through our new docs variable, identifying the part of speech for each word using .pos_.

(Note: the u in u"All is well that ends well." marks the string as a Unicode string. In Python 3 every string is already Unicode, so the prefix is optional.)

# POS tagging

# importing the en_core_web_sm English model for vocabulary, syntax & entities
import en_core_web_sm

# load the en_core_web_sm English model
nlp = en_core_web_sm.load()

# The "nlp" object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text,word.pos_)

Output:
All DET
is VERB
well ADV
that DET
ends VERB
well ADV
. PUNCT

Hooray! spaCy has assigned a part-of-speech tag to each word in this sentence. The exact tags depend on the version of the model; newer releases of en_core_web_sm may, for example, label is as AUX (auxiliary) rather than VERB. Being able to identify parts of speech is useful in a variety of NLP-related contexts because it helps us more accurately understand input sentences and more accurately construct output responses.
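The short tag names in the output can be hard to remember. As a small sketch, spaCy’s explain function returns a plain-English description for a tag; the tuple of tags and the descriptions dictionary below are just our own names for this example, and no trained model is needed to run it.

```python
import spacy

# Tags that appear in the POS tagging output above.
tags = ("DET", "VERB", "ADV", "PUNCT")

# Look up a human-readable description for each tag.
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag:6} {description}")
```

This can be a convenient way to label results when sharing POS output with readers who don’t know the tag set.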

Resources and Next Steps

Over the course of this tutorial, we’ve performed some very simple text analysis operations with spaCy: tokenization, stopword removal, lemmatization, and POS tagging. Of course, this is just the beginning, and there’s a lot more in spaCy. In the next tutorial, we will see other important topics of NLP such as entity recognition, dependency parsing, and word vector representation.

Text Analytics for Beginners using Python spaCy Part-2

This article was originally published at https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

Reach out to me on Linkedin: https://www.linkedin.com/in/avinash-navlani/