Text is an extremely rich source of information. Each minute, people send hundreds of millions of new emails and text messages. There’s a veritable mountain of text data waiting to be mined for insights. But data scientists who want to glean meaning from all of that text data face a challenge: it is difficult to analyze and process because it exists in unstructured form.
In this tutorial, we’ll take a look at how we can transform all of that unstructured text data into something more useful for analysis and natural language processing, using the helpful Python package spaCy.
Specifically, we’re going to take a high-level look at natural language processing (NLP). Then we’ll work through some of the important basic operations for cleaning and analyzing text data with spaCy.
What is Natural Language Processing?
Natural language processing (NLP) is a branch of machine learning that deals with processing, analyzing, and sometimes generating human speech (“natural language”). There’s no doubt that humans are still much better than machines at determining the meaning of a string of text. But in data science, we’ll often encounter data sets that are far too large to be analyzed by a human in a reasonable amount of time. We may also encounter situations where no human is available to analyze and respond to a piece of text input. In these situations, we can use natural language processing techniques to help machines get some understanding of the text’s meaning (and if necessary, respond accordingly).
For example, natural language processing is widely used in sentiment analysis, since analysts are often trying to determine the overall sentiment from huge volumes of text data that would be time-consuming for humans to comb through. It’s also used in advertisement matching — determining the subject of a body of text and assigning a relevant advertisement automatically. And it’s used in chatbots, voice assistants, and other applications where machines need to understand and quickly respond to input that comes in the form of natural human language.
Analyzing and Processing Text With spaCy
spaCy is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently. First, let’s take a look at some of the basic analytical tasks spaCy can handle.
We’ll need to install spaCy and its English-language model before proceeding further. We can do this using the following command line commands:
pip install spacy
python -m spacy download en_core_web_sm
We can also use spaCy in a Jupyter Notebook. It’s not one of the pre-installed libraries that Jupyter includes by default, though, so we’ll need to run these commands from the notebook to get spaCy installed in the correct Anaconda directory. Note that we use ! in front of each command to let the Jupyter notebook know that it should be read as a command-line command.
!pip install spacy
!python -m spacy download en_core_web_sm
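Before moving on, it can help to confirm that both installs succeeded. The following is a minimal sketch using only the standard library. The helper name is_installed is our own, and it assumes the English model is installed under the package name en_core_web_sm (the name imported later in this tutorial).

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether spaCy and the (assumed) English model package can be found
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", re-running the install commands from the same environment (or notebook kernel) is usually the first thing to try.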
Tokenizing the Text
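Tokenization is the process of breaking text into pieces, called tokens, such as words and punctuation marks (, . “ ‘). spaCy’s tokenizer takes input in the form of Unicode text and outputs a sequence of token objects. As the output below shows, it keeps punctuation marks as separate tokens rather than discarding them, while the single spaces between words are not returned as tokens.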
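Let’s take a look at a simple example. Imagine we have the following text, and we’d like to tokenize it:
When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!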
There are a couple of different ways we can approach this. The first is called word tokenization, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require input in the form of individual words rather than long strings of text.
In the code below, we’ll import spaCy’s English language class and use it to create an nlp object, telling spaCy that we’ll be doing our natural language processing in English. Then we’ll assign our text string to text. Using nlp(text), we’ll process that text in spaCy and assign the result to a variable called my_doc.
At this point, our text has already been tokenized, but spaCy stores tokenized text as a doc, and we’d like to look at it in list form, so we’ll create a for loop that iterates through our doc, adding each word token it finds in our text string to a list called token_list so that we can take a better look at how words are tokenized.
# Word tokenization
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
nlp = English()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)
Output: ['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']
As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as shouldn’t actually represent two distinct words, and it has thus broken them down into two distinct tokens.
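To see why a dedicated tokenizer is worth using, it helps to compare it with two simpler approaches. This is only an illustrative sketch using the standard library, not how spaCy works internally; the regular expression is our own choice.

```python
import re

sentence = "When learning data science, you shouldn't get discouraged!"

# Approach 1: split on whitespace only. Punctuation stays glued to words.
whitespace_tokens = sentence.split()
print(whitespace_tokens)
# ['When', 'learning', 'data', 'science,', 'you', "shouldn't", 'get', 'discouraged!']

# Approach 2: a simple regex that separates runs of word characters from
# individual punctuation marks. It breaks the contraction in the wrong place.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
# ['When', 'learning', 'data', 'science', ',', 'you', 'shouldn', "'", 't', 'get', 'discouraged', '!']
```

Neither approach produces the should / n't split shown in the spaCy output, which relies on language-specific tokenization rules.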
To recap the example above: we first load the English language data by creating an nlp object with the English() class. The nlp object is used to create documents with linguistic annotations and various NLP properties. After creating a document, we build a list of its tokens.
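Each item in a document is a token object rather than a plain string, so it carries attributes we can inspect. The sketch below assumes spaCy is installed and uses spacy.blank("en"), which builds a tokenizer-only English pipeline comparable to English(); the variable name token_info is our own.

```python
import spacy

# spacy.blank("en") builds a tokenizer-only English pipeline,
# comparable to calling English() directly.
nlp = spacy.blank("en")
doc = nlp("You've got this!")

# Collect (text, is_alpha, is_punct) for every token in the Doc
token_info = [(token.text, token.is_alpha, token.is_punct) for token in doc]
for text, is_alpha, is_punct in token_info:
    print(f"{text!r:8} alpha={is_alpha} punct={is_punct}")
```

Attributes such as is_alpha and is_punct are available even without a trained model, because they depend only on the token text.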
If we want, we can also break the text into sentences rather than words. This is called sentence tokenization. When performing sentence tokenization, the tokenizer looks for specific characters that fall between sentences, like periods, exclamation points, and newline characters. For sentence tokenization, we will add a component to the processing pipeline. A full spaCy pipeline includes a tokenizer, a tagger, a parser and an entity recognizer, but the blank English pipeline we create with English() contains only a tokenizer, so we need an extra component to correctly identify what’s a sentence and what isn’t.
In the code below, spaCy tokenizes the text and creates a Doc object. The Doc object is then passed through the components of our processing pipeline. From this pipeline, we can extract any component, but here we’re going to access sentence tokens using the sentencizer component.
# Sentence tokenization
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
nlp = English()

# Add the 'sentencizer' component to the pipeline
nlp.add_pipe('sentencizer')

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

# Create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
Output: ["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]
spaCy has correctly parsed the text into the format we want, this time outputting a list of sentences found in our source text.
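The idea of looking for characters that fall between sentences can be sketched with a regular expression. This is a deliberately naive illustration using only the standard library, not spaCy's implementation; the function name naive_sentence_split is hypothetical.

```python
import re


def naive_sentence_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


sample = ("When learning data science, you shouldn't get discouraged! "
          "Challenges and setbacks aren't failures, they're just part of "
          "the journey. You've got this!")
for sentence in naive_sentence_split(sample):
    print(sentence)

# The rule is easy to fool: abbreviations end with a period too.
print(naive_sentence_split("Dr. Smith teaches NLP. She uses spaCy."))
# ['Dr.', 'Smith teaches NLP.', 'She uses spaCy.']
```

Cases like the abbreviation above are one reason purely punctuation-based rules are often replaced by components that also look at the structure of the sentence.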
Cleaning Text Data: Removing Stopwords
Most text data that we work with is going to contain a lot of words that aren’t actually useful to us. These words, called stopwords, are useful in human speech, but they don’t have much to contribute to data analysis. Removing stopwords helps us eliminate noise and distraction from our text data, and also speeds up the time analysis takes (since there are fewer words to process).
Let’s take a look at the stopwords spaCy includes by default. We’ll import the stopwords from its English-language data, which spaCy exposes as STOP_WORDS, so that we can take a look.
# Stop words
# Importing stop words from the English language data.
from spacy.lang.en.stop_words import STOP_WORDS

# Printing the total number of stop words:
print('Number of stop words: %d' % len(STOP_WORDS))

# Printing first twenty stop words:
print('First 20 stop words: %s' % list(STOP_WORDS)[:20])
Output:
Number of stop words: 312
First 20 stop words: ['was', 'various', 'fifty', "'s", 'used', 'once', 'because', 'himself', 'can', 'name', 'many', 'seems', 'others', 'something', 'anyhow', 'nowhere', 'serious', 'forty', 'he', 'now']
As we can see, spaCy’s default list of stopwords includes 312 total entries, and each entry is a single word. We can also see why many of these words wouldn’t be useful for data analysis. Transition words like nevertheless, for example, aren’t necessary for understanding the basic meaning of a sentence. And other words like somebody are too vague to be of much use for NLP tasks.
If we wanted to, we could also create our own customized list of stopwords. But for our purposes in this tutorial, the default list that spaCy provides will be fine.
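As a rough sketch of what a customized list might look like, the example below treats stopwords as a plain Python set. The small base_stopwords set is a hypothetical stand-in for a real default list, and the words added and removed are our own choices for illustration.

```python
# A tiny stand-in for a default stopword list (hypothetical; spaCy's is far larger)
base_stopwords = {"a", "an", "the", "of", "not", "and", "is"}

# Add domain-specific words we consider noise, and keep "not",
# since negation can matter for tasks such as sentiment analysis.
custom_stopwords = (base_stopwords | {"data", "science"}) - {"not"}

words = ["the", "data", "is", "not", "clean"]
kept = [word for word in words if word not in custom_stopwords]
print(kept)
# ['not', 'clean']
```

Whether a word counts as noise depends on the task, which is why a one-size-fits-all list is not always the best choice.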
Removing Stopwords from Our Data
Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section. Our text is already stored in the variable text, so we don’t need to define that again.
Instead, we’ll create an empty list called filtered_tokens and then iterate through our doc variable to look at each tokenized word from our source text. spaCy includes a bunch of helpful token attributes, and we’ll use one of them called is_stop to identify words that aren’t in the stopword list, along with is_punct to skip punctuation, and then append them to our filtered_tokens list.
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
nlp = English()

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

filtered_tokens = []

# Filtering stop words and punctuation
for word in doc:
    if not word.is_stop and not word.is_punct:
        filtered_tokens.append(word)
print("Filtered Sentence:", filtered_tokens)
Output: Filtered Sentence: [learning, data, science, discouraged, Challenges, setbacks, failures, journey, got]
It’s not too difficult to see why removing stopwords can be helpful. Doing so has boiled our original text down to just a few words that give us a good idea of what the sentences are discussing: learning data science, and discouraging challenges and setbacks along that journey.
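The same filtering can be written more compactly as a list comprehension. This sketch assumes spaCy is installed and uses a tokenizer-only pipeline created with spacy.blank("en"). It also skips whitespace tokens with is_space and lowercases the result, which are our own additions rather than part of the example above.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only English pipeline
doc = nlp("Challenges and setbacks aren't failures,\nthey're just part of the journey.")

# Keep tokens that are not stopwords, punctuation, or whitespace,
# and lowercase the text of the survivors.
content_words = [
    token.text.lower()
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(content_words)
```

Storing plain lowercase strings instead of token objects is convenient when the next step is counting words or building features.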
Lexicon normalization is another step in the text data cleaning process. In the big picture, normalization reduces the number of distinct word forms in the data, converting high-dimensional features into lower-dimensional features that are better suited to a machine learning model. For our purposes here, we’re only going to look at lemmatization, a way of processing words that reduces them to their roots.
Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.
One method for doing this is called stemming. Stemming involves simply lopping off easily-identified prefixes and suffixes to produce what’s often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which actually looks at words and their roots (called lemmas) as described in the dictionary, is more precise (as long as the words exist in the dictionary).
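To make the idea of lopping off suffixes concrete, here is a deliberately crude stemmer. It is a sketch for illustration only, not an established stemming algorithm; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
# Suffixes to strip, checked longest first (an arbitrary, illustrative list)
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ["connect", "connection", "connecting", "connected", "running"]:
    print(word, "==>", naive_stem(word))
# connect ==> connect
# connection ==> connect
# connecting ==> connect
# connected ==> connect
# running ==> runn
```

The last line shows the weakness of blind suffix stripping: running becomes runn, which is not a word. A dictionary-based lemmatizer avoids this kind of result.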
spaCy includes a built-in way to break a word down into its lemma, so we can simply use that for lemmatization. In the following very simple example, we’ll use .lemma_ to produce the lemma for each word we’re analyzing.
# Importing the English model en_core_web_sm for vocabulary, syntax & entities
import en_core_web_sm

# Load en_core_web_sm
nlp = en_core_web_sm.load()

# Implementing lemmatization
lem = nlp("run runs running runner")

# Finding the lemma for each word
for word in lem:
    print(word.text, "==>", word.lemma_)
Output:
run ==> run
runs ==> run
running ==> run
runner ==> runner
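Conceptually, the simplest form of lemmatization is a dictionary lookup. The sketch below uses a hypothetical five-entry table to illustrate the idea; it is not how spaCy's lemmatizer is implemented, and real lemmatizers rely on much larger resources.

```python
# Hypothetical miniature lemma dictionary; real lemmatizers use far larger
# resources plus part-of-speech information.
LEMMA_TABLE = {
    "runs": "run",
    "running": "run",
    "ran": "run",
    "got": "get",
    "challenges": "challenge",
}


def lookup_lemma(word):
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


for word in ["run", "runs", "running", "ran", "got", "runner"]:
    print(word, "==>", lookup_lemma(word))
# run ==> run
# runs ==> run
# running ==> run
# ran ==> run
# got ==> get
# runner ==> runner
```

Irregular forms such as ran and got are where a lookup has a clear advantage over suffix stripping, since no suffix rule turns got into get.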
We can apply the same approach to the text we filtered earlier, removing stopwords and punctuation first and then lemmatizing the tokens that remain:
# Importing the English model en_core_web_sm for vocabulary, syntax & entities
import en_core_web_sm

# Load en_core_web_sm
nlp = en_core_web_sm.load()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

filtered_tokens = []

# Filtering stop words and punctuation
for word in doc:
    if not word.is_stop and not word.is_punct:
        filtered_tokens.append(word)
print("Filtered Tokens:", filtered_tokens)

# Replacing each remaining token with its lemma
normalized_tokens = []
for token in filtered_tokens:
    normalized_tokens.append(token.lemma_)
print("Lemmatized Tokens:", normalized_tokens)
Output:
Filtered Tokens: [learning, data, science, discouraged, Challenges, setbacks, failures, journey, got]
Lemmatized Tokens: ['learn', 'data', 'science', 'discourage', 'challenge', 'setback', 'failure', 'journey', 'get']
Part of Speech (POS) Tagging
A word’s part of speech defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes the action. Identifying and tagging each word’s part of speech in the context of a sentence is called Part-of-Speech Tagging, or POS Tagging.
Let’s try some POS tagging with spaCy! We’ll need to import its en_core_web_sm model, because that contains the dictionary and grammatical information required to do this analysis. Then all we need to do is load this model with .load() and loop through our new docs variable, identifying the part of speech for each word using .pos_.
(Note that the u in u"All is well that ends well." signifies that the string is a Unicode string. In Python 3, all strings are Unicode, so the prefix is optional.)
# POS tagging
# Importing the English model en_core_web_sm for vocabulary, syntax & entities
import en_core_web_sm

# Load en_core_web_sm
nlp = en_core_web_sm.load()

# The "nlp" object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text, word.pos_)
Output:
All DET
is VERB
well ADV
that DET
ends VERB
well ADV
. PUNCT
spaCy has correctly identified the part of speech for each word in this sentence. Being able to identify parts of speech is useful in a variety of NLP-related contexts because it helps more accurately understand input sentences and more accurately construct output responses.
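Once words are tagged, a common next step is to group or filter them by part of speech. The sketch below works on the (word, tag) pairs copied from the output above, so it runs without spaCy; the variable names are our own.

```python
from collections import defaultdict

# (word, tag) pairs copied from the output above
tagged_words = [
    ("All", "DET"), ("is", "VERB"), ("well", "ADV"), ("that", "DET"),
    ("ends", "VERB"), ("well", "ADV"), (".", "PUNCT"),
]

# Group the words under their part-of-speech tag
words_by_tag = defaultdict(list)
for word, tag in tagged_words:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(tag, words)
# DET ['All', 'that']
# VERB ['is', 'ends']
# ADV ['well', 'well']
# PUNCT ['.']
```

With a spaCy document, the same grouping could be built by looping over the tokens and reading word.text and word.pos_ instead of using a hard-coded list.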
Resources and Next Steps
Over the course of this tutorial, we’ve performed some very simple text analysis operations with spaCy. Of course, this is just the beginning, and there’s a lot more to explore in spaCy. In the next tutorial, we will cover other important NLP topics such as entity recognition, dependency parsing, and word vector representation.
Text Analytics for Beginners using Python spaCy Part-2
This article was originally published at https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
Reach out to me on Linkedin: https://www.linkedin.com/in/avinash-navlani/