Basic NLP Preprocessing | spaCy

In this article, we’ll walk through some common NLP preprocessing techniques for working with text data.

As you may have guessed from the title, we’ll use spaCy for most of the tasks in this article. If you don’t have it installed, see the spaCy installation instructions to get spaCy on your computer.

Tokenization:

When working with text data, the first thing we have to do is divide the text into a list of tokens. This is called tokenization. Tokens can be sentences, words, numbers or punctuation marks.

Performing tokenization using spaCy is very straightforward.

import spacy

# spaCy v3 removed the 'en' shortcut, so load the model by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp('I am going to Chennai tomorrow.')
print([token.text for token in doc])

['I', 'am', 'going', 'to', 'Chennai', 'tomorrow', '.']

You might think that tokenizing is just splitting the text on white space or on punctuation marks like ‘.’ or ‘,’. However, it is more complex than that.

Let’s try to tokenize the sentence ‘I have been to the U.K., U.S.A. and France.’

doc = nlp('I have been to the U.K., U.S.A. and France.')
print([token.text for token in doc])

['I', 'have', 'been', 'to', 'the', 'U.K.', ',', 'U.S.A.', 'and', 'France', '.']

As you can see from the result, the tokenizer treats ‘U.K.’ and ‘U.S.A.’ each as a single token instead of splitting them into pieces such as ‘U’, ‘.’ and ‘K’.

It also recognizes that the period following ‘France’ marks the end of the sentence and should be treated as a separate token.
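For contrast, here is a minimal sketch of the naive approach described above, using only Python’s standard library. The helper name naive_tokenize and its regular expression are assumptions made for illustration; this is not how spaCy’s tokenizer works internally.

```python
import re


def naive_tokenize(text):
    # Hypothetical helper: match runs of word characters,
    # or any single character that is neither a word character nor white space.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = naive_tokenize("I have been to the U.K., U.S.A. and France.")
print(tokens)
```

With this rule the abbreviations fall apart into single letters and periods, which is the behaviour a purpose-built tokenizer avoids.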

Stemming and Lemmatization:

Stemming and lemmatization are two different but closely related methods for converting a word to its root or base form.

The difference is that stemming is rule-based: it trims or rewrites suffixes according to fixed rules, without checking whether the result is a real word. Lemmatization reduces a word to its canonical dictionary form, called a lemma.

As a result, lemmatization typically produces a more meaningful base form than stemming.

Since spaCy doesn’t provide a stemmer, we’ll use NLTK to perform stemming.

NLTK provides several well-known stemmers, such as Lancaster, Porter, and Snowball. The Porter stemmer works well in many cases, so we’ll use it to extract stems from a few words.

To learn more about the rules of Porter Stemming visit this link.

from nltk.stem import PorterStemmer

pst = PorterStemmer()
words = ['went', 'gone', 'going', 'waiting']

# Print each word next to its Porter stem
for w in words:
    print(w, '\t', pst.stem(w))

went      went
gone      gone
going     go
waiting   wait

Since stemming typically removes the last few characters of a word, it sometimes produces results that are not real words. Take a look at the following example.

words = ['goes', 'troubled']

for w in words:
    print(w, '\t', pst.stem(w))

goes       goe
troubled   troubl
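To make the rule-based nature of stemming concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm, which applies many more rules in several phases; the names SUFFIXES and toy_stem and the three-rule list are assumptions made purely for illustration.

```python
# Hypothetical rule list, far smaller than the one the Porter stemmer uses.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    # Strip the first matching suffix, keeping at least two characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for w in ["went", "gone", "going", "waiting", "goes", "troubled"]:
    print(w, "\t", toy_stem(w))
```

Because the rules only look at the end of the word, irregular forms such as ‘went’ pass through unchanged, and ‘goes’ and ‘troubled’ are cut down to strings that are not words.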

Now, let’s see how to perform lemmatization using spaCy. We can get the lemma for each token through its lemma_ attribute.

doc = nlp('went gone go goes going')

for token in doc:
    print(token.text, '\t', token.lemma_)

went    go
gone    go
go      go
goes    go
going   go

The words went, gone, goes and going have all been converted to their root ‘go’, unlike with the Porter stemmer, which converted goes to goe and left went and gone unchanged.
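One way to picture why a lemmatizer can handle irregular forms is as a dictionary lookup. The sketch below is only an illustration under that assumption; LEMMA_TABLE and toy_lemma are hypothetical names, and spaCy’s own lemmatizer is considerably more involved than a single table.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {"went": "go", "gone": "go", "goes": "go", "going": "go"}


def toy_lemma(word):
    # Fall back to the lowercased word when no entry exists.
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for w in "went gone go goes going".split():
    print(w, "\t", toy_lemma(w))
```

Unlike suffix stripping, the lookup maps ‘went’ to ‘go’ because the relationship is stored explicitly rather than derived from the spelling.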

Stop Words:

Stop words are words that are very common in a language and contribute little to the meaning of a sentence.

Because they occur so frequently, they provide little value in helping to distinguish one document from another, so we can often remove them from the text before further NLP processing.

Examples of stop words include “a,” “am,” “and,” “the,” “in,” “of,” and more.

We can import the default stop words list in spaCy using the following code.

from spacy.lang.en.stop_words import STOP_WORDS

print(len(STOP_WORDS))

There are 326 stop words in total (the exact count can differ between spaCy versions). Let’s print them.

{‘can’, ‘noone’, ‘anyway’, ‘latterly’, ‘so’, ‘how’, ‘nor’, ‘top’, ‘whereafter’, “‘d”, ‘whence’, ‘moreover’, ‘per’, ‘amongst’, ‘because’, ‘once’, ‘always’, ‘therefore’, ‘further’, ‘beyond’, ‘herself’, ‘may’, ‘twelve’, ‘afterwards’, ‘never’, ‘elsewhere’, ‘while’, ‘still’, ‘otherwise’, ‘into’,

‘himself’, ‘with’, ‘around’, “‘m”, ‘somehow’, ‘everything’, ‘least’, ‘she’, ‘see’, ‘’ve’, ‘my’, ‘namely’, ‘nothing’, ‘‘d’, ‘but’, ‘toward’, ‘back’, ‘n‘t’, ‘eight’, ‘last’, ‘together’, ‘they’, ‘up’, ‘hereby’, “‘re”, “‘ll”, ‘since’, ‘everywhere’, ‘anywhere’, ‘thence’, ‘show’, ‘is’, ‘thereupon’,

‘its’, ‘really’, ‘became’, ‘themselves’, ‘him’, ‘get’, ‘wherever’, ‘take’, ‘nevertheless’, “‘ve”, ‘did’, ‘am’, ‘fifty’, ‘make’, ‘same’, ‘else’, ‘thru’, ‘onto’, ‘nine’, ‘whereupon’, ‘who’, ‘unless’, ‘by’, ‘an’, ‘now’, ‘hence’, ‘many’, ‘before’, ‘say’, ‘cannot’, ‘on’, ‘sometimes’, ‘even’, ‘six’,

‘fifteen’, ‘please’, ‘them’, ‘enough’, ‘it’, ‘formerly’, ‘these’, ‘’d’, ‘above’, ‘his’, ‘whither’, ‘becomes’, ‘more’, ‘hers’, ‘though’, ‘about’, ‘seemed’, ‘first’, ‘meanwhile’, ‘again’, ‘through’, ‘four’, ‘under’, ‘a’, ‘via’, ‘hereafter’, ‘whoever’, ‘why’, ‘wherein’, ‘perhaps’, ‘neither’, ‘n’t’, ‘had’,

‘i’, ‘‘s’, ‘whenever’, ‘quite’, ‘what’, ‘anything’, ‘bottom’, ‘must’, ‘myself’, ‘’s’, ‘beside’, ‘’re’, ‘only’, ‘almost’, ‘whereas’, ‘much’, ‘’ll’, ‘would’, ‘often’, ‘here’, ‘used’, ‘mostly’, ‘own’, ‘someone’, ‘therein’, ‘behind’, “n’t”, ‘at’, ‘was’, ‘seem’, ‘nobody’, ‘somewhere’, ’empty’, ‘go’, ‘alone’,

‘ever’, ‘due’, ‘already’, ‘whom’, ‘former’, ‘down’, ‘some’, ‘ten’, ‘be’, ‘between’, ‘within’, ‘in’, ‘every’, ‘that’, ‘being’, ‘our’, ‘each’, ‘becoming’, ‘might’, ‘seeming’, ‘hereupon’, ‘various’, ‘been’, ‘were’, ‘yourself’, ‘and’, ‘her’, ‘most’, ‘their’, ‘we’, ‘re’, ‘five’, ‘whose’, ‘after’, ‘have’, ‘except’,

‘other’, ‘anyhow’, ‘as’, ‘few’, ‘forty’, ‘latter’, ‘the’, ‘twenty’, ‘of’, ‘whereby’, ‘sixty’, ‘all’, ‘thereafter’, ‘done’, ‘has’, “‘s”, ‘nowhere’, ‘next’, ‘you’, ‘two’, ‘although’, ‘those’, ‘whole’, ‘eleven’, ‘thereby’, ‘without’, ‘’m’, ‘either’, ‘us’, ‘well’, ‘however’, ‘itself’, ‘move’, ‘ca’, ‘no’,

‘towards’, ‘to’, ‘beforehand’, ‘across’, ‘both’, ‘part’, ‘your’, ‘others’, ‘yet’, ‘along’, ‘not’, ‘for’, ‘less’, ‘throughout’, ‘made’, ‘he’, ‘‘ll’, ‘‘m’, ‘any’, ‘herein’, ‘sometime’, ‘‘re’, ‘rather’, ‘then’, ‘third’, ‘besides’, ‘ourselves’, ‘where’, ‘name’, ‘amount’, ‘until’, ‘whatever’, ‘regarding’, ‘very’,

‘or’, ‘anyone’, ‘side’, ‘against’, ‘too’, ‘among’, ‘full’, ‘out’, ‘there’, ‘over’, ‘put’, ‘below’, ‘front’, ‘mine’, ‘do’, ‘are’, ‘something’, ‘yours’, ‘should’, ‘upon’, ‘such’, ‘‘ve’, ‘than’, ‘thus’, ‘ours’, ‘another’, ‘if’, ‘several’, ‘call’, ‘give’, ‘this’, ‘yourselves’, ‘using’, ‘which’, ‘everyone’,

‘serious’, ‘none’, ‘one’, ‘indeed’, ‘from’, ‘will’, ‘during’, ‘just’, ‘keep’, ‘me’, ‘when’, ‘whether’, ‘become’, ‘off’, ‘doing’, ‘seems’, ‘could’, ‘hundred’, ‘three’, ‘also’, ‘does’}

Now we’ll see how to remove stop words in a sentence.

nlp = spacy.load('en_core_web_sm')
doc = nlp('I have been to the U.K., U.S.A. and France.')

# Keep only the tokens whose text is not in the stop word set
print([token.text for token in doc if token.text not in STOP_WORDS])

['I', 'U.K.', ',', 'U.S.A.', 'France', '.']

However, the problem with the default stop word list is that it also contains words which might be useful. Take a look at the following example.

doc = nlp('I did not like the movie')
print([token.text for token in doc if token.text not in STOP_WORDS])

['I', 'like', 'movie']

As you can see from the result, the word ‘not’ has been removed from our sentence because the default stop words list includes it.

Because of that, the sentiment of the sentence changed completely from negative to positive.

To avoid this, we can create another set containing the words we don’t want treated as stop words and subtract it from the default set.

not_stopwords = {'not'}

# Set difference: drop our exceptions from the default stop words
STOP_WORDS = STOP_WORDS - not_stopwords

Now let’s print the results again.

doc = nlp('I did not like the movie.')
print([token.text for token in doc if token.text not in STOP_WORDS])

['I', 'not', 'like', 'movie', '.']

The word ‘not’ is now kept, while ‘did’ and ‘the’ are still removed because they remain in the stop word set.

You can add as many words as you want to the set not_stopwords.
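The filtering logic can also be sketched without spaCy. In the example below, SAMPLE_STOP_WORDS is a tiny hand-picked subset used only for illustration, and remove_stop_words is a hypothetical helper. It also lowercases each token before the lookup, which is an addition to the approach above: the stop word list is lowercase, so a case-sensitive check lets ‘I’ through.

```python
# Tiny illustrative subset, not spaCy's full stop word list.
SAMPLE_STOP_WORDS = {"i", "did", "not", "the"}


def remove_stop_words(tokens, stop_words, keep=frozenset()):
    # Words in `keep` are exempt from removal even if they are stop words.
    active = set(stop_words) - set(keep)
    # Compare in lowercase so that 'I' matches the entry 'i'.
    return [tok for tok in tokens if tok.lower() not in active]


tokens = ["I", "did", "not", "like", "the", "movie", "."]
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS))
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS, keep={"not"}))
```

Passing the exceptions as a separate argument leaves the original set untouched, so different tasks can use different exception lists.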

Part of Speech Tagging:

A POS tag tells us the part-of-speech of a given word. The common categories include nouns, verbs, articles, pronouns, adverbs, and so on.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I did not like the movie')

# Pair each token with its coarse-grained part-of-speech tag
print([(token, token.pos_) for token in doc])

Assigning POS tags to words is not an easy task, because the part of speech of a word depends on its context and can change from one sentence to another.

Let’s see an example with the word ‘play’. Take a look at the following two sentences.

‘I like to play in the park with my friends’ and

‘We’re going to see a play tonight at the theater’.

In the first sentence the word ‘play’ is a verb, and in the second sentence it is a noun.

Let’s see how spaCy’s POS tagger performs.

doc = nlp('I like to play in the park with my friends')
print([(token, token.pos_) for token in doc if token.text == 'play'])

For this sentence the word ‘play’ has been correctly identified as a verb.

Now let’s check for the second sentence.

# Double quotes are needed because the sentence contains an apostrophe
doc = nlp("We're going to see a play tonight at the theater")
print([(token, token.pos_) for token in doc if token.text == 'play'])

spaCy has identified the POS for the word ‘play’ correctly in both the sentences.
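To see why context matters, consider a deliberately crude heuristic that decides the tag for ‘play’ from the word before it. This is a toy sketch, not how spaCy’s statistical tagger works; the function name tag_play and its two rules are assumptions made for illustration.

```python
def tag_play(sentence):
    # Hypothetical rule: 'play' after 'to' is a verb, after an article a noun.
    tokens = sentence.split()
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "play":
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        if prev == "to":
            tags.append((tok, "VERB"))
        elif prev in {"a", "an", "the"}:
            tags.append((tok, "NOUN"))
        else:
            tags.append((tok, "UNKNOWN"))
    return tags


print(tag_play("I like to play in the park with my friends"))
print(tag_play("We're going to see a play tonight at the theater"))
```

Two rules are enough for these two sentences, but hand-written rules break down quickly on other inputs, which is why taggers are usually learned from annotated data.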

For the complete list of POS tags in spaCy, see https://spacy.io/api/annotation#pos-tagging

Named Entity Recognition:

Named entity recognition identifies entities in a text sequence, such as people, places, and organizations. The following are some of the built-in entity types in spaCy:

PERSON: People, including fictional ones
NORP: Nationalities or religious or political groups
FAC: Buildings, airports, highways, bridges, and so on
ORG: Companies, agencies, institutions, and so on
GPE: Countries, cities, and states
LOC: Non GPE locations, mountain ranges, and bodies of water
PRODUCT: Objects, vehicles, foods, and so on (not services)
EVENT: Named hurricanes, battles, wars, sports events, and so on
WORK_OF_ART: Titles of books, songs, and so on
LAW: Named documents made into laws
LANGUAGE: Any named language

One problem with named entity recognition models is that they are domain-specific: a NER model developed for one domain may not perform well in others.

One such domain is healthcare. Take a look at the following example.

doc = nlp('He had been previously diagnosed with hyperthyroidism')
print([(token, token.ent_type_) for token in doc])

[(He, ''), (had, ''), (been, ''), (previously, ''), (diagnosed, ''), (with, ''), (hyperthyroidism, '')]

We would like the NER model to detect ‘hyperthyroidism’ as a disease. However, as you can see, spaCy’s NER model did not assign it any entity type; the built-in entity types listed above do not include a disease category.

So, in such cases, it is best to train our own NER model.
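Before investing in training, a simple dictionary lookup can show what the desired output would look like. The sketch below is a stopgap under stated assumptions, not a trained model: DISEASE_TERMS is a hypothetical hand-made term list, tag_diseases is a hypothetical helper, and DISEASE is a made-up label rather than one of spaCy’s built-in types.

```python
# Hypothetical term list; a real one would come from a medical vocabulary.
DISEASE_TERMS = {"hyperthyroidism", "diabetes"}


def tag_diseases(sentence):
    # Label a token DISEASE when it appears in the term list, else leave it empty.
    tagged = []
    for tok in sentence.split():
        normalized = tok.lower().strip(".,")
        label = "DISEASE" if normalized in DISEASE_TERMS else ""
        tagged.append((tok, label))
    return tagged


tagged = tag_diseases("He had been previously diagnosed with hyperthyroidism")
print(tagged)
```

A lookup like this only finds terms it already knows, so it cannot generalize to unseen disease names the way a model trained on domain data can.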

SUMMARY:

In this article, we discussed various techniques related to NLP preprocessing. The following is a quick summary of the techniques we covered:

Tokenization: process of breaking up a text sequence into tokens which can be sentences, words, numbers or punctuation marks.

Lemmatization: process of converting a word to its base form.

Stop word removal: removing words that don’t have much impact on the sense of a sentence.

Part of Speech Tagging: identifying the part of speech of a word.

Named Entity Recognition: identifying entities in a text sequence, such as people, places, and organizations.
