In this article, we’ll discuss some common NLP preprocessing techniques for handling text data.

As you may have guessed from the title, we’ll use spaCy for most of the tasks in this article. If you don’t have it installed, follow the spaCy installation instructions to set it up on your computer.

Tokenization:

When working with text data, the first step is to divide the text into a list of tokens. This is called tokenization: the process of breaking up a text sequence into tokens, which can be sentences, words, numbers, or punctuation marks.

Performing tokenization with spaCy is very straightforward.
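Here is a minimal sketch of what that can look like, assuming spaCy is installed. It uses spacy.blank("en"), which builds an English pipeline containing only the tokenizer, so no trained model has to be downloaded. The sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained model needed).
nlp = spacy.blank("en")

# Hypothetical sample sentence, chosen for illustration.
doc = nlp("Tokenization is the first step.")

# Iterating over a Doc yields Token objects; .text is the token's string.
tokens = [token.text for token in doc]
print(tokens)
```

Calling nlp on a string returns a Doc object, and the tokens keep their original order and text.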


You might think that tokenizing is just splitting the text on whitespace or on punctuation marks such as ‘.’ or ‘,’. However, it is more complex than that.

Let’s try to tokenize the sentence ‘I have been to the U.K., U.S.A. and France.’
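A short sketch for this sentence, again assuming a tokenizer-only pipeline created with spacy.blank("en"):

```python
import spacy

# Tokenizer-only English pipeline; no trained model is required.
nlp = spacy.blank("en")

doc = nlp("I have been to the U.K., U.S.A. and France.")

# Collect the text of every token so the split points are easy to inspect.
tokens = [token.text for token in doc]
print(tokens)
```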


As you can see from the result, the tokenizer treats ‘U.K.’ and ‘U.S.A.’ each as a single token instead of splitting them into pieces such as ‘U’, ‘.’ and ‘K’.

It also recognizes that the period following ‘France’ marks the end of the sentence and should be treated as a separate token.

Stemming and Lemmatization:

Stemming and lemmatization are two different but closely related methods for converting a word to its root or base form.

The difference is that stemming is a rule-based process that trims or replaces word endings to arrive at a root form, which need not be a real word, while lemmatization reduces a word to its canonical dictionary form, called a lemma.

Lemmatization typically produces a more meaningful base form than stemming.

Since spaCy doesn’t provide a stemmer, we’ll use NLTK for stemming.

NLTK provides several well-known stemmers, such as Lancaster, Porter, and Snowball. The Porter stemmer works well in many cases, so we’ll use it to extract stems from a sentence.

To learn more about the rules of Porter Stemming visit this link.
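The sketch below shows one way to do this, assuming NLTK is installed. The sentence is a made-up example, and it is split on whitespace with str.split() purely to keep the snippet free of extra tokenizer downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical example sentence, split on whitespace for simplicity.
sentence = "He connected the cables and tested the connections"
words = sentence.split()

# stem() applies the Porter suffix-stripping rules to a single word.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Related forms such as ‘connected’ and ‘connections’ end up with the same stem, which is the main reason stemming is used before counting or matching words.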


Since stemming typically removes the last few characters of a word, it can sometimes produce results that are not real words. Take a look at the following example.
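A small sketch of such a case, assuming NLTK’s PorterStemmer and using different forms of the verb ‘go’:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different inflected forms of the same verb.
words = ["went", "gone", "goes", "going"]

# The stemmer only strips suffixes; it knows nothing about irregular verbs.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Only ‘going’ is reduced to ‘go’. The irregular forms are left alone, and ‘goes’ becomes ‘goe’, which is not an English word.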


Now, let’s see how to perform lemmatization using spaCy. We can get the lemma for each token through its lemma_ attribute.


The words went, gone, goes, and going have all been converted to their root ‘go’, unlike the Porter stemmer, which converted goes to goe and left went and gone unchanged.

Stop Words:

Stop words are words that are very common in a language and usually contribute little to the meaning of a sentence.

Because they occur so frequently, they provide little value in distinguishing one document from another, so we can often remove them from the text before further NLP processing.

Examples of stop words include “a,” “am,” “and,” “the,” “in,” “of,” and more.

We can import the default stop words list in spaCy using the following code.
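A short sketch, assuming spaCy is installed. STOP_WORDS is the set of default English stop words that ships with the library, so no trained model is needed.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of lowercase strings.
print(len(STOP_WORDS))

# Print a sorted preview; drop the slice to see the full list.
print(sorted(STOP_WORDS)[:20])
```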


The default list contains 326 stop words in the spaCy version used here; the exact count can differ between releases.

Now we’ll see how to remove stop words in a sentence.
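One possible way to do this is sketched below, assuming a tokenizer-only pipeline and a made-up example sentence. Each token is lowercased and kept only if it is not in the default stop word set.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Tokenizer-only English pipeline; no trained model is required.
nlp = spacy.blank("en")

# Hypothetical example sentence.
doc = nlp("This is a sample sentence about text preprocessing")

# Keep only the tokens whose lowercase form is not a stop word.
filtered = [token.text for token in doc if token.text.lower() not in STOP_WORDS]
print(filtered)
```

spaCy tokens also expose an is_stop attribute that could be used instead. The explicit set lookup is used here because it makes it easy to swap in a customized stop word set.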


However, the problem with the default stop word list is that it also contains words which might be useful. Take a look at the following example.
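The sketch below illustrates the problem with a made-up negative sentence, using the same filtering approach as before:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Tokenizer-only English pipeline; no trained model is required.
nlp = spacy.blank("en")

# Hypothetical example sentence with a negation.
doc = nlp("The movie was not good")

# Remove every token whose lowercase form is in the default stop word set.
filtered = [token.text for token in doc if token.text.lower() not in STOP_WORDS]
print(filtered)
```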


As you can see from the result, the word ‘not’ has been removed from our sentence because it is included in the default stop words list.

Because of that, the sentiment of the sentence completely changed from negative to positive.

To fix this, we can create a set of the words we want to keep and subtract it from the default stop words.

Now let’s print the results again.
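A minimal sketch of this idea, reusing the made-up sentence ‘The movie was not good’. The name custom_stop_words is a hypothetical helper introduced only for this example.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Words we want to keep even though they are in the default list.
not_stopwords = {"not"}

# Set difference leaves the default STOP_WORDS untouched.
custom_stop_words = STOP_WORDS - not_stopwords

nlp = spacy.blank("en")
doc = nlp("The movie was not good")

# Filter against the customized set instead of the default one.
filtered = [token.text for token in doc if token.text.lower() not in custom_stop_words]
print(filtered)
```

Building a new set rather than modifying STOP_WORDS in place avoids changing the defaults for other code that imports them.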


You can add as many words as you want to the set not_stopwords.

Part of Speech Tagging:

A POS tag tells us the part-of-speech of a given word. The common categories include nouns, verbs, articles, pronouns, adverbs, and so on.

Assigning POS tags to words is not an easy task, as the part of speech of a word depends on its context and can change from one sentence to another.

Let’s see an example with the word ‘play’. Take a look at the following two sentences.

‘I like to play in the park with my friends’ and

‘We’re going to see a play tonight at the theater’.

In the first sentence the word play is a ‘verb’ and in the second sentence the word play is a ‘noun’.

Let’s see how spaCy’s POS tagger performs.


For this sentence the word ‘play’ has been correctly identified as a verb.

Now let’s check for the second sentence.


spaCy has identified the POS for the word ‘play’ correctly in both sentences.

To get the complete list of POS tags in spaCy visit the link https://spacy.io/api/annotation#pos-tagging

Named Entity Recognition:

Named entity recognition (NER) identifies entities in a text sequence, such as people, organizations, and locations. The following are some of the built-in entity types in spaCy’s English models:

  • PERSON: People, including fictional ones
  • NORP: Nationalities or religious or political groups
  • FAC: Buildings, airports, highways, bridges, and so on (called FACILITY in older spaCy models)
  • ORG: Companies, agencies, institutions, and so on
  • GPE: Countries, cities, and states
  • LOC: Non GPE locations, mountain ranges, and bodies of water
  • PRODUCT: Objects, vehicles, foods, and so on (not services)
  • EVENT: Named hurricanes, battles, wars, sports events, and so on
  • WORK_OF_ART: Titles of books, songs, and so on
  • LAW: Named documents made into laws
  • LANGUAGE: Any named language

One limitation of NER models is that they are domain-specific. A NER model developed for one domain may not perform well in other domains.

One such domain is Healthcare. Take a look at the following example.


We expect the NER model to detect ‘hyperthyroidism’ as a disease. However, as you can see, spaCy’s NER model failed to identify it as one.

In such cases, it is usually better to train a custom NER model on domain-specific data.

SUMMARY:

In this article, we discussed various techniques related to NLP preprocessing. The following is a quick summary of those techniques:

Tokenization: process of breaking up a text sequence into tokens which can be sentences, words, numbers or punctuation marks.

Lemmatization: process of converting a word to its base form.

Stop word removal: removing words that don’t have much impact on the sense of a sentence.

Part of Speech Tagging: identifying the part of speech of a word.

Named Entity Recognition: identifying entities in a text sequence, such as people, organizations, and locations.


