IGTechTeam

What will you learn?

Get all the NLP courses with video tutorials (start to end with projects)
Tokenization in Natural Language Processing
Stop Words in NLP
Stemming and Lemmatization

What is Natural Language Processing?

Why Natural Language Processing?

Get all the NLP courses with video tutorials (start to end with projects)

I will be discussing several topics related to Natural Language Processing:

These are the topics on which I am concentrating.

I created an entire playlist for this course.

You can view everything. It will be a lot of fun to study all of these concepts through practical examples and demonstrations.

Tokenization in Natural Language Processing

Tokenization is the process of breaking a piece of text (a paragraph, a sentence, or a word) into smaller units called tokens.

For instance, take the sentence: I am studying Natural Language Processing.

After tokenization, the output will be: 'I', 'am', 'studying', 'Natural', 'Language', 'Processing' (NLTK's word tokenizer also returns the final '.' as its own token).

This is the initial step of text pre-processing, followed by stemming and lemmatization, and so on:

Tokenization: Split the text into smaller units
Stemming and Lemmatization: Reduce a word to its base form
Feature Extraction and so on.

First of all, we have to install nltk (Natural Language Toolkit), for example with pip install nltk. Then we can download the data packages it needs using the command nltk.download(). After that, all that is left is to pass our text into the tokenizer.

import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

paragraph="""Natural language processing (NLP) is an artificial intelligence (AI) technique that lets users communicate with intelligent computers using a natural language, like English. Natural language processing is essential when we want an intelligent system, such as a robot, to follow our instructions, when we want to hear a conclusion from a dialogue-based clinical expert system, and so on. The field of NLP is concerned with teaching computers to execute meaningful tasks using the natural languages that humans use. An NLP system's input and output can be speech and written text."""

# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print(len(sentences))

# Tokenizing words
words = nltk.word_tokenize(paragraph)
print(words)
print(len(words))
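
To make the idea concrete, here is a minimal sketch of sentence and word tokenization using only Python's standard library. The function names split_sentences and split_words are made up for this illustration, and the regular expressions are deliberately naive; this is not how NLTK works internally.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "I am studying Natural Language Processing. It is fun!"
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[0]))
```

A splitter this simple will break on abbreviations such as "Dr." in the middle of a sentence, which is one reason to prefer a library tokenizer for real text.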

Stop words in Natural Language Processing

Stop words are words that can be ignored to save space and reduce complexity, because they carry little meaning in a text or document (e.g. to, I, he, his, from, up, down, what, which, a, an, the, etc.).
For example: He is an honest boy.

Here, the stop words are He, is, and an.

After removing stop words, our sentence will be: honest boy
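
The example above can be sketched in a few lines of plain Python. The stop word set here is a tiny hand-picked one for illustration only (it is not NLTK's list), and remove_stop_words is a hypothetical helper name. Tokens are lowercased before the lookup so that "He" matches "he".

```python
# A tiny hand-picked stop word set, for illustration only (not NLTK's list).
STOP_WORDS = {"he", "is", "an", "a", "the", "to", "i"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["He", "is", "an", "honest", "boy"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['honest', 'boy']
```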

How can we know which are stop words?

Stop words are already defined in nltk (Natural Language Toolkit). We can see all the English stop words by printing stopwords.words('english'), and stopwords.fileids() lists the languages that are available.

We may have a very large file. Removing these stop words reduces the size of the file without losing our important data.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

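print(stopwords.words('english'))

# Removing stopwords from given tokens
# NLTK's stop word list is lowercase, so compare lowercased tokens; build the set once
stop_words = set(stopwords.words('english'))
lst = [word for word in ("Welcome","to","my","channel","IG","Tech","Team") if word.lower() not in stop_words]
print(lst)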

paragraph="""Natural language processing (NLP) is an artificial intelligence (AI) technology that uses a natural language, for instance English, to interact with intelligent systems. Natural language processing is essential when we want an intelligent system, such as a robot, to follow our instructions, when we want to hear a conclusion from a dialogue-based clinical expert system, and so on. The field of NLP is concerned with teaching computers to execute meaningful tasks using the natural languages that humans use. An NLP system's input and output can be speech and written text."""

# Tokenizing words
words = nltk.word_tokenize(paragraph)

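# Removing stopwords from above paragraph
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
# lowercase each token before the lookup so that "The" and "An" are removed too
lst = [word for word in words if word.lower() not in stop_words]
print(lst)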

# Assignment
#--> Besides English, which other languages can be passed to stopwords.words() as a parameter to get their stop words?



# Solution:
print(stopwords.fileids())

Stemming and Lemmatization in Natural Language Processing

Stemming and lemmatization are very important concepts in Natural Language Processing. They transform a word into its stem or lemma (i.e., into its base form), so that multiple forms of the same word are treated as the same term.

For example: if a text document contains the two words 'girl' and 'girls', they carry the same meaning but differ in form (singular and plural). A model like Bag of Words or TF-IDF may treat them as two unrelated terms. So we apply stemming or lemmatization to convert these words into their root or base form.

Girl --> Girl

Girls --> Girl
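
To see why this matters for a count-based model, here is a small sketch using collections.Counter as a stand-in for a Bag of Words representation. The naive_singular function is a toy normalizer invented for this example; it is not a real stemmer or lemmatizer.

```python
from collections import Counter

def naive_singular(word):
    """Toy normalizer (an assumption for this sketch): lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

tokens = ["girl", "girls", "Girl", "school"]

# Without normalization, every surface form is its own feature.
raw_counts = Counter(tokens)
# After normalization, the three variants collapse into one feature.
normalized_counts = Counter(naive_singular(t) for t in tokens)

print(raw_counts)
print(normalized_counts)
```

Before normalization the vocabulary has four entries, each with a count of one. Afterwards it has two, and 'girl' has a count of three.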

But what exactly is the distinction between stemming and lemmatization?

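Stemming converts a word into a base form by stripping suffixes, without considering whether that base form has any meaning.

For example, studies --> studi

I don't think 'studi' has any meaning.

Lemmatization, on the other hand, converts a word into a base form that is a proper, meaningful word.

For example, studies --> study. The result can depend on the part of speech: with NLTK's WordNetLemmatizer, gone --> go when the word is treated as a verb, but it is left as gone when treated as a noun (the default).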
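
The difference can be sketched with two toy functions in plain Python. Both are simplifications written for this post: toy_stem applies a few made-up suffix rules, and toy_lemmatize looks words up in a hypothetical three-entry dictionary that stands in for a real lexical database such as WordNet. Neither reproduces what NLTK does.

```python
def toy_stem(word):
    """Rule-based suffix stripping; the result need not be a real word."""
    word = word.lower()
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini-dictionary, for illustration only.
LEMMAS = {"studies": "study", "girls": "girl", "running": "run"}

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged (lowercased)."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["studies", "girls", "running"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer is cheap and needs no vocabulary, but it can produce non-words such as 'studi' and 'runn'. The lookup always returns a real word, but only for forms the dictionary knows.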

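import nltk
nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer; some NLTK versions may also ask for 'omw-1.4'
from nltk.corpus import stopwords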

# Applying stemming in given words
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer()
words=["go","goes","google","googling","goal","goals"]
words = [stemmer.stem(word) for word in words]
print(words)

# Applying lemmatization in given words
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words=["go","goes","google","googling","goal","goals"]
words = [lemmatizer.lemmatize(word) for word in words]
print(words)


#paragraph
paragraph="""Natural language processing (NLP) is an artificial intelligence (AI) technique for collaborating with intelligent systems that uses a natural language, for instance, English. Natural language processing is essential when we want an intelligent system, such as a robot, to follow our instructions, when we want to hear a conclusion from a dialogue-based clinical expert system, and so on. The field of NLP is concerned with teaching computers to execute meaningful tasks using the natural languages that humans use. An NLP system's input and output can be speech and written text."""

# Apply tokenization and then stemming
stop_words = set(stopwords.words('english'))  # build once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # lowercase before the lookup because NLTK's stop word list is lowercase
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)
print(sentences)

# Apply tokenization and then lemmatization
stop_words = set(stopwords.words('english'))  # build once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # lowercase before the lookup because NLTK's stop word list is lowercase
    words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)
print(sentences)

Conclusion:

This post was a deep dive into NLP. In this post, we have learned about tokenization, stop words, stemming, and lemmatization. Make sure you check out the complete playlist. I have made three projects: Movie Recommendation System, Next Word Prediction, and Question Answering.

I hope this post is helpful to you. Don't hesitate to ask me in the comment section if you have any questions. I will get back to you as soon as possible. Thanks.

NLP Deep Dive: Complete Course on Tokenization, Stop Words, Stemming and Lemmatization