NLP Projects: Next Word Prediction and Question Answering
Let's build some NLP projects.
What will you learn?
Next word prediction:
Given the first three words of a sentence, the model should be able to guess the fourth. I'm using a Long Short-Term Memory (LSTM) network for this.
Note: LSTM is a type of Recurrent Neural Network (RNN) architecture used in deep learning. A typical LSTM unit consists of an input gate, an output gate, a forget gate, and a cell state.
For more information, click here
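To make the note above concrete, here is a tiny NumPy sketch of a single LSTM step (illustrative only; Keras implements all of this inside its LSTM layer). The weight matrices below are random placeholders, not trained values, and the helper names are mine, not part of any library.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the input (i), forget (f), output (o)
    # gates and the candidate cell state (g)
    i = sigmoid(x @ W['i'] + h_prev @ U['i'] + b['i'])   # input gate
    f = sigmoid(x @ W['f'] + h_prev @ U['f'] + b['f'])   # forget gate
    o = sigmoid(x @ W['o'] + h_prev @ U['o'] + b['o'])   # output gate
    g = np.tanh(x @ W['g'] + h_prev @ U['g'] + b['g'])   # candidate cell state
    c = f * c_prev + i * g     # new cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c
# Toy usage with 4-dimensional inputs and states
dim = 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(dim, dim)) for k in 'ifog'}
U = {k: rng.normal(size=(dim, dim)) for k in 'ifog'}
b = {k: np.zeros(dim) for k in 'ifog'}
h, c = lstm_step(rng.normal(size=dim), np.zeros(dim), np.zeros(dim), W, U, b)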
1: Import libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import pickle
import numpy as np
import os
2: Load your file
from google.colab import files
uploaded = files.upload()
3: Open and pre-process the data
file = open("Pride and Prejudice.txt", "r", encoding="utf8")
# store file in list
lines = []
for i in file:
    lines.append(i)
# Convert list to string
data = ' '.join(lines)
# remove unnecessary characters: newline, carriage return, BOM, curly quotes
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '').replace('“', '').replace('”', '')
# remove unnecessary spaces
data = data.split()
data = ' '.join(data)
data[:500]
len(data)
4: Implement tokenization and make additional adjustments
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# saving the tokenizer for predict function
pickle.dump(tokenizer, open('token.pkl', 'wb'))
sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:15]
len(sequence_data)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
sequences = []
for i in range(3, len(sequence_data)):
    words = sequence_data[i-3:i+1]
    sequences.append(words)
print("The length of sequences is: ", len(sequences))
sequences = np.array(sequences)
sequences[:10]
X = []
y = []
for i in sequences:
    X.append(i[0:3])
    y.append(i[3])
X = np.array(X)
y = np.array(y)
print("Data: ", X[:10])
print("Response: ", y[:10])
y = to_categorical(y, num_classes=vocab_size)
y[:5]
5: Creating the model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=3))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))
model.summary()
6: Plot the model
from tensorflow import keras
keras.utils.plot_model(model, to_file='plot.png', show_layer_names=True)
7: Train the model
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint("next_words.h5", monitor='loss', verbose=1, save_best_only=True)
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
model.fit(X, y, epochs=70, batch_size=64, callbacks=[checkpoint])
8: Let's predict
from tensorflow.keras.models import load_model
import numpy as np
import pickle
# Load the model and tokenizer
model = load_model('next_words.h5')
tokenizer = pickle.load(open('token.pkl', 'rb'))
def Predict_Next_Words(model, tokenizer, text):
    # Convert the last three words to their token ids and predict the id of the next word
    sequence = tokenizer.texts_to_sequences([text])
    sequence = np.array(sequence)
    preds = np.argmax(model.predict(sequence))
    predicted_word = ""
    # Look up the word that corresponds to the predicted id
    for key, value in tokenizer.word_index.items():
        if value == preds:
            predicted_word = key
            break
    print(predicted_word)
    return predicted_word
while True:
    text = input("Enter your line: ")
    if text == "0":
        print("Execution completed.....")
        break
    else:
        try:
            # Use only the last three words of the input line
            text = text.split(" ")
            text = text[-3:]
            print(text)
            Predict_Next_Words(model, tokenizer, text)
        except Exception as e:
            print("Error occurred: ", e)
            continue
Final output:
Enter your line: The Project Gutenberg (our input)
['The', 'Project', 'Gutenberg'] (takes the last three words of our input)
literary (return a next word)
Enter your line: The Project Gutenberg eBook of
['Gutenberg', 'eBook', 'of']
pride
Enter your line: how can you abuse your own
['abuse', 'your', 'own']
children
Enter your line: He was quite
['He', 'was', 'quite']
young
Enter your line: He could not help seeing that you were about five times as
['five', 'times', 'as']
pretty
Enter your line: and her sister
['and', 'her', 'sister']
scarcely
Enter your line: however, it may all come to
['all', 'come', 'to']
nothing
Enter your line: 0 (to stop execution)
Execution completed.....
Question Answering using BERT | Natural Language Processing
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained, transformer-based language model that reads text in both directions at once, which makes it well suited for tasks such as question answering. Here we use a version of BERT that has already been fine-tuned on the SQuAD question-answering dataset.
Now let's move to the Question Answering project using BERT:
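Before building everything by hand, note that the same fine-tuned model can also be queried through the high-level pipeline API from transformers. This is just a minimal sketch of what we are about to reproduce step by step; the rest of this section builds the same logic manually.
from transformers import pipeline
# The pipeline wraps tokenization, the model forward pass, and answer extraction.
# It downloads the same SQuAD-fine-tuned BERT model used below.
qa = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa(question="What is the name of YouTube Channel",
            context="Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team")
print(result["answer"], result["score"])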
1: Install Required Libraries
pip install transformers
pip install torch
2: Importing required packages
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import torch
import numpy as np
3: Load the pre-trained Bert model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer_for_bert = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
4: Define the function for question-answering
def bert_question_answer(question, passage, max_len=500):
    """
    question: What is the name of YouTube Channel
    passage: Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team
    """
    # Tokenize the question and passage together;
    # the tokenizer adds the special tokens [CLS] and [SEP]
    input_ids = tokenizer_for_bert.encode(question, passage, max_length=max_len, truncation=True)
    """
    [101, 2054, 2003, 1996, 2171, 1997, 7858, 3149, 102, 3422, 3143, 2377, 9863, 1997, 3019, 2653, 6364, 1012,
    2123, 1005, 1056, 5293, 2000, 2066, 1010, 3745, 1998, 4942, 29234, 2026, 3149, 1045, 2290, 6627, 2136, 102]
    """
    # Number of tokens in the 1st sentence (question) and 2nd sentence (passage that contains the answer)
    sep_index = input_ids.index(102)          # 102 is the id of the [SEP] token
    len_question = sep_index + 1
    len_passage = len(input_ids) - len_question
    """
    8
    9
    27
    """
    # Separate question and passage:
    # segment ids are 0 for the question and 1 for the passage
    segment_ids = [0]*len_question + [1]*len_passage
    """
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    """
    # Convert token ids back to tokens
    tokens = tokenizer_for_bert.convert_ids_to_tokens(input_ids)
    """
    tokens = ['[CLS]', 'what', 'is', 'the', 'name', 'of', 'youtube', 'channel', '[SEP]', 'watch', 'complete',
    'play', '##list', 'of', 'natural', 'language', 'processing', '.', 'don', "'", 't', 'forget', 'to', 'like',
    ',', 'share', 'and', 'sub', '##scribe', 'my', 'channel', 'i', '##g', 'tech', 'team', '[SEP]']
    """
    # Get start and end scores for the answer;
    # convert the input lists to torch tensors before passing them to the model
    outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    start_token_scores = outputs[0]
    end_token_scores = outputs[1]
    """
    tensor([[-5.9787, -3.0541, -7.7166, -5.9291, -6.8790, -7.2380, -1.8289, -8.1006,
    -5.9786, -3.9319, -5.6230, -4.1919, -7.2068, -6.7739, -2.3960, -5.9425,
    -5.6828, -8.7007, -4.2650, -8.0987, -8.0837, -7.1799, -7.7863, -5.1605,
    -8.2832, -5.1088, -8.1051, -5.3985, -6.7129, -1.4109, -3.2241, 1.5863,
    -4.9714, -4.1138, -5.9107, -5.9786]], grad_fn=<SqueezeBackward1>)
    tensor([[-2.1025, -2.9121, -5.9192, -6.7459, -6.4667, -5.6418, -1.4504, -3.1943,
    -2.1024, -5.7470, -6.3381, -5.8520, -3.4871, -6.7667, -5.4711, -3.9885,
    -1.2502, -4.0869, -6.4930, -6.3751, -6.1309, -6.9721, -7.5558, -6.4056,
    -6.7456, -5.0527, -7.3854, -7.0440, -4.3720, -3.8936, -2.1085, -5.8211,
    -2.0906, -2.2184, 1.4268, -2.1026]], grad_fn=<SqueezeBackward1>)
    """
    # Convert the score tensors to numpy arrays
    start_token_scores = start_token_scores.detach().numpy().flatten()
    end_token_scores = end_token_scores.detach().numpy().flatten()
    """
    [-5.978666 -3.0541189 -7.7166095 -5.929051 -6.878973 -7.238004
    -1.8289301 -8.10058 -5.9786286 -3.9319289 -5.6229596 -4.191908
    -7.20684 -6.773916 -2.3959794 -5.942456 -5.6827617 -8.700695
    -4.265001 -8.09874 -8.083673 -7.179875 -7.7863474 -5.16046
    -8.283156 -5.108819 -8.1051235 -5.3984528 -6.7128663 -1.4108785
    -3.2240815 1.5863497 -4.9714 -4.113782 -5.9107194 -5.9786243]
    [-2.1025064 -2.912148 -5.9192414 -6.745929 -6.466673 -5.641759
    -1.4504088 -3.1943028 -2.1024144 -5.747039 -6.3380575 -5.852047
    -3.487066 -6.7667046 -5.471078 -3.9884708 -1.2501552 -4.0868535
    -6.4929943 -6.375147 -6.130891 -6.972091 -7.5557766 -6.405638
    -6.7455807 -5.0527067 -7.3854156 -7.043977 -4.37199 -3.8935976
    -2.1084964 -5.8210607 -2.0906193 -2.2184045 1.4268283 -2.1025767]
    """
    # Start and end index of the answer are the positions with the highest scores
    answer_start_index = np.argmax(start_token_scores)
    answer_end_index = np.argmax(end_token_scores)
    """
    31
    34
    """
    # Scores of the start and end tokens of the answer
    start_token_score = np.round(start_token_scores[answer_start_index], 2)
    end_token_score = np.round(end_token_scores[answer_end_index], 2)
    """
    1.59
    1.43
    """
    # Combine subwords starting with ## to get full words in the output;
    # the tokenizer breaks words that are not in its vocabulary into subwords
    answer = tokens[answer_start_index]
    for i in range(answer_start_index + 1, answer_end_index + 1):
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        else:
            answer += ' ' + tokens[i]
    # If no answer was found in the passage
    if (start_token_score < 0) or (answer_start_index == 0) or (answer_end_index < answer_start_index) or (answer == '[SEP]'):
        answer = "Sorry! I was unable to discover an answer in the passage."
    return (answer_start_index, answer_end_index, start_token_score, end_token_score, answer)
#Testing function
bert_question_answer("What is the name of YouTube Channel", "Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team ")
# Let me define one passage
passage = """Hello, I am Ishwar. My friend name is Ajay. He is the son of Kristen. I spend most of the time with Ajay.
He always call me by my nick name. Ajay call me programmer. Except Ajay, my other friend call me by my original name.
Bijay is also my friend. """
print (f'Length of the passage: {len(passage.split())} words')
question1 ="What is my name"
print ('\nQuestion 1:\n', question1)
_, _ , _ , _, ans = bert_question_answer( question1, passage)
print('\nAnswer from BERT: ', ans , '\n')
question2 ="Who is the father of Ajay"
print ('\nQuestion 2:\n', question2)
_, _ , _ , _, ans = bert_question_answer( question2, passage)
print('\nAnswer from BERT: ', ans , '\n')
question3 ="With whom does Ishwar spend the majority of his time?"
print ('\nQuestion 3:\n', question3)
_, _ , _ , _, ans = bert_question_answer( question3, passage)
print('\nAnswer from BERT: ', ans , '\n')
Here is another passage:
# Let me define another passage
passage= """"Natural language processing," or NLP, is the study of interactions between computers and humans using natural language. It is employed in the application of machine learning algorithms to text and speech. NLP can be used to develop systems such as speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing, and so on. Most of us now have cellphones with speech recognition. NLP is used by these devices to understand what is said. In addition, many people use laptops with operating systems that include built-in speech recognition. For creating Python programs that work with human language data, the Natural Language Toolkit (NLTK) is a well-liked framework. The list of text-processing tools also includes programs for categorization, tokenization, stemming, tagging, parsing, and semantic reasoning. The best part is that NLTK is a free, open source, community-driven project. We'll utilize this toolbox to demonstrate the fundamentals of the natural language processing discipline. I'll assume we've loaded the NLTK toolbox for the following examples. We can achieve this by importing nltk.
Sentence tokenization (also known as sentence segmentation) is the challenge of breaking down a string of written language into individual sentences. The concept appears to be extremely easy. The task of breaking a string of written language into its component words is known as word tokenization (also known as word segmentation). Space is a good approximation of a word divider in English and many other languages that use some type of Latin alphabet. However, we may still have issues if we merely divide by space to attain the desired results. Some English compound nouns are written differently, and some contain a space. In most cases, we use a library to obtain the desired effects, so don't sweat the minutiae. Stop words are words that are removed from the text before or after it is processed. When machine learning is applied to text, these words might create a lot of noise. That is why we aim to get rid of these superfluous terms.
Stop words are typically the most common terms in a language, such as "and," "the," and "a," but there is no single universal list of stopwords. The list of stop words may fluctuate depending on your application. The NLTK tool includes a predefined list of stopwords that correspond to the most frequently used words. If you are using it for the first time, you must download the stop words using the following code: nltk.download("stopwords"). Once the download is complete, we can load the stopwords package from nltk.corpus and utilize it to load the stop words."""
print (f'Length of the passage: {len(passage.split())} words')
question ="What is full form of NLTK"
print ('\nQuestion 1:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What are stop words "
print ('\nQuestion 2:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is NLP "
print ('\nQuestion 3:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="How to get NLTK Stop Words"
print ('\nQuestion 4:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="How do smartphones interpret speech recognition"
print ('\nQuestion 5:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is Computer vision"
print ('\nQuestion 6:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is supervised learning"
print ('\nQuestion 7:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
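The passage above also mentions downloading and loading NLTK stop words. If you want to try that part yourself, here is a minimal sketch (assuming nltk is installed; the example sentence is mine):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")   # one-time download of the stop word lists
nltk.download("punkt")       # tokenizer models used by word_tokenize
stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example sentence showing stop word removal.")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)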
Final Question Answering application:
#@title Question-Answering Application { vertical-output: true }
#@markdown ---
question= "what is natural language processing" #@param {type:"string"}
passage = "NLP stands for Natural Language Processing. NLP is a branch of Artificial Intelligence (AI) that studies how machines understand human language. This is the complete playlist of Natural Language Processing. I have made several video related to this like Tokenizer, stop words, sequence to sequence model, Bert (Bi-directional Encoder Representation from Transformers. Bert is bidirectional." #@param {type:"string"}
#@markdown ---
_, _ , _ , _, ans = bert_question_answer( question, passage)
#@markdown Answer:
print(ans)
Conclusion
In this way, you can build complete NLP projects with an LSTM (next word prediction) and BERT (question answering). If you are looking for a complete course on NLP, you can click here.
I hope this post gives you an idea of how you can create your own NLP projects. If you have any questions, drop a comment in the comment section below. I will get back to you as soon as possible. Keep learning, mate.