NLP Projects: Next Word Prediction and Question Answering
Let's build some NLP projects.
What will you learn?
Next word prediction:
Given the first three words of a sentence, the model should be able to guess the fourth. I'm using a Long Short-Term Memory (LSTM) network for this.
Note: LSTM is a type of Recurrent Neural Network (RNN) architecture used in deep learning. A typical LSTM unit consists of an input gate, an output gate, a forget gate, and a cell state.
For more information, click here
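To make the note above concrete, here is a tiny NumPy sketch of a single LSTM step (illustrative only; Keras implements all of this inside its LSTM layer). The weight matrices below are random placeholders, not trained values, and the helper names are mine, not part of any library.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the input (i), forget (f), output (o)
    # gates and the candidate cell state (g)
    i = sigmoid(x @ W['i'] + h_prev @ U['i'] + b['i'])   # input gate
    f = sigmoid(x @ W['f'] + h_prev @ U['f'] + b['f'])   # forget gate
    o = sigmoid(x @ W['o'] + h_prev @ U['o'] + b['o'])   # output gate
    g = np.tanh(x @ W['g'] + h_prev @ U['g'] + b['g'])   # candidate cell state
    c = f * c_prev + i * g     # new cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c
# Toy usage with 4-dimensional inputs and states
dim = 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(dim, dim)) for k in 'ifog'}
U = {k: rng.normal(size=(dim, dim)) for k in 'ifog'}
b = {k: np.zeros(dim) for k in 'ifog'}
h, c = lstm_step(rng.normal(size=dim), np.zeros(dim), np.zeros(dim), W, U, b)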
1: Import libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import pickle
import numpy as np
import os
2: Load your file
from google.colab import files
uploaded = files.upload()
3: Open and pre-process the data
file = open("Pride and Prejudice.txt", "r", encoding="utf8")
# store file in list
lines = []
for i in file:
    lines.append(i)
# Convert list to string
data = ' '.join(lines)
# remove unnecessary characters: newline, carriage return, BOM, curly quotes
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '').replace('“', '').replace('”', '')
# remove unnecessary spaces
data = data.split()
data = ' '.join(data)
data[:500]
len(data)
4: Implement tokenization and make additional adjustments
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# saving the tokenizer for predict function
pickle.dump(tokenizer, open('token.pkl', 'wb'))
sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:15]
len(sequence_data)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
sequences = []
for i in range(3, len(sequence_data)):
    words = sequence_data[i-3:i+1]
    sequences.append(words)
print("The length of sequences is: ", len(sequences))
sequences = np.array(sequences)
sequences[:10]
X = []
y = []
for i in sequences:
    X.append(i[0:3])
    y.append(i[3])
X = np.array(X)
y = np.array(y)
print("Data: ", X[:10])
print("Response: ", y[:10])
y = to_categorical(y, num_classes=vocab_size)
y[:5]
5: Creating the model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=3))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))
model.summary()
6: Plot the model
from tensorflow import keras
keras.utils.plot_model(model, to_file='plot.png', show_layer_names=True)
7: Train the model
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint("next_words.h5", monitor='loss', verbose=1, save_best_only=True)
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
model.fit(X, y, epochs=70, batch_size=64, callbacks=[checkpoint])
8: Let's predict
from tensorflow.keras.models import load_model
import numpy as np
import pickle
# Load the model and tokenizer
model = load_model('next_words.h5')
tokenizer = pickle.load(open('token.pkl', 'rb'))
def Predict_Next_Words(model, tokenizer, text):
    # Convert the last three words to their token ids and predict the id of the next word
    sequence = tokenizer.texts_to_sequences([text])
    sequence = np.array(sequence)
    preds = np.argmax(model.predict(sequence))
    predicted_word = ""
    # Look up the word that corresponds to the predicted id
    for key, value in tokenizer.word_index.items():
        if value == preds:
            predicted_word = key
            break
    print(predicted_word)
    return predicted_word
while True:
    text = input("Enter your line: ")
    if text == "0":
        print("Execution completed.....")
        break
    else:
        try:
            # Use only the last three words of the input line
            text = text.split(" ")
            text = text[-3:]
            print(text)
            Predict_Next_Words(model, tokenizer, text)
        except Exception as e:
            print("Error occurred: ", e)
            continue
Final output:
Enter your line: The Project Gutenberg (our input)
['The', 'Project', 'Gutenberg'] (takes the last three words of our input)
literary (return a next word)
Enter your line: The Project Gutenberg eBook of
['Gutenberg', 'eBook', 'of']
pride
Enter your line: how can you abuse your own
['abuse', 'your', 'own']
children
Enter your line: He was quite
['He', 'was', 'quite']
young
Enter your line: He could not help seeing that you were about five times as
['five', 'times', 'as']
pretty
Enter your line: and her sister
['and', 'her', 'sister']
scarcely
Enter your line: however, it may all come to
['all', 'come', 'to']
nothing
Enter your line: 0 (to stop execution)
Execution completed.....
Question Answering using BERT | Natural Language Processing
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained, transformer-based language model that reads text in both directions at once, which makes it well suited for tasks such as question answering. Here we use a version of BERT that has already been fine-tuned on the SQuAD question-answering dataset.
Now let's move to the Question Answering project using BERT:
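Before building everything by hand, note that the same fine-tuned model can also be queried through the high-level pipeline API from transformers. This is just a minimal sketch of what we are about to reproduce step by step; the rest of this section builds the same logic manually.
from transformers import pipeline
# The pipeline wraps tokenization, the model forward pass, and answer extraction.
# It downloads the same SQuAD-fine-tuned BERT model used below.
qa = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa(question="What is the name of YouTube Channel",
            context="Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team")
print(result["answer"], result["score"])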
1: Install Required Libraries
pip install transformers
pip install torch
2: Importing required packages
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import torch
import numpy as np
3: Load the pre-trained Bert model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer_for_bert = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
4: Define the function for question-answering
def bert_question_answer(question, passage, max_len=500):
    """
    question: What is the name of YouTube Channel
    passage: Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team
    """
    # Tokenize the question and passage together;
    # the tokenizer adds the special tokens [CLS] and [SEP]
    input_ids = tokenizer_for_bert.encode(question, passage, max_length=max_len, truncation=True)
    """
    [101, 2054, 2003, 1996, 2171, 1997, 7858, 3149, 102, 3422, 3143, 2377, 9863, 1997, 3019, 2653, 6364, 1012,
    2123, 1005, 1056, 5293, 2000, 2066, 1010, 3745, 1998, 4942, 29234, 2026, 3149, 1045, 2290, 6627, 2136, 102]
    """
    # Number of tokens in the 1st sentence (question) and 2nd sentence (passage that contains the answer)
    sep_index = input_ids.index(102)          # 102 is the id of the [SEP] token
    len_question = sep_index + 1
    len_passage = len(input_ids) - len_question
    """
    8
    9
    27
    """
    # Separate question and passage:
    # segment ids are 0 for the question and 1 for the passage
    segment_ids = [0]*len_question + [1]*len_passage
    """
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    """
    # Convert token ids back to tokens
    tokens = tokenizer_for_bert.convert_ids_to_tokens(input_ids)
    """
    tokens = ['[CLS]', 'what', 'is', 'the', 'name', 'of', 'youtube', 'channel', '[SEP]', 'watch', 'complete',
    'play', '##list', 'of', 'natural', 'language', 'processing', '.', 'don', "'", 't', 'forget', 'to', 'like',
    ',', 'share', 'and', 'sub', '##scribe', 'my', 'channel', 'i', '##g', 'tech', 'team', '[SEP]']
    """
    # Get start and end scores for the answer;
    # convert the input lists to torch tensors before passing them to the model
    outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    start_token_scores = outputs[0]
    end_token_scores = outputs[1]
    """
    tensor([[-5.9787, -3.0541, -7.7166, -5.9291, -6.8790, -7.2380, -1.8289, -8.1006,
    -5.9786, -3.9319, -5.6230, -4.1919, -7.2068, -6.7739, -2.3960, -5.9425,
    -5.6828, -8.7007, -4.2650, -8.0987, -8.0837, -7.1799, -7.7863, -5.1605,
    -8.2832, -5.1088, -8.1051, -5.3985, -6.7129, -1.4109, -3.2241, 1.5863,
    -4.9714, -4.1138, -5.9107, -5.9786]], grad_fn=<SqueezeBackward1>)
    tensor([[-2.1025, -2.9121, -5.9192, -6.7459, -6.4667, -5.6418, -1.4504, -3.1943,
    -2.1024, -5.7470, -6.3381, -5.8520, -3.4871, -6.7667, -5.4711, -3.9885,
    -1.2502, -4.0869, -6.4930, -6.3751, -6.1309, -6.9721, -7.5558, -6.4056,
    -6.7456, -5.0527, -7.3854, -7.0440, -4.3720, -3.8936, -2.1085, -5.8211,
    -2.0906, -2.2184, 1.4268, -2.1026]], grad_fn=<SqueezeBackward1>)
    """
    # Convert the score tensors to numpy arrays
    start_token_scores = start_token_scores.detach().numpy().flatten()
    end_token_scores = end_token_scores.detach().numpy().flatten()
    """
    [-5.978666 -3.0541189 -7.7166095 -5.929051 -6.878973 -7.238004
    -1.8289301 -8.10058 -5.9786286 -3.9319289 -5.6229596 -4.191908
    -7.20684 -6.773916 -2.3959794 -5.942456 -5.6827617 -8.700695
    -4.265001 -8.09874 -8.083673 -7.179875 -7.7863474 -5.16046
    -8.283156 -5.108819 -8.1051235 -5.3984528 -6.7128663 -1.4108785
    -3.2240815 1.5863497 -4.9714 -4.113782 -5.9107194 -5.9786243]
    [-2.1025064 -2.912148 -5.9192414 -6.745929 -6.466673 -5.641759
    -1.4504088 -3.1943028 -2.1024144 -5.747039 -6.3380575 -5.852047
    -3.487066 -6.7667046 -5.471078 -3.9884708 -1.2501552 -4.0868535
    -6.4929943 -6.375147 -6.130891 -6.972091 -7.5557766 -6.405638
    -6.7455807 -5.0527067 -7.3854156 -7.043977 -4.37199 -3.8935976
    -2.1084964 -5.8210607 -2.0906193 -2.2184045 1.4268283 -2.1025767]
    """
    # Start and end index of the answer are the positions with the highest scores
    answer_start_index = np.argmax(start_token_scores)
    answer_end_index = np.argmax(end_token_scores)
    """
    31
    34
    """
    # Scores of the start and end tokens of the answer
    start_token_score = np.round(start_token_scores[answer_start_index], 2)
    end_token_score = np.round(end_token_scores[answer_end_index], 2)
    """
    1.59
    1.43
    """
    # Combine subwords starting with ## to get full words in the output;
    # the tokenizer breaks words that are not in its vocabulary into subwords
    answer = tokens[answer_start_index]
    for i in range(answer_start_index + 1, answer_end_index + 1):
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        else:
            answer += ' ' + tokens[i]
    # If no answer was found in the passage
    if (start_token_score < 0) or (answer_start_index == 0) or (answer_end_index < answer_start_index) or (answer == '[SEP]'):
        answer = "Sorry! I was unable to discover an answer in the passage."
    return (answer_start_index, answer_end_index, start_token_score, end_token_score, answer)
#Testing function
bert_question_answer("What is the name of YouTube Channel", "Watch complete playlist of Natural Language Processing. Don't forget to like, share and subscribe my channel IG Tech Team ")
# Let me define one passage
passage = """Hello, I am Ishwar. My friend name is Ajay. He is the son of Kristen. I spend most of the time with Ajay.
He always call me by my nick name. Ajay call me programmer. Except Ajay, my other friend call me by my original name.
Bijay is also my friend. """
print (f'Length of the passage: {len(passage.split())} words')
question1 ="What is my name"
print ('\nQuestion 1:\n', question1)
_, _ , _ , _, ans = bert_question_answer( question1, passage)
print('\nAnswer from BERT: ', ans , '\n')
question2 ="Who is the father of Ajay"
print ('\nQuestion 2:\n', question2)
_, _ , _ , _, ans = bert_question_answer( question2, passage)
print('\nAnswer from BERT: ', ans , '\n')
question3 ="With whom does Ishwar spend the majority of his time?"
print ('\nQuestion 3:\n', question3)
_, _ , _ , _, ans = bert_question_answer( question3, passage)
print('\nAnswer from BERT: ', ans , '\n')
Here is another passage:
# Let me define another passage
passage= """"Natural language processing," or NLP, is the study of interactions between computers and humans using natural language. It is employed in the application of machine learning algorithms to text and speech. NLP can be used to develop systems such as speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing, and so on. Most of us now have cellphones with speech recognition. NLP is used by these devices to understand what is said. In addition, many people use laptops with operating systems that include built-in speech recognition. For creating Python programs that work with human language data, the Natural Language Toolkit (NLTK) is a well-liked framework. The list of text-processing tools also includes programs for categorization, tokenization, stemming, tagging, parsing, and semantic reasoning. The best part is that NLTK is a free, open source, community-driven project. We'll utilize this toolbox to demonstrate the fundamentals of the natural language processing discipline. I'll assume we've loaded the NLTK toolbox for the following examples. We can achieve this by importing nltk.
Sentence tokenization (also known as sentence segmentation) is the challenge of breaking down a string of written language into individual sentences. The concept appears to be extremely easy. The task of breaking a string of written language into its component words is known as word tokenization (also known as word segmentation). Space is a good approximation of a word divider in English and many other languages that use some type of Latin alphabet. However, we may still have issues if we merely divide by space to attain the desired results. Some English compound nouns are written differently, and some contain a space. In most cases, we use a library to obtain the desired effects, so don't sweat the minutiae. Stop words are words that are removed from the text before or after it is processed. When machine learning is applied to text, these words might create a lot of noise. That is why we aim to get rid of these superfluous terms.
Stop words are typically the most common terms in a language, such as "and," "the," and "a," but there is no single universal list of stopwords. The list of stop words may fluctuate depending on your application. The NLTK tool includes a predefined list of stopwords that correspond to the most frequently used words. If you are using it for the first time, you must download the stop words using the following code: nltk.download("stopwords"). Once the download is complete, we can load the stopwords package from nltk.corpus and utilize it to load the stop words."""
print (f'Length of the passage: {len(passage.split())} words')
question ="What is full form of NLTK"
print ('\nQuestion 1:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What are stop words "
print ('\nQuestion 2:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is NLP "
print ('\nQuestion 3:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="How to get NLTK Stop Words"
print ('\nQuestion 4:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="How do smartphones interpret speech recognition"
print ('\nQuestion 5:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is Computer vision"
print ('\nQuestion 6:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
question ="What is supervised learning"
print ('\nQuestion 7:\n', question)
_, _ , _ , _, ans = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans , '\n')
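The passage above also mentions downloading and loading NLTK stop words. If you want to try that part yourself, here is a minimal sketch (assuming nltk is installed; the example sentence is mine):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")   # one-time download of the stop word lists
nltk.download("punkt")       # tokenizer models used by word_tokenize
stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example sentence showing stop word removal.")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)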
Final Question Answering application:
#@title Question-Answering Application { vertical-output: true }
#@markdown ---
question= "what is natural language processing" #@param {type:"string"}
passage = "NLP stands for Natural Language Processing. NLP is a branch of Artificial Intelligence (AI) that studies how machines understand human language. This is the complete playlist of Natural Language Processing. I have made several video related to this like Tokenizer, stop words, sequence to sequence model, Bert (Bi-directional Encoder Representation from Transformers. Bert is bidirectional." #@param {type:"string"}
#@markdown ---
_, _ , _ , _, ans = bert_question_answer( question, passage)
#@markdown Answer:
print(ans)
Conclusion
In this way, you can build complete NLP projects with an LSTM (next word prediction) and BERT (question answering). If you are looking for a complete course on NLP, you can click here.
I hope this post gives you an idea of how you can create your own NLP projects. If you have any questions, drop a comment in the comment section below. I will get back to you as soon as possible. Keep learning, mate.