NLP Essentials: Embedding Layer, Word2Vec, and GloVe
What will you learn in NLP Essentials?
Embedding layer
An embedding layer is a word embedding technique used in natural language processing. Keras provides an Embedding layer that can be used with neural networks on text data, and it is typically defined as the first hidden layer of the network.
This layer takes three arguments:
1. input_dim: the size of the vocabulary in text data
2. output_dim: the size of the vector space in which the words will be embedded
3. input_length: length of the input sequences
The embedding layer is initialized with random weights and then learns an embedding for each word in the training dataset. Other word embedding techniques include Word2Vec and GloVe.
import tensorflow as tf
tf.__version__
reviews=['awesome movie',
'not good',
'I loved this movie',
'Good direction',
'awful',
'rocks',
'Never coming back',
'bad story'
]
### Vocabulary size
voc_size=5000
# One hot representation
from tensorflow.keras.preprocessing.text import one_hot
onehot=[one_hot(words,voc_size) for words in reviews]
print(onehot)
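Note that despite its name, one_hot returns integer indexes, not one-hot vectors: it hashes each word to an integer in the range [1, voc_size), so two different words can occasionally collide on the same index. Choosing a voc_size much larger than the real vocabulary keeps collisions unlikely.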
# Embedding Layer
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
max_length=4
# Pad every encoded review to the same length; shorter reviews get trailing zeros
padded_docs=pad_sequences(onehot,padding='post',maxlen=max_length)
print(padded_docs)
model=Sequential()
model.add(Embedding(input_dim=voc_size,output_dim=10,input_length=max_length))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
# Each review becomes a (max_length, 10) matrix of word vectors
print(model.predict(padded_docs))
print(model.predict(padded_docs)[0])
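Since there are eight padded reviews of length four and each word is mapped to a 10-dimensional vector, the prediction should have shape (8, 4, 10). A quick check (the output variable name is just illustrative):

output = model.predict(padded_docs)
print(output.shape)  # expected: (8, 4, 10)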
Word2Vec in NLP | Natural Language Processing
Word2Vec is a method for converting words into vector representations. The technique was published in 2013 and comes in two flavors: CBOW (Continuous Bag of Words) and skip-gram. CBOW predicts a word from its surrounding context, whereas skip-gram predicts the context from a word.
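For example, gensim's Word2Vec exposes both variants through the sg argument; here is a minimal sketch on a toy tokenized corpus (the variable names are just illustrative):

from gensim.models import Word2Vec
toy_corpus = [['awesome', 'movie'], ['not', 'good']]
cbow_model = Word2Vec(toy_corpus, min_count=1, sg=0)      # sg=0 selects CBOW (the default)
skipgram_model = Word2Vec(toy_corpus, min_count=1, sg=1)  # sg=1 selects skip-gram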
Now, what's the difference between GloVe and Word2Vec? GloVe (Global Vectors) is another technique for obtaining vector representations of words, and pre-trained GloVe vectors are commonly distributed as plain-text files. GloVe first builds a large word-by-word co-occurrence matrix over the entire corpus and then factorizes it, whereas Word2Vec streams through the corpus and processes each co-occurrence (context window) separately. As a result, GloVe takes more memory, whereas Word2Vec takes more time to train. Whether to choose GloVe or Word2Vec depends on your project.
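To make the co-occurrence idea concrete, here is a minimal sketch (not the actual GloVe implementation) that counts word-by-word co-occurrences within a window of one for a toy corpus:

from collections import defaultdict
toy_corpus = [['i', 'loved', 'this', 'movie'], ['awesome', 'movie']]
window = 1
cooc = defaultdict(int)
for sentence in toy_corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[(word, sentence[j])] += 1
print(dict(cooc))  # GloVe factorizes a (much larger) matrix of such counts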
To install gensim:
pip install gensim
import nltk
import re
from gensim.models import Word2Vec
from nltk.corpus import stopwords
# Required package from nltk
nltk.download('stopwords')
nltk.download('punkt')
Paragraph= """Frequently Asked Questions
The Basics
What is Colaboratory?
Colaboratory, or "Colab" for short, is a product of Google Research. Colab excels in three areas: machine learning, teaching, and data analysis. Anyone may write and run arbitrary Python code through the browser. In terms of technology, Colab is a hosted Jupyter notebook service that provides free access to computing resources, including GPUs, and doesn't require any setup to use.
Is it really free to use?
Yes. Colab is free to use.
Seems too good to be true. What are the limitations?
Colab's resources are neither endless nor guaranteed, and the consumption caps occasionally change. This is required in order for Colab to offer resources for free. For more details, see Resource Limits.
Colab Pro can be of interest to users who want more dependable access to greater resources.
What is the difference between Jupyter and Colab?
Colab is built on the open-source project Jupyter. You can use and share Jupyter Notebooks with others using Colab without having to download, install, or run anything.
Using Colab
What happened to my notebooks, and can I share them?
Colab notebooks are saved in Google Drive and can also be imported from GitHub. Similar to how Google Docs or Sheets may be shared, Colab notebooks can as well. Simply utilize the Google Drive file sharing instructions or click the Share icon in the upper right corner of any Colab notebook.
What will be shared if I share my notebook?
The complete contents of your notebook (text, code, output, and comments) are shared when you share a notebook. By selecting Edit > Notebook settings > Omit code cell output when saving this notebook, you can prevent saving or sharing code cell output. The virtual machine you're using, as well as any custom files and libraries you've installed, will not be shared. As a result, it's a good idea to include cells that install and load any custom libraries or files required by your notebook.
Is it possible to import an existing Jupyter/IPython notebook into Colab?
Yes. From the File menu, select "Upload notebook".
How can I search Colab notebooks?
You can search Colab notebooks using Google Drive. All notebooks in Drive can be viewed by clicking on the Colab logo at the top left of the notebook view. You may also use File > Open notebook to look for notebooks you've recently opened.
Where is my code executed? When I close the browser window, what happens to my execution state?
The code is run in a virtual machine dedicated to your account. Virtual machines are removed after a period of inactivity, and the Colab service enforces a maximum lifetime.
How can I get my data out?
You can download any Colab notebook you've made from Google Drive using these methods, or from the File menu in Colab. The open source Jupyter notebook format (.ipynb) is used to store all Colab notebooks.
How can I reset the virtual machine(s) on which my code runs, and why is this occasionally unavailable?
Select Runtime > Factory reset runtime to restore the original state of all managed virtual machines allocated to you. This can be useful when a virtual machine has become unhealthy, for example, due to an inadvertent overwrite of system files or the installation of incompatible applications. To avoid excessive resource consumption, Colab limits how frequently this can be done. Please try again later if an attempt fails.
Why does drive.mount() occasionally fail with a "timed out" error, and why do I/O operations in drive.mount()-mounted folders occasionally fail?
When the number of files or subfolders in a folder grows too large, Google Drive operations may time out. If the top-level "My Drive" folder contains thousands of items, mounting the drive will most likely fail. Failed attempts cache partial state locally before stalling out, so repeated attempts may eventually succeed. If you run into this issue, try moving files and folders directly contained in "My Drive" into subfolders. A similar situation can arise when reading from other directories after a successful drive.mount(). Accessing items in any folder containing many items can result in errors such as OSError: [Errno 5] Input/output error. Again, you can resolve this issue by moving directly enclosed items into subfolders.
It's worth noting that "deleting" files or subfolders by transferring them to the Trash may not be adequate; if that doesn't appear to help, Empty your Trash.
Why do Drive operations occasionally fail owing to quota constraints?
Google Drive imposes several restrictions, such as per-user and per-file operation counts and bandwidth constraints. Exceeding these restrictions will result in the same Input/Output error as described above, as well as a message in the Colab UI. Accessing a popular shared file, or accessing too many separate files too soon, is a common reason. Workarounds include:
To prevent other users from exceeding its limits, copy the file using drive.google.com and don't distribute it around.
Choose to copy data from Drive to the Colab VM in an archive format (such as .zip or .tar.gz files) and unarchive the data locally on the VM rather than in the mounted Drive directory to avoid doing several short I/O reads.
Wait till the quotas reset the following day.
Why do storage quota-related drive activities occasionally fail?
The amount of data that each user can save in Google Drive is restricted. If you need to free up space and Google Drive operations are failing with input/output errors or you receive a message that your storage quota has been exceeded, remove some files using drive.google.com and empty your trash. The reclaimed space might not be accessible in Colab right away.
Visit Google Drive to acquire additional Drive space. Please take note that purchasing more Drive space will not result in more disk space becoming accessible for Colab VMs. Subscribing to Colab Pro will.
Resource Limits
Why aren’t resources guaranteed in Colab?
Colab must continue to be adaptable enough to change usage restrictions and hardware availability as needed in order to provide computing resources without charge. The resources that Colab has available change over time to meet changing demand, as well as to support overall growth and other variables.
The resource restrictions prevent some users from doing everything they wish to do in Colab. Many users have expressed a desire for faster GPUs, longer-lasting notebooks, more RAM, greater usage caps, and less erratic usage patterns. The first step we are taking to assist consumers who want more out of Colab is introducing Colab Pro. Our long-term objective is to maintain a free version of Colab while expanding sustainably to serve our users' requirements. Please sample Colab Pro and let us know what you think if you're interested in doing more with Colab than the resource restrictions of the free edition of Colab permit.
What are the usage limits of Colab?
Colab is able to offer free resources in part because it uses dynamic usage caps that occasionally change and does not guarantee or supply unlimited resources. This implies that overall utilization caps, idle timeout durations, maximum VM lifetimes, available GPU kinds, and other parameters change over time. These restrictions are not made public by Colab in part because they can—and occasionally do—change quickly.
Sometimes, users who use Colab interactively rather than for lengthy computations or users who have recently utilized fewer resources in Colab have priority access to GPUs and TPUs. As a result, users who use Colab for lengthy calculations or users who have recently accessed more resources in Colab are more likely to hit utilization caps and have temporary access restrictions placed on GPU and TPU usage. Colab's user interface may be used in conjunction with a local runtime running on the user's own hardware if they have high computing requirements. Colab Pro may be of interest to users who want larger and more reliable use limitations.
What types of GPUs are available in Colab?
The GPU types available in Colab vary over time. This is necessary for Colab to be able to offer free access to these resources. Nvidia K80s, T4s, P4s, and P100s are frequently among the GPUs offered in Colab. You cannot select which kind of GPU you connect to in Colab at any given time. Colab Pro might be of interest to those who want more dependable access to Colab's fastest GPUs.
It should be noted that using Colab to mine cryptocurrencies is completely prohibited and may result in your account being completely blocked from using Colab.
How long can notebooks run in Colab?
Notebooks run by connecting to virtual machines, which have a maximum lifetime of up to 12 hours. Notebooks will also disconnect from virtual machines when left idle for too long. The maximum VM lifetime and the idle timeout behavior may change over time or depending on your usage. This is necessary in order for Colab to provide free computational resources. Colab Pro might be of interest to users who want longer VM lifetimes and more stable idle timeout behavior.
How much memory is available in Colab?
The amount of memory available to Colab virtual machines varies over time (but remains constant for the lifetime of the VM). Adjusting memory over time allows Colab to remain free. When Colab determines that you are likely to need more memory, you might occasionally be automatically assigned a VM with more of it. Users who want Colab to run more reliably and with more memory may be interested in Colab Pro.
How can Colab best serve my needs?
In order to avoid a small number of users monopolizing scarce resources, Colab prioritizes resources for users who have recently consumed fewer resources. Consider closing your Colab tabs when you are finished with your work in order to get the most out of Colab. You should also avoid choosing a GPU if it is not necessary for your job. You will be less likely to encounter use caps in Colab as a result of this. Colab Pro may be of interest to users who want to use more resources than are permitted by the free edition of Colab.
I noticed a message saying my GPU is not being used. What should I do?
Colab offers optional accelerated computing environments, namely GPUs and TPUs. Executing code in a GPU or TPU runtime does not necessarily mean the GPU or TPU is being used. If you are not using the GPU, we advise switching to a standard runtime to avoid exceeding your GPU usage limits. Select Runtime > Change Runtime Type and set Hardware Accelerator to None.
See the Tensorflow With GPU and TPUs In Colab example notebooks for demonstrations of how to use GPU and TPU runtimes in Colab.
Additional Questions
What browsers are supported?
Colab works with most major browsers and is most thoroughly tested with the latest versions of Chrome, Firefox, and Safari.
How is this related to colaboratory.jupyter.org?
We collaborated with the Jupyter development team in 2014 to produce an early prototype of the tool. Colab has since continued to develop, guided by internal usage.
What about other programming languages?
Colab's main focus is Python and its ecosystem of complementary tools. We are aware that users would like support for additional Jupyter kernels, such as R or Scala. We would like to support these, although we have no ETA at this time.
I found a bug or have a question, who do I contact?
Open any Colab notebook. Then choose "Send feedback..." from the Help menu.
Why prompt to enable third-party cookies?
To display rich outputs securely, Colab leverages HTML iframes and service workers hosted on different origins. To use the service workers within iframes, browsers need to have third-party cookies enabled. If you choose not to enable third-party cookies for all websites, you can allow the hostname googleusercontent.com in your browser's settings.
How do I change the editor font?
The editor in Colab uses a simple monospace font. In most contemporary browsers, you can specify which font family is used for monospace. Here are a few common ways:
To configure the "Monospace" font in Firefox, follow the instructions in the Firefox support manuals.
In Chrome, navigate to "chrome://settings/fonts" and modify the section labeled "Fixed-width font".
Does Colab support Python 2?
According to the Python development team, Python 2 will no longer get security updates or bug fixes after January 1st, 2020. Colab is gradually discontinuing support for Python 2 notebooks and has stopped updating the Python 2 runtimes. We advise switching crucial notebooks to Python 3.
To change your Python 2 notebook's runtime to Python 3, choose Runtime > Change Runtime Type and select Python 3. Changing the runtime from Python 3 to Python 2 is not supported. See Porting Python 2 Code to Python 3 for instructions on making the switch from Python 2 to Python 3.
Where can I learn more about Colab Pro?
There is an FAQ for Colab Pro on the Colab Pro sign-up page.
"""
# Preprocessing the data and preparing the dataset
sentences = nltk.sent_tokenize(Paragraph)
final_text = []
stop_words = set(stopwords.words('english'))  # build the stopword set once
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [word for word in review if word not in stop_words]
    final_text.append(review)
print(final_text)
print(len(final_text))
# Training the Word2Vec model
model = Word2Vec(final_text)
print(model)
# Finding the vocabulary
# gensim >= 4.0 exposes the vocabulary as model.wv.key_to_index
# (older gensim 3.x releases used model.wv.vocab instead)
vocab = model.wv.key_to_index
print(vocab.keys())
# Finding Word Vectors
vector = model.wv['colab']
print(vector)
# Most similar words
similar = model.wv.most_similar('notebooks')
similar
# Finding the similarity score between two words in the vocabulary
model.wv.similarity(w1="jupyter",w2="notebooks")
model.wv.similarity(w1="notebooks",w2="notebooks")  # a word is always perfectly similar to itself (score 1.0)
print(len(vocab))
Conclusion: The vocabulary here is small, but for larger documents the vocabulary grows, and the Word2Vec model lets us find the most similar words in it.
GloVe (Global Vectors for Word Representation) | Natural Language Processing
import numpy as np
import pandas as pd
import os
os.getcwd()
os.chdir(r'Desktop\Code\VS code')  # raw string so the backslashes are not treated as escape sequences
os.getcwd()
os.listdir('glove.6B')
def read_data(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        word_vocab = set()
        word2vector = dict()
        for line in f:
            line = line.strip()  # remove surrounding whitespace
            words_Vec = line.split()
            word_vocab.add(words_Vec[0])  # the first token on each line is the word
            word2vector[words_Vec[0]] = np.array(words_Vec[1:], dtype=float)  # the rest is its vector
    print("Total Words in DataSet:", len(word_vocab))
    return word_vocab, word2vector
vocab, w2v = read_data("glove.6B/glove.6B.50d.txt")
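Each line of glove.6B.50d.txt holds a word followed by the 50 components of its vector, so a quick sanity check after loading:

print(len(w2v['king']))  # should print 50 for the 50d file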
# Cosine-similarity
def cos_sim(u, v):
    """
    u: vector of 1st word
    v: vector of 2nd word
    """
    numerator = np.dot(u, v)
    denominator = np.sqrt(np.sum(np.square(u))) * np.sqrt(np.sum(np.square(v)))
    return numerator / denominator
print("Similarity Score of King and Queen",cos_sim(w2v['king'],w2v['queen']))
print("Similarity Score of Father and Apple",cos_sim(w2v['father'],w2v['apple']))
print("Similarity Score of Man and Woman",cos_sim(w2v['man'],w2v['woman']))
print("Similarity Score of Clothes and Shoes",cos_sim(w2v['clothes'],w2v['shoes']))
print("Similarity Score of Water and USA",cos_sim(w2v['water'],w2v['usa']))
Conclusion
This post covered the NLP essentials: the embedding layer, Word2Vec, and GloVe. These are all very important concepts in Natural Language Processing. If you are looking for a complete course on NLP, click here.
I hope this post is helpful to you. Visit my channel IG Tech Team for more content, including other NLP essentials. If you have any questions, ask me in the comment section and I will reply as soon as possible. Keep learning.