This document introduces recent advances in natural language processing (NLP). Because raw text is unstructured and natural language is ambiguous, machines typically preprocess it with steps such as stemming and lemmatization, which reduce inflected words to a common base form. Word embeddings such as Word2Vec represent words as dense vectors, addressing the sparsity of one-hot encodings and their inability to capture similarity between words. Modern language models apply neural networks to these embeddings to predict the next word, and large models such as GPT-3, with billions of parameters trained on internet-scale text, achieve strong performance across a wide range of tasks. Looking ahead, NLP research aims to reach human-level ability on more tasks and in more domains, with performance continuing to improve as models and training data scale.
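The preprocessing step mentioned above can be illustrated with a toy sketch. The suffix rules and the lemma table below are invented for illustration and are far simpler than real tools such as the Porter stemmer or a dictionary-based lemmatizer:

```python
def toy_stem(word):
    """Crude suffix stripping: maps inflected forms to a shared stem.

    These rules are a hypothetical simplification, not a real stemmer.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A tiny hand-made lookup standing in for a dictionary-based lemmatizer.
TOY_LEMMAS = {"ran": "run", "running": "run", "better": "good", "mice": "mouse"}


def toy_lemmatize(word):
    """Prefer a dictionary lemma; fall back to the crude stemmer."""
    return TOY_LEMMAS.get(word, toy_stem(word))


print(toy_stem("jumping"))    # jump
print(toy_stem("jumped"))     # jump
print(toy_lemmatize("mice"))  # mouse
```

The last call shows why lemmatization matters: suffix stripping alone cannot map an irregular form like "mice" back to "mouse".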
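The limitation of one-hot encodings that embeddings address can be shown numerically. The dense vectors below are made-up numbers standing in for trained Word2Vec output, chosen only to illustrate the idea:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# One-hot vectors over a small vocabulary: every pair is orthogonal,
# so "cat" is no closer to "dog" than to "car".
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "car": [0, 0, 1]}
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0

# Hypothetical dense embeddings: related words get similar vectors.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

In a real embedding model, the dense coordinates are learned from co-occurrence statistics rather than assigned by hand, but the geometry is the same: similarity between words becomes measurable as similarity between vectors.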
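The next-word prediction objective that language models optimize can be sketched with a count-based bigram model. This is a deliberately minimal stand-in: real systems like GPT-3 replace the count table with a neural network over embeddings, but the prediction task is the same. The tiny corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# A toy corpus; real language models train on internet-scale text.
corpus = "the cat sat on the mat . the cat ran to the door .".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1


def predict_next(word):
    """Return the most frequent continuation seen in training."""
    return bigrams[word].most_common(1)[0][0]


print(predict_next("the"))  # cat  ("cat" follows "the" twice; "mat"/"door" once each)
```

Count-based models break down as context length grows, since most long word sequences never appear even in huge corpora; neural language models generalize across contexts by operating on embeddings instead of raw counts.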