This talk aims to introduce the audience to the world of Natural Language Processing (NLP). The talk itself consists of 3 blocks:
1. State of the art in NLP. What is being used today? Which problems can we solve and which can we not? Techniques commonly used in industry.
2. Introduction to basic concepts when facing an ML project with NLP: preprocessing, vectorization and embeddings (word2vec, fastText, basic techniques such as tf-idf, counting, etc.). Classifiers.
3. A small hands-on example with deployment using Azure Machine Learning.
3. Jesús A. Sánchez Méndez
Data Engineer
@zNk
jasanchez@plainconcepts.com
• Interested in:
• Machine Learning (NLP, productionizing models)
• Highly scalable architectures (based on Hadoop, Spark, etc.)
• Software Engineering (I really like C#)
• Gaming: PUBG, LoL
4. Eduardo Matallanas
Data Engineer
@ematde
ematallanas@plainconcepts.com
• Knowmad, working with data for a long time
• Interested in:
• AI applied to industry
• Research in robotics, new models and general uses of ML
• Love films and series
5. Things that you will learn today (hopefully)
1. What NLP is
2. How to face an NLP project
3. Different approaches to model NLP
4. Different techniques involved in the creation of an NLP model
5. Libs and frameworks that we use and have worked with
8. "Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation."
Noam Chomsky
16. Facing an NLP project? Steps to success
1. Data Collection & Assembly
2. Data Preprocessing
3. Data Exploration & Visualization
4. Model Building
5. Model Evaluation
22. Tokenization
Easy, huh?
What about this? "Jesus' car wasn't fined since he's a resident of that well-known neighborhood"
That's not as easy as: "Jesus' car was not fined since he is a resident of that neighborhood"
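A minimal sketch of the difference, assuming Python with NLTK and its punkt tokenizer data installed:

```python
# A naive whitespace split vs. a real tokenizer (NLTK's word_tokenize).
# Requires: pip install nltk, plus the "punkt" tokenizer data.
import nltk

nltk.download("punkt", quiet=True)

sentence = "Jesus' car wasn't fined since he's a resident of that well-known neighborhood"

# Naive approach: contractions and possessives stay glued to their words.
print(sentence.split())
# ["Jesus'", 'car', "wasn't", 'fined', ..., 'well-known', 'neighborhood']

# word_tokenize splits "wasn't" into "was"/"n't" and "he's" into "he"/"'s".
print(nltk.word_tokenize(sentence))
```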
25. Quite often used techniques
• Sentence to lower case
• Remove numbers (or even translate them to text)
• Remove punctuation (this is usually done in the tokenization stage)
• Trimming
• Remove stopwords
• Depending on the domain, remove sparse terms
Examples (see the sketch after this list):
• "It was a pleasure David" → "it was a pleasure david"
• "My 15 birthday" → "My fifteenth birthday"
• "The house, in Madrid, was smaller." → "The house in Madrid was smaller"
• "the car broke in front of the building" → "car broke front building"
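A minimal sketch wiring these techniques together, assuming NLTK's English stopword list; the helper name preprocess is ours:

```python
# A minimal preprocessing sketch covering the techniques listed above.
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower().strip()                          # lower case + trimming
    text = "".join(c for c in text if not c.isdigit())   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]     # remove stopwords

print(preprocess("The car broke in front of the building."))
# ['car', 'broke', 'front', 'building']
```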
28. Quick reminder: Machine Learning models?
Y = f(X)
• X: our input data
• Y: the output we want to predict
• f: the function itself is the model that we are going to use to predict the output
29. Does this make sense?
f(X) = X²    f("hey what's up?") = ?
Of course it doesn't…
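To make the slide's point literal, a toy illustration of ours:

```python
# A numeric model cannot consume raw text.
def f(x):
    return x ** 2

print(f(3))  # 9

# f("hey what's up?")  # would raise TypeError: text has to be turned
#                      # into numbers before a model like this can use it
```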
31. How can we build the model?
1. What is needed? Understand words (input)
2. What is expected? Labels, numbers, words, etc. (output)
We need a vocabulary!!!
Input → Model → Output
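As a sketch of what "building a vocabulary" can look like in practice, a bag-of-words encoding with scikit-learn's CountVectorizer (the library choice and the toy corpus are ours, not from the talk):

```python
# Build a vocabulary and a count-based (bag-of-words) representation.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the house in Madrid was smaller",
    "the car broke in front of the building",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per sentence
```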
34. Second approximation (corpus: WIKIPEDIA)
1. Lots of words → high representation of general information
2. Encode the data based on the order of appearance, e.g. a hashing function
3. Too large dimensionality → PCA for reduction
4. Clustering the data → K-means
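A minimal sketch of that four-step pipeline with scikit-learn; HashingVectorizer plays the hashing function, and TruncatedSVD stands in for PCA because it works directly on sparse matrices (all names and parameters here are our assumptions, not from the slides):

```python
# Hashing encoder -> dimensionality reduction -> K-means clustering.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = [
    "madrid is the capital of spain",
    "paris is the capital of france",
    "the car broke in front of the building",
]

pipeline = make_pipeline(
    HashingVectorizer(n_features=2**18),   # encode words via a hashing function
    TruncatedSVD(n_components=2),          # reduce the huge dimensionality
    KMeans(n_clusters=2, n_init=10),       # cluster the reduced vectors
)
labels = pipeline.fit_predict(docs)
print(labels)  # cluster id per document
```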
37. Word embedding
• Words or phrases are mapped to vectors of real numbers
• Captures relationships between words:
• CBOW: predict a word given the context
• Skip-gram: predict the context given a word
• Dense and low-dimensional vectors
• Their representation is learned from the usage of words
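A minimal training sketch with gensim's Word2Vec, assuming gensim 4.x; the toy corpus and the name w2v_model are ours:

```python
from gensim.models import Word2Vec

sentences = [
    ["spain", "borders", "portugal", "and", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 -> CBOW (predict a word from its context)
# sg=1 -> skip-gram (predict the context from a word)
w2v_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(w2v_model.wv["spain"])  # a dense, low-dimensional vector
```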
42. Word2Vec: CBOW? Skip-gram?
• Input: a corpus
• Approach: skip-gram
• Output: a dictionary [word, vector]
e.g. most similar words to "Spain":
Word      Distance
Portugal  0.83447
France    0.72341
Morocco   0.69847
Andorra   0.653456
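A table like the one above can be reproduced with gensim's most_similar, here reusing the w2v_model from the previous sketch (with a toy corpus the numbers will of course differ from the slide's Wikipedia-scale results):

```python
for word, similarity in w2v_model.wv.most_similar("spain", topn=4):
    print(f"{word:<10} {similarity:.5f}")
```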
43. fastText
• Created by the Facebook AI Research lab
• Pretrained models for 294 languages
• Extracts a relationship even if the word is not in the corpus
• Enriches an existing model by assigning more weight to your own trained vectors
• Very fast library
• Allows training with CBOW or skip-gram
• Provides classification and clustering capabilities
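A minimal sketch with gensim's FastText class, which mirrors the Word2Vec API; the standalone fasttext library additionally offers the supervised classification mentioned above. The corpus and the name ft_model are ours:

```python
from gensim.models import FastText

sentences = [
    ["spain", "borders", "portugal", "and", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg picks CBOW (0) or skip-gram (1), just like Word2Vec.
ft_model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(ft_model.wv["spain"])  # dense vector, built from character n-grams
```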
44. Nice, this is a solution for everything!
• Indeed, whatever floats your boat
• Quite often the embedding is used to feed another model:
• Classifier or regression model
• Deep neural network
Input → Word Embedding → Model (?) → Output
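One common way to "feed another model", sketched under our own assumptions: represent each sentence as the average of its word vectors (reusing ft_model from above) and hand that to a scikit-learn classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens):
    # Average the embedding of every token in the sentence.
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)

train = [(["madrid", "is", "great"], 1), (["the", "car", "broke"], 0)]
X = np.array([sentence_vector(tokens) for tokens, _ in train])
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_vector(["spain", "is", "great"])]))
```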
45. Sounds good, but it has to have a disadvantage
What happens when you try to get the vector of a word that does not appear in the corpus?
[0,0,0,0,0,0,0,0,0] (the model needs retraining)
It is also coupled to the language
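A sketch of this failure mode, reusing w2v_model and ft_model from the earlier snippets; "lisbon" never appeared in the toy corpus:

```python
oov = "lisbon"

try:
    print(w2v_model.wv[oov])
except KeyError:
    # Plain word2vec has no entry at all: effectively an unknown/zero
    # vector, and the model needs retraining to cover the new word.
    print("word2vec: unknown word, retraining needed")

# fastText composes a vector from character n-grams, so it still answers.
print(ft_model.wv[oov])
```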
54. Conclusion
• Data and analysis are key
• NLP models are complex structures
• A general-purpose model does not exist
• Use frameworks and libraries to begin with something small and grow it
• Hopefully you are now an expert in NLP algorithms
Tokenization
Normalization
Stemming
Lemmatization
• set all characters to lowercase
• remove numbers (or convert numbers to textual representations)
• remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
• strip white space (also generally part of tokenization)
• remove default stop words (general English stop words)
• remove text file headers and footers
• remove HTML, XML, etc. markup and metadata (a sketch follows below)
• extract valuable data from other formats, such as JSON, or from within databases
• if you fear regular expressions, this could potentially be the part of text preprocessing in which your worst fears are realized
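A minimal sketch of the markup and format-extraction steps in that list, using only Python's standard library (the example strings are ours):

```python
import json
import re

raw = "<p>The <b>house</b> in Madrid was smaller.</p>"

text = re.sub(r"<[^>]+>", " ", raw)        # strip HTML/XML tags
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace
print(text)  # 'The house in Madrid was smaller.'

# Extracting valuable data from other formats, e.g. JSON:
record = json.loads('{"id": 1, "body": "the car broke in front of the building"}')
print(record["body"])
```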
Here I remind the audience that an ML model can be described by something as simple as a mathematical function.
https://skymind.ai/wiki/bagofwords-tf-idf
To understand the data better, a rich (high-dimensional) representation is needed.
Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Mention that this exists for any representation.
What should we do when new words appear?