This talk aims to introduce the audience to the world of Natural Language Processing (NLP). The talk itself consists of 3 blocks:
1. State of the art in NLP. What is being used today? Which problems can we solve and which can we not? Techniques commonly used in industry.
2. Introduction to basic concepts when facing an ML project with NLP: preprocessing, vectorization and embeddings (word2vec, fastText, basic techniques such as tf-idf, counting, etc.). Classifiers.
3. A small hands-on example with deployment using Azure Machine Learning.
3. Jesús A. Sánchez Méndez
Data Engineer
@zNk
jasanchez@plainconcepts.com
• Interested in:
• Machine Learning (NLP, productionizing models)
• Highly scalable architectures (based on Hadoop, Spark, etc.)
• Software Engineering (I really like C#)
• Gaming: PUBG, LoL
4. Eduardo Matallanas
Data Engineer
@ematde
ematallanas@plainconcepts.com
• Knowmad, working with data for a long time
• Interested in:
• AI applied to industry
• Research in robotics, new models and general uses of ML
• Love films and series
5. Things that you will learn today (hopefully)
1. What NLP is
2. How to face an NLP project
3. Different approaches to model NLP
4. Different techniques involved in the creation of an NLP model
5. Libs and frameworks that we use and have worked with
8. "Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation."
Noam Chomsky
16. Facing an NLP project? Steps to success
1. Data Collection & Assembly
2. Data Preprocessing
3. Data Exploration & Visualization
4. Model Building
5. Model Evaluation
22. Tokenization
Easy, huh?
What about this? "Jesus' car wasn't fined since he's a resident of that well-known neighborhood"
That's not as easy as: "Jesus' car was not fined since he is a resident of that neighborhood"
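A minimal sketch of the difference, assuming Python with NLTK and its punkt tokenizer data installed:

```python
# A naive whitespace split vs. a real tokenizer (NLTK's word_tokenize).
# Requires: pip install nltk, plus the "punkt" tokenizer data.
import nltk

nltk.download("punkt", quiet=True)

sentence = "Jesus' car wasn't fined since he's a resident of that well-known neighborhood"

# Naive approach: contractions and possessives stay glued to their words.
print(sentence.split())
# ["Jesus'", 'car', "wasn't", 'fined', ..., 'well-known', 'neighborhood']

# word_tokenize splits "wasn't" into "was"/"n't" and "he's" into "he"/"'s".
print(nltk.word_tokenize(sentence))
```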
25. Quite often used techniques
• Sentence to lower case
• Remove numbers (or even translate them to text)
• Remove punctuation (this is usually done in the tokenization stage)
• Trimming
• Remove stopwords
• Depending on the domain, remove sparse terms
Examples (see the sketch after this list):
• "It was a pleasure David" → "it was a pleasure david"
• "My 15 birthday" → "My fifteenth birthday"
• "The house, in Madrid, was smaller." → "The house in Madrid was smaller"
• "the car broke in front of the building" → "car broke front building"
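A minimal sketch wiring these techniques together, assuming NLTK's English stopword list; the helper name preprocess is ours:

```python
# A minimal preprocessing sketch covering the techniques listed above.
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower().strip()                          # lower case + trimming
    text = "".join(c for c in text if not c.isdigit())   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]     # remove stopwords

print(preprocess("The car broke in front of the building."))
# ['car', 'broke', 'front', 'building']
```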
28. Quick reminder: Machine Learning models?
Y = f(X)
• X: our input data
• Y: the output we want to predict
• f: the function itself is the model that we are going to use to predict the output
29. Does this make sense?
f(X) = X²    f("hey what's up?") = ?
Of course it doesn't…
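To make the slide's point literal, a toy illustration of ours:

```python
# A numeric model cannot consume raw text.
def f(x):
    return x ** 2

print(f(3))  # 9

# f("hey what's up?")  # would raise TypeError: text has to be turned
#                      # into numbers before a model like this can use it
```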
31. How can we build the model?
1. What is needed? Understand words (input)
2. What is expected? Labels, numbers, words, etc. (output)
We need a vocabulary!!!
Input → Model → Output
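As a sketch of what "building a vocabulary" can look like in practice, a bag-of-words encoding with scikit-learn's CountVectorizer (the library choice and the toy corpus are ours, not from the talk):

```python
# Build a vocabulary and a count-based (bag-of-words) representation.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the house in Madrid was smaller",
    "the car broke in front of the building",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per sentence
```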
34. Second approximation (corpus: WIKIPEDIA)
1. Lots of words → high representation of general information
2. Encode the data based on the order of appearance, e.g. a hashing function
3. Too large dimensionality → PCA for reduction
4. Clustering the data → K-means
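A minimal sketch of that four-step pipeline with scikit-learn; HashingVectorizer plays the hashing function, and TruncatedSVD stands in for PCA because it works directly on sparse matrices (all names and parameters here are our assumptions, not from the slides):

```python
# Hashing encoder -> dimensionality reduction -> K-means clustering.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = [
    "madrid is the capital of spain",
    "paris is the capital of france",
    "the car broke in front of the building",
]

pipeline = make_pipeline(
    HashingVectorizer(n_features=2**18),   # encode words via a hashing function
    TruncatedSVD(n_components=2),          # reduce the huge dimensionality
    KMeans(n_clusters=2, n_init=10),       # cluster the reduced vectors
)
labels = pipeline.fit_predict(docs)
print(labels)  # cluster id per document
```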
37. Word embedding
• Words or phrases are mapped to vectors of real numbers
• Captures relationships between words:
• CBOW: predict a word given the context
• Skip-gram: predict the context given a word
• Dense and low-dimensional vectors
• Their representation is learned from the usage of words
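A minimal training sketch with gensim's Word2Vec, assuming gensim 4.x; the toy corpus and the name w2v_model are ours:

```python
from gensim.models import Word2Vec

sentences = [
    ["spain", "borders", "portugal", "and", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 -> CBOW (predict a word from its context)
# sg=1 -> skip-gram (predict the context from a word)
w2v_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(w2v_model.wv["spain"])  # a dense, low-dimensional vector
```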
42. Word2Vec: CBOW? Skip-gram?
• Input: a corpus
• Approach: skip-gram
• Output: a dictionary [word, vector]
e.g. most similar words to "Spain":
Word      Distance
Portugal  0.83447
France    0.72341
Morocco   0.69847
Andorra   0.653456
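A table like the one above can be reproduced with gensim's most_similar, here reusing the w2v_model from the previous sketch (with a toy corpus the numbers will of course differ from the slide's Wikipedia-scale results):

```python
for word, similarity in w2v_model.wv.most_similar("spain", topn=4):
    print(f"{word:<10} {similarity:.5f}")
```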
43. fastText
• Created by the Facebook AI Research lab
• Pretrained models for 294 languages
• Extracts a relationship even if the word is not in the corpus
• Enriches an existing model by assigning more weight to your own trained vectors
• Very fast library
• Allows training with CBOW or skip-gram
• Provides classification and clustering capabilities
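A minimal sketch with gensim's FastText class, which mirrors the Word2Vec API; the standalone fasttext library additionally offers the supervised classification mentioned above. The corpus and the name ft_model are ours:

```python
from gensim.models import FastText

sentences = [
    ["spain", "borders", "portugal", "and", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg picks CBOW (0) or skip-gram (1), just like Word2Vec.
ft_model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(ft_model.wv["spain"])  # dense vector, built from character n-grams
```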
44. Nice, this is a solution for everything!
• Indeed, whatever floats your boat
• Quite often the embedding is used to feed another model:
• Classifier or regression model
• Deep neural network
Input → Word Embedding → Model (?) → Output
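One common way to "feed another model", sketched under our own assumptions: represent each sentence as the average of its word vectors (reusing ft_model from above) and hand that to a scikit-learn classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens):
    # Average the embedding of every token in the sentence.
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)

train = [(["madrid", "is", "great"], 1), (["the", "car", "broke"], 0)]
X = np.array([sentence_vector(tokens) for tokens, _ in train])
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_vector(["spain", "is", "great"])]))
```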
45. Sounds good, but it has to have a disadvantage
What happens when you try to get the vector of a word that does not appear in the corpus?
[0,0,0,0,0,0,0,0,0] (the model needs retraining)
It is also coupled to the language
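A sketch of this failure mode, reusing w2v_model and ft_model from the earlier snippets; "lisbon" never appeared in the toy corpus:

```python
oov = "lisbon"

try:
    print(w2v_model.wv[oov])
except KeyError:
    # Plain word2vec has no entry at all: effectively an unknown/zero
    # vector, and the model needs retraining to cover the new word.
    print("word2vec: unknown word, retraining needed")

# fastText composes a vector from character n-grams, so it still answers.
print(ft_model.wv[oov])
```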
54. Conclusion
• Data and analysis are key
• NLP models are complex structures
• A general-purpose model does not exist
• Use frameworks and libraries to begin with something small and grow it
• Hopefully you are now an expert in NLP algorithms
Tokenization
Normalization
Stemming
Lemmatization
• set all characters to lowercase
• remove numbers (or convert numbers to textual representations)
• remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
• strip white space (also generally part of tokenization)
• remove default stop words (general English stop words)
• remove text file headers and footers
• remove HTML, XML, etc. markup and metadata (a sketch follows below)
• extract valuable data from other formats, such as JSON, or from within databases
• if you fear regular expressions, this could potentially be the part of text preprocessing in which your worst fears are realized
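A minimal sketch of the markup and format-extraction steps in that list, using only Python's standard library (the example strings are ours):

```python
import json
import re

raw = "<p>The <b>house</b> in Madrid was smaller.</p>"

text = re.sub(r"<[^>]+>", " ", raw)        # strip HTML/XML tags
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace
print(text)  # 'The house in Madrid was smaller.'

# Extracting valuable data from other formats, e.g. JSON:
record = json.loads('{"id": 1, "body": "the car broke in front of the building"}')
print(record["body"])
```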
Here I remind the audience that an ML model can be described by something as simple as a mathematical function.
https://skymind.ai/wiki/bagofwords-tf-idf
To understand the data better, a rich (high-dimensional) representation is needed.
Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Mention that this exists for any representation.
What should we do when new words appear?