DEEP LEARNING FOR
NATURAL LANGUAGE
PROCESSING
MACHINE LEARNING ENGINEER
WILDER RODRIGUES
• Coursera Mentor;
• City.AI Ambassador;
• IBM Watson AI XPRIZE contestant;
• Kaggler;
• Guest attendee at the AI for Good Global Summit at the UN;
• X-Men geek;
• Family man and father of 5 (3 kids and 2 cats).
@wilderrodrigues
https://medium.com/@wilder.rodrigues/
WHAT'S IN IT FOR YOU?
AGENDA
• The Basics
• Vector Representation of Words
• The Shallow
• [Deep] Neural Networks for NLP
• The Deep
• Convolutional Networks for NLP
• The Recurrent
• Long Short-Term Memory for NLP
• Where do we go from here?
• Automation of AWS GPUs with Terraform
VECTOR
REPRESENTATION
OF WORDS
THE BASICS
REPRESENTATIONS OF LANGUAGE
HOW DOES IT WORK?
WORD2VEC
• Cosine distance between word vectors in the
vector space:
• X = vector("biggest") − vector("big") + vector("small")
• The word whose vector is closest to X is "smallest"
(see the gensim sketch below).
• Algorithms:
• Skip-Gram
• It predicts the context words from the
target word.
• CBOW
• It predicts the target word from the bag of
all context words.
Cosine Distance vs. Euclidean Distance
The CBOW architecture predicts the current word based on the context,
and the Skip-gram predicts surrounding words given the current word.
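A minimal sketch of that analogy with gensim, assuming a pretrained word2vec model; the file name below is a placeholder, since the slides do not say which vectors the demo loads:

from gensim.models import KeyedVectors

# Assumed pretrained vectors (placeholder path).
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# vector("biggest") - vector("big") + vector("small") ~= vector("smallest"),
# with the nearest word ranked by cosine similarity.
print(vectors.most_similar(positive=['biggest', 'small'], negative=['big'], topn=1))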
DEMO
WORD2VEC
[DEEP]
NEURAL
NETWORKS
THE SHALLOW
WHERE TO FOCUS FOR NOW?
DEMO
SENTIMENT ANALYSIS
CONVOLUTIONAL
NEURAL
NETWORKS
THE DEEP
HOW DO THEY WORK?
CNNS
• Filters
• Kernel
• Strides
• Padding
• One equation to rule them all:
[Figure: worked convolution example with volumes 6x6x3 * 3x3x3 → 4x4x16 → 2x2x16.]
• Output size per dimension: (n + 2·p − f) / s + 1
• Here: (6 + 2·0 − 3) / 1 + 1 = 4 for both height and width, so 16 filters
produce a 4x4x16 output.
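A minimal check of that arithmetic, assuming TensorFlow/Keras (illustrative only, not necessarily the framework the demo uses); conv_output_size is a hypothetical helper implementing the slide's equation:

import tensorflow as tf

def conv_output_size(n, f, p=0, s=1):
    # (n + 2p - f) / s + 1, the equation from the slide
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3, p=0, s=1))  # 4

# Shape check with a Conv2D layer: 6x6x3 input, 16 filters of size 3x3, stride 1, no padding.
inputs = tf.keras.Input(shape=(6, 6, 3))
outputs = tf.keras.layers.Conv2D(filters=16, kernel_size=3, strides=1, padding='valid')(inputs)
print(outputs.shape)  # (None, 4, 4, 16)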
HOW DO THEY WORK WITH TEXT?
CNNS
• Each row of the input matrix corresponds to a
word/token; that is, each row is a low-dimensional
vector representing that word/token.
• The width of the filters is usually the same as the
width of the input matrix (the embedding size).
• The height may vary, but it is typically between 2
and 5. So a 2x5 filter covers 2 words per sliding
window (see the Conv1D sketch below).
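A minimal Keras sketch of such a text CNN, with assumed hyperparameters (vocabulary of 10,000, embedding width 64, inputs padded to 100 tokens); a Conv1D filter implicitly spans the full embedding width, matching the bullets above:

from tensorflow.keras import Input, Model, layers

tokens = Input(shape=(100,))                                    # 100 token ids per text (assumed length)
x = layers.Embedding(input_dim=10000, output_dim=64)(tokens)    # each row = one word/token vector
x = layers.Conv1D(filters=128, kernel_size=3, activation='relu')(x)  # window of 3 words,
                                                                # filter width = embedding width (64)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)              # binary sentiment output
model = Model(tokens, outputs)
model.summary()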
DEMO
SENTIMENT ANALYSIS
LONG
SHORT TERM
MEMORY
THE RECURRENT
LONG-TERM DEPENDENCY PROBLEMS
RNNS
• Small vs. large gap between the relevant
information and the point where it is needed
for the prediction:
• “The clouds are in the sky.”
• “I grew up in France… I speak fluent French.”
HOW DO THEY WORK?
LSTMS
• LSTMs’ gates (see the NumPy sketch below):
• Forget
• Decides which parts of the previous cell state
are kept and which are discarded.
• Input
• Decides which values to update; a tanh layer
produces the candidate state.
• The new cell state is the previous state scaled
by the forget gate plus the candidate state
scaled by the input gate.
• Output
• A sigmoid decides which parts of the state will
be output.
• The state is passed through a tanh and multiplied
by the sigmoid result.
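A minimal NumPy sketch of a single LSTM step, with an assumed parameter packing (the four gates stacked row-wise in W, U, and b); it mirrors the gate description above rather than any particular library's implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Pre-activations for all four gates, stacked: W is (4n, d), U is (4n, n), b is (4n,).
    z = W @ x_t + U @ h_prev + b
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])   # forget gate: which parts of the old cell state to keep
    i = sigmoid(z[1*n:2*n])   # input gate: which values to update
    g = np.tanh(z[2*n:3*n])   # candidate cell state
    o = sigmoid(z[3*n:4*n])   # output gate: which parts of the state to expose
    c_t = f * c_prev + i * g  # update: gated old state plus gated candidate
    h_t = o * np.tanh(c_t)    # output: tanh of the state, filtered by the output gate
    return h_t, c_t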
DEMO
SENTIMENT ANALYSIS
TERRAFORM
WHERE DO WE GO
FROM HERE?
INFRASTRUCTURE AS CODE
BUILDING A LANDSCAPE
• Abstracts resources and providers:
• Physical hardware;
• Virtual machines; and
• Containers.
• Multi-Tier Applications
• Multi-Cloud Deployment
• Software Demos
DEMO
PUT IT ALL TOGETHER
WHERE DID I GET THIS STUFF FROM?
REFERENCES
• Efficient Estimation of Word Representations in Vector Space: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Google,
2013.
• A Neural Probabilistic Language Model: Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. Université de
Montréal, Montréal, Québec, Canada, 2003.
• Dropout: A Simple Way to Prevent Neural Networks from Overfitting: Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya
Sutskever, Ruslan Salakhutdinov. University of Toronto, Toronto, Ontario, Canada, 2014.
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-i-8369895ffb98
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-ii-8b2b99b3fa1e
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-iii-96cfc6acfcc3
• http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
• https://github.com/ekholabs/DLinK
• https://github.com/ekholabs/automated_ml