A not-so-short
introduction to
Deep Learning NLP
Francesco Gadaleta, PhD
worldofpiggy.com
What we do today
NLP introduction (<5 min)
Deep learning introduction (10 min)
What do we want (5 min)
How do we get there (15 min)
Demo (5 min)
What’s next (5 min)
Demo (5 min)
Questions (10 min)
The Goals of NLP
Analysis of (free) text
Extract knowledge/abstract concepts from textual data (text understanding)
Generative models (chat bot, AI assistants, ...)
Word/Paragraph similarity/classification
Sentiment analysis
Traditional ML and
NLP
Traditional NLP word representation
One-hot encoding of words: binary vectors of <vocabulary_size> dimensions, e.g.
“book”    → 0 0 0 0 1 0 0 0 0 0
“chapter” → 0 0 0 0 0 0 0 0 1 0
“paper”   → 0 1 0 0 0 0 0 0 0 0
“book” AND “chapter” AND “paper” = 0: distinct one-hot vectors never overlap, so they encode no notion of similarity.
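A minimal Python sketch of this point (the toy vocabulary and word indices are made up): any two one-hot vectors are disjoint, so their AND / dot product is zero and no similarity is captured.

import numpy as np

vocab = {"book": 4, "chapter": 8, "paper": 1}   # toy vocabulary of size 10
vocab_size = 10

def one_hot(word):
    v = np.zeros(vocab_size, dtype=int)
    v[vocab[word]] = 1                           # single 1 at the word's index
    return v

book, paper = one_hot("book"), one_hot("paper")
print(book & paper)                              # element-wise AND -> all zeros
print(book @ paper)                              # dot product -> 0, i.e. "no similarity"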
Traditional soft-clustering word
representation
Soft-clustering models learn, for each cluster/topic, a distribution over words: how likely each word is in that cluster
• Latent Semantic Analysis (LSA/LSI), Random Projections
• Latent Dirichlet Allocation (LDA), HMM clustering
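A minimal soft-clustering sketch with gensim's LDA (the library choice and the toy corpus are illustrative, not necessarily what the talk used): each learned topic is a distribution over words.

from gensim import corpora, models

texts = [["book", "chapter", "paper", "read"],
         ["neural", "network", "deep", "learning"],
         ["paper", "deep", "learning", "read"]]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]    # bag-of-words counts

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id in range(2):
    print(lda.show_topic(topic_id))                    # per-topic word probabilities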
LSA - Latent Semantic Analysis
Words that are close in meaning will occur in similar pieces of text.
Good for not-so-large text data
SVD to reduce the word dimension while preserving similarity among paragraphs:
the word-by-paragraph count matrix X (huge, sparse, noisy) is factored into a low-rank approximation X ≈ U Σ V^T
Similarity = cosine(vec(w1), vec(w2))
Limitations: no polysemy, poor synonymy, bag-of-words (word order is lost)
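A short LSA sketch (toy paragraphs assumed) using scikit-learn: build the sparse count matrix, reduce it with truncated SVD, and compare paragraphs with cosine similarity.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = ["the cat sat on the mat",
              "a cat lay on a rug",
              "deep learning for text understanding"]

X = CountVectorizer().fit_transform(paragraphs)              # word counts per paragraph (huge, sparse, noisy in practice)
X_lowrank = TruncatedSVD(n_components=2).fit_transform(X)    # low-rank representation via SVD

print(cosine_similarity(X_lowrank[:1], X_lowrank))           # cosine similarity of paragraph 1 vs all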
Traditional ML and
Deep Learning
The past and the present
Human-designed representations and handcrafted features (for raw text, sound, ...) feed an ML model (Regression, Clustering, Random Forest, SVM, KNN, ...); only the model weights are optimized to produce predictions.
The future
Representation Learning: automatically learn good features or representations
Deep Learning: learn multiple levels of representation with increasing complexity and abstraction
The promises of AI (1969-2016)
Brief history of AI
1958 Rosenblatt’s perceptron
1974 Backpropagation
1995 Kernel methods (SVM)
1998 ConvNets
2006 Restricted Boltzmann Machine
2012 Google Brain Project
AI winter → AI spring → AI summer
Why is this happening?
BIG Data
GPU Power
ALGO Progress
Geoffrey Hinton
Cognitive psychologist AND Professor
at University of Toronto AND one of
the first to demonstrate the use of
generalized backpropagation to train
multi-layer networks.
Known for Backpropagation OR
Boltzmann machine AND great-great-
grandson of logician George Boole
Yann LeCun
Postdoc at Hinton’s lab.
Developed the DjVu format.
Father of Convolutional Neural Networks and of their application to Optical Character Recognition (OCR).
Proposed bio-inspired ML methods such as “Optimal Brain Damage”, a regularization (pruning) method.
LeNet-5 was for years the state of the art in artificial vision.
Yoshua Bengio
Professor at the University of Montreal.
Many contributions in Deep
Learning.
Known for Gradient-based
learning, word representations
and representation learning for
NLP.
Some reasons to apply
Deep Learning
(non-exhaustive list)
No. 1
Automatic Representation Learning
1. Who wants to manually
prepare features?
2. Often over-specified or
incomplete (or both)
3. Done? Cool!
Now do it again and again...
Pipeline: Input Data → Feature Engineering (time consuming) → ML algorithm.
And it repeats: Domain #1, #2 and #3 each need their own feature engineering, ML algorithm and validation.
No. 2
Learning from unlabeled data
Traditional NLP requires labeled training data
Guess what?
Almost all data is unlabeled
Learning how data is generated is essential to ‘understand’ data
[Demo]
No. 3
Metric Learning
Similarity
Dissimilarity
Distance matrix
Kernel
Define them, please!
No. 4
Human language is recursive
“People that don't know me think I'm shy.
People that do know me wish I were.”
Recursion
Same operator
applied to different
components (RNN)
Some examples
LeNet (proposed in 1998 by Yann LeCun)
● Convolutional Neural Network for reading bank checks
● All units of a feature map share the same set of weights
Detects the same feature at all possible locations of the input
Robust to shifts and distortions
GoogLeNet (proposed in 2014 by Szegedy et al.)
Specs
22 layers
12x fewer parameters than the winning network of the ILSVRC 2012 challenge
Introduced the Inception module (filters similar to the primate visual cortex) to find out how a local sparse structure can be approximated by readily available dense components
Too deep => gradient propagation problems => auxiliary classifiers added in the middle of the network :)
Object recognition
Captioning
Classification
Scene description (*)
(*) with semantically valid phrases.
A not-so-classic example
“Kid eating ice cream”
Neural Image Captioning
Sentiment analysis
Task
Socher et al. [1] use Recursive Neural Networks for sentiment prediction
Demo: http://nlp.stanford.edu/sentiment
Neural Generative Model
Character-based RNN
Text: Alice in Wonderland
Corpus length: 167546
Unique chars: 85
# sequences: 55842
Context chars: 20
Epochs: 280
CPU: Intel i7
GPU: NVIDIA 560M
RAM: 16 GB
Training examples are sliding 20-character windows over the text (e.g. “neural networks are fun”), each used to predict the character that follows.
INPUT <20x85> (one-hot characters), OUTPUT <1x85> (distribution over the next character)
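A hedged Keras sketch of such a character-level model (the layer size and optimizer are assumptions; the talk does not spell out the architecture): 20 one-hot characters in, a distribution over the 85 possible next characters out.

from keras.models import Sequential
from keras.layers import LSTM, Dense

context_chars, n_chars = 20, 85

model = Sequential()
model.add(LSTM(128, input_shape=(context_chars, n_chars)))   # read the 20-character window
model.add(Dense(n_chars, activation="softmax"))              # probability of the next character
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
# model.fit(X, y, batch_size=128, epochs=280)   # X: <55842, 20, 85>, y: <55842, 85>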
demo
Neural Network Architectures
image → class
image → caption
sentence → class
sentence → sentence
sequence → sequence
How many
neural networks
for speech recognition
and NLP tasks?
Just one (*)
Layers
Output: predict supervised target
Hidden: learn abstract
representations
Input: raw sensory inputs.
(*) Provided you don’t fall for exotic stuff
NN architecture: Single Neuron
n (3) inputs, 1 output, parameters W, b
Inputs x1, x2, x3 plus bias unit b = +1
Output h_{W,b}(x) = f(W·x + b), with logistic activation function f(z) = 1 / (1 + e^(-z))
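A bare-bones sketch of this single neuron (the weights and inputs are made-up numbers): a weighted sum of the inputs plus the bias, squashed by the logistic function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    return sigmoid(np.dot(W, x) + b)     # h_{W,b}(x)

x = np.array([0.5, -1.0, 2.0])           # inputs x1, x2, x3
W = np.array([0.1, 0.4, -0.3])           # one weight per input
b = 0.2                                  # bias term
print(neuron(x, W, b))                   # a value in (0, 1)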
Many Single Neurons make a Network
Input Layer → Layer 1 → Layer 2 → Layer 3 → Output Layer
Learning
Many logistic regressions at the same time
Hidden: neurons have no meaning for humans
Output to be predicted stays the same
Neural Networks in a
(not-so-small) nutshell
*** DISCLAIMER ***
After this section the charming and
fascinating halo surrounding Neural
Networks and Deep Learning will be gone.
The core of a Neural Network
(diagram: inputs x1, x2, x3 and bias b = +1 feed two stacked logistic regressions with parameters W1, b1 and W2, b2)
The core of a Neural Network
Each layer is a logistic regression; the whole stack is trained with SGD (Stochastic Gradient Descent) and backpropagation (at each layer)
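A compact numpy sketch of that training loop (toy XOR-like data and made-up layer sizes): two stacked logistic layers, gradients computed by backpropagation, weights updated by gradient descent (full-batch here for brevity).

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])   # 4 samples, 3 inputs
y = np.array([[0.], [1.], [1.], [0.]])                                   # binary targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)    # layer 1 (logistic regression)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # layer 2 (logistic regression)
lr = 0.5

for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagation (squared-error loss for simplicity)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent update at each layer
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # predictions should approach [0, 1, 1, 0]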
Non-linear
Activation Functions
Rectified Linear Unit (ReLU)
➔ fast
➔ more expressive than the logistic function
➔ helps prevent vanishing gradients
Optimization Functions
Stochastic Gradient Descent
➔ fast
➔ adaptive variants (AdaGrad, RMSProp)
➔ handles many dimensions
Fixed-size-input Neural Networks
Assumption:
we are happy with 5-gram
input (really?)
Recurrent Neural Networks
Fact:
n-gram input has a lot of limitations
Neural Networks and Text
Input window: the words “the”, “cat”, “sat”, ... plus bias b = +1
Parameters: embedding matrix Emb <vocsize, embsize>; hidden layer W1, b1 <hidden, hidden>; output layer W2, b2 <hidden, class>
vocabulary size = 1000
embedding size = 50
context = 20
classes = 2, 10, 100 (depends on the problem)
Targets: next word, sentiment, PoS tagging
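A hedged Keras sketch of this fixed-window text network using the slide's numbers (vocabulary 1000, embedding 50, context 20, 2 classes); the hidden size of 100 and the exact layer layout are assumptions, not the talk's precise model.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, emb_size, context, n_classes = 1000, 50, 20, 2

model = Sequential()
model.add(Embedding(vocab_size, emb_size, input_length=context))   # Emb <vocsize, embsize>
model.add(Flatten())                                               # concatenate the 20 word vectors
model.add(Dense(100, activation="relu"))                           # hidden layer (W1, b1)
model.add(Dense(n_classes, activation="softmax"))                  # output (W2, b2): next word / sentiment / PoS tag
model.compile(loss="categorical_crossentropy", optimizer="sgd")
model.summary()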
Neural Networks and Text
Emb <vocsize, embsize>
Words are represented as numeric vectors
(can subtract, add, group, cluster,...)
Similarity kernel (learned)
This is “knowledge” that can be transferred
+1.4% F1 Dependency Parsing, 15.2% error reduction (Koo & Collins 2008, Brown clustering)
+3.4% F1 Named Entity Recognition, 23.7% error reduction (Stanford NER, exchange clustering)
Word Embedding: plotting
Courtesy of Christopher Olah
Word Embedding: algebraic operations
Courtesy of Christopher Olah
MAN + ‘something good’ == WOMAN
WOMAN - ‘something bad’ == MAN
MAN + ‘something’ == WOMAN
KING + ‘something’ == QUEEN
Identification of linguistic regularities in [3] with 80-1600 dimensions, trained on 320M words of broadcast news (82k unique words).
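A hedged gensim sketch of this kind of vector arithmetic (the embeddings file name is a placeholder; any pretrained word2vec-format vectors would do): vec(king) - vec(man) + vec(woman) should land near vec(queen).

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)            # placeholder path
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))     # expect 'queen' near the top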
Demo: word embeddings
Training set: 9 GB free text
Vocabulary size: 50000
Embedding dimensions: 256
Context window: 10
Skip top common words: 100
Layers: [10, 100, 512, 1]
Embeddings: <50000, 256>
Feeding the network
Training pairs (window → label); a corrupted word replaces the original shown in parentheses:
“Neural nets are fun and we are happy” → 1
“Ted, Sarandos who runs Netflix’s Hollywood banana (operation) and” → 0
“makes the company’s deals, with networks and he” → 1
“studios was up first to beer (rehearse) his lines” → 0
Each word is looked up in the embedding matrix Emb <50000x256>.
Demo
word embeddings: pre-processing
Remove HTML tags
replace unicode
utf-8 encode
tokenize
4-node Spark cluster
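A hedged PySpark sketch of this pre-processing pipeline (the paths are placeholders and the regex-based cleaning is a simplification of whatever the original demo ran on the 4-node cluster).

import re
from pyspark import SparkContext

sc = SparkContext(appName="embedding-preprocessing")

def clean(line):
    line = re.sub(r"<[^>]+>", " ", line)                        # remove HTML tags
    line = line.encode("utf-8", "replace").decode("utf-8")      # force a clean utf-8 encoding
    return re.findall(r"\w+", line.lower())                     # crude tokenization

tokens = (sc.textFile("corpus/*.txt")          # placeholder input path
            .map(clean)
            .filter(lambda toks: len(toks) > 0))
tokens.saveAsTextFile("corpus_tokens")         # placeholder output path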
demo
What’s Next
from word to document embeddings
Distributed Representations of Sentences and
Documents
Quoc Le, Tomas Mikolov, Google Inc
Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel,
Antonio Torralba, Raquel Urtasun, Sanja Fidler
Who is
‘deep learning’?
Twitter, Pinterest: news delivery, broadcast
Google, Alphabet: Self-Driving Car, Smart Reply, Ads
Facebook, Inc.: automatic tagging, text understanding
Conclusion
Deep learning has simplified feature engineering in many cases (it certainly hasn't removed it).
Less feature engineering is leading to more complex machine learning architectures.
Most of the time, these model architectures are as specific to a given task as feature engineering used to be.
The job of the data scientist will stay sexy for a while (keep your fingers crossed on this one).
References
[1] Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts
Stanford University, Stanford, CA 94305, USA
[2] Document Embedding with Paragraph Vectors
Andrew M. Dai, Christopher Olah, Quoc V. Le Google
[3] Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig, Microsoft Research
[4] Distributed Representations of Sentences and Documents
Quoc Le, Tomas Mikolov, Google Inc
[5] Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
[6] Text Understanding from Scratch
Xiang Zhang, Yann LeCun Computer Science Department, Courant Institute of Mathematical Sciences, New York
University
[7] World of Piggy - Data Science at Home Podcast - History and applications of Deep Learning http://worldofpiggy.com/history-and-applications-of-deep-learning-a-new-podcast-episode/
Thank you
github.com/worldofpiggy @worldofpiggy worldofpiggy@gmail.com worldofpiggy.com
