@graphific
Roelof Pieters
Introduction to Deep Learning for NLP
22 January 2015

Stockholm Natural Language Processing Meetup
FEEDA
Slides at:

http://www.slideshare.net/roelofp/220115dlmeetup
1
Deep Learning ???
2
A couple of headlines… [all November ’14]
3
(source: Google Trends)
4
Machine Learning ??
- Audience Check -
5
• “Brain” inspired / simulations:
• vision: make learning algorithms better and easier to use
• goal: revolutions in (practical) advances for machine learning and AI
• Deep Learning = subfield of Machine Learning
Deep Learning ??
6
Biological Inspiration
7
Deep Learning ??
8
DL: Impact
9
Speech Recognition
DL: Impact
10
Deep Learning for the win!
a few examples:
• IJCNN 2011 Traffic Sign Recognition Competition
• ISBI 2012 Segmentation of neuronal structures in EM stacks challenge
• ICDAR 2011 Chinese handwriting recognition
• Deals with “construction and study of systems that can
learn from data”
Machine Learning ??
A computer program is said to learn from
experience (E) with respect to some class
of tasks (T) and performance measure (P),
if its performance at tasks in T, as measured
by P, improves with experience E
— T. Mitchell 1997
11
Machine Learning ??
Traditional Programming:
Data
Program
Output
Data
Program
Output
Machine Learning:
12
Supervised (inductive) learning
• Training data includes desired outputs
Unsupervised learning
• Training data does not include desired outputs
Semi-supervised learning
• Training data includes a few desired outputs
Reinforcement learning
• Rewards from sequence of actions
Types of Learning
13
ML: Traditional Approach
1. Gather as much LABELED data as you can get
2. Throw some algorithms at it (mainly put in an SVM and
keep it at that)
3. If you actually have tried more algos: Pick the best
4. Spend hours hand engineering some features / feature
selection / dimensionality reduction (PCA, SVD, etc)
5. Repeat…
For each new problem/question:
14
Machine Learning for NLP
Classic Approach: Data is fed into a learning algorithm:
[Diagram: Data → Learning Algorithm]
15
Machine Learning for NLP
some of the (many) treebank datasets
source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks
!
16
Penn Treebank
That’s a lot of “manual” work:
17
• the students went to class
DT NN VB P NN
• plays well with others
VB ADV P NN
NN NN P DT
• fruit flies like a banana
NN NN VB DT NN
NN VB P DT NN
NN NN P DT NN
NN VB VB DT NN
With a lot of issues:
Penn Treebank
18
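The tag ambiguity shown above is easy to reproduce with an off-the-shelf tagger. A minimal sketch, assuming NLTK is installed (plus its 'punkt' and 'averaged_perceptron_tagger' downloads); my illustration, not part of the original slides:

import nltk  # pip install nltk; then nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

for sentence in ["the students went to class",
                 "fruit flies like a banana"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))   # Penn Treebank-style (word, tag) pairs

The tagger has to commit to one reading even though “flies” and “like” each have noun and verb readings, which is exactly why hand-annotated corpora like the Penn Treebank take so much manual work.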
Machine Learning for NLP
[Diagram: Data → “Features” → Learning Algorithm → Prediction/Classifier, with a train set and a test set]
19
Machine Learning for NLP
[Diagram: “Features” → Learning Algorithm → Prediction/Classifier, with a train set and a test set]
20
Machine Learning for NLP
• Until the early 1990’s, NLP systems were built manually
with hand-crafted dictionaries and rules.
• As large electronic text corpora became increasingly
available, researchers began using machine learning
techniques to automatically build NLP systems.
• Today, the vast majority of NLP systems use machine
learning.
21
2. Neural Networks

and a short history lesson
22
Perceptron (1957)
Frank Rosenblatt (1928-1971)
Original Perceptron
Simplified model:
(From Perceptrons by M. L. Minsky and S. Papert, 1969, Cambridge, MA: MIT Press. Copyright 1969 by MIT Press.)
23
Perceptron (1957)
Perceptron Research, YouTube clip:
https://www.youtube.com/watch?v=cNxadbrN_aI&feature=youtu.be&t=12
24
Perceptron (1957)
25
or
Multilayer Perceptron (1986)
[Neuron diagram: inputs, weights, bias, activation]
26
Neuron Model
All you need to know:
27
Activation functions
28
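To make the neuron model and activation functions above concrete, a minimal NumPy sketch (my own, not from the deck): output = activation(w · x + b), evaluated with a few common activations.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1,  0.4, -0.3])  # weights
b = 0.2                          # bias

z = w @ x + b                    # weighted sum of inputs plus the bias
for f in (sigmoid, tanh, relu):
    print(f.__name__, f(z))      # same pre-activation, different activation functions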
Backpropagation (1974/1986)
1974: Paul Werbos invents the backpropagation algorithm for NNs
1986: Backprop popularized by Rumelhart, Hinton, Williams
1990: Renewed interest in NNs
29
Backprop Renaissance
Forward Propagation
• Sum inputs, produce activation, feed-forward
30
Backprop Renaissance
Back Propagation (of error)
• Calculate total error at the top
• Calculate contributions to error at each step going
backwards
31
• Compute gradient of example-wise loss wrt parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Backpropagation
32
Simple Chain Rule
33
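The slide's figure is not reproduced here; as a reminder (standard textbook form, not copied from the slide), the chain rule that backpropagation applies layer by layer is, in LaTeX:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial w},
\qquad
\frac{\partial L}{\partial x} = \frac{\partial f_n}{\partial f_{n-1}}\cdot\frac{\partial f_{n-1}}{\partial f_{n-2}}\cdots\frac{\partial f_1}{\partial x}
\quad\text{for } L = f_n(f_{n-1}(\cdots f_1(x)\cdots)),

where L is the loss, w a parameter, and a the intermediate activation through which w influences L.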
Training procedure
• Initialize randomly
• Sequentially give it data.
• See what the difference is between network output
and actual output.
• Update the weights according to this error.
• End result: give a model input, and it produces a
proper output.
Quest for the weights. The weights are the model!
To reiterate:
34
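A minimal NumPy sketch of this training procedure (my own toy example with a linear model, not the meetup's code): random initialization, feed examples one at a time, measure the error, and update the weights with it.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # toy input data
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)        # toy target outputs

w = rng.normal(size=3)                             # 1. initialize randomly
learning_rate = 0.05
for epoch in range(20):
    for x_i, y_i in zip(X, y):                     # 2. sequentially give it data
        error = x_i @ w - y_i                      # 3. difference between model output and actual output
        w -= learning_rate * error * x_i           # 4. update the weights according to this error
print("learned weights:", w)                       # 5. the weights are the model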
So why only now?
• Inspired by the architectural depth of the brain,
researchers wanted for decades to train deep
multi-layer neural networks.
• No successful attempts were reported before 2006
…Exception: convolutional neural networks,
LeCun 1998
• SVM: Vapnik and his co-workers developed the
Support Vector Machine (1993) (shallow
architecture).
• Breakthrough in 2006!
35
2006 Breakthrough
• More data
• Faster hardware: GPU’s, multi-core CPU’s
• Working ideas on how to train deep architectures
36
2006 Breakthrough
• More data
• Faster hardware: GPU’s, multi-core CPU’s
• Working ideas on how to train deep architectures
37
2006 Breakthrough
38
2006 Breakthrough
• More data
• Faster hardware: GPU’s, multi-core CPU’s
• Working ideas on how to train deep architectures
39
2006 Breakthrough
40
2006 Breakthrough
• More data
• Faster hardware: GPU’s, multi-core CPU’s
• Working ideas on how to train deep
architectures
41
2006 Breakthrough
Stacked Restricted Boltzmann Machines* (RBM)
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
Stacked Autoencoders (AE)
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19
* called Deep Belief Networks (DBN)
42
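To give a rough feel for the greedy layer-wise idea, a sketch only: scikit-learn's BernoulliRBM is assumed (it is not Hinton's DBN code, and the fine-tuning stage is omitted); each RBM is trained on the features produced by the layer below it.

from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X, _ = load_digits(return_X_y=True)
X = X / 16.0                                       # scale pixel values to [0, 1]

# Greedy layer-wise training: each RBM learns features of the layer below.
rbm1 = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
h1 = rbm1.fit_transform(X)                         # first hidden layer
rbm2 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
h2 = rbm2.fit_transform(h1)                        # second hidden layer, stacked on the first
print(h2.shape)                                    # (n_samples, 32) learned representation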
3. Deep Learning

onwards we go…
43
44
Hierarchies
Efficient
Generalization
Distributed
Sharing
Unsupervised*
Black Box
Training Time
Major PWNAGE!
Much Data
Why go Deep ?
45
No More Handcrafted Features!
46
“I’ve worked all my life in Machine Learning, and I’ve never seen one algorithm knock over benchmarks like Deep Learning”
— Andrew Ng
Deep Learning: Why?
47
Biological Justification
Deep Learning = Brain “inspired”

Audio/Visual Cortex has multiple stages == Hierarchical
• Computational Biology • CVAP
• Jorge Dávila-Chacón
• “that guy”
“Brainiacs” vs “Pragmatists”
48
Different Levels of Abstraction
49
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
Different Levels of Abstraction
Feature Representation
50
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
Different Levels of Abstraction
Feature Representation
51
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
• A good lower level
representation can be
used for many distinct
tasks
Different Levels of Abstraction
Feature Representation
52
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
• A good lower level
representation can be
used for many distinct
tasks
Different Levels of Abstraction
Feature Representation
53
• Shared Low Level
Representations
• Multi-Task Learning
• Unsupervised Training
Generalizable Learning
54
• Shared Low Level
Representations
• Multi-Task Learning
• Unsupervised Training
• Partial Feature Sharing
• Mixed Mode Learning
• Composition of
Functions
Generalizable Learning
55
Classic Deep Architecture
Input layer
Hidden layers
Output layer
56
Modern Deep Architecture
Input layer
Hidden layers
Output layer
57
Deep Learning: Why? (again)
Beat state of the art in many areas:
• Language Modeling (2012, Mikolov et al)
• Image Recognition (Krizhevsky won
2012 ImageNet competition)
• Sentiment Classification (2011, Socher et
al)
• Speech Recognition (2010, Dahl et al)
• MNIST hand-written digit recognition
(Ciresan et al, 2010)
58
One Model rules them all?
DL approaches have been successfully applied to:
Deep Learning: Why for NLP ?
Automatic summarization Coreference resolution Discourse analysis
Machine translation Morphological segmentation Named entity recognition (NER)
Natural language generation
Natural language understanding
Optical character recognition (OCR)
Part-of-speech tagging
Parsing
Question answering
Relationship extraction
Sentence boundary disambiguation
Sentiment analysis
Speech recognition
Speech segmentation
Topic segmentation and recognition
Word segmentation
Word sense disambiguation
Information retrieval (IR)
Information extraction (IE)
Speech processing
59
- COFFEE BREAK -
after the break we return with: CODE
Download the code samples now from:
https://github.com/graphific/DL-Meetup-intro
shortened url: http://goo.gl/abX1E2
 60
• Deep Neural Network
• Multilayer Perceptron (MLP) or Artificial Neural
Network (ANN)
1. MLP
Logistic regression
Training regime: Stochastic Gradient Descent (SGD) with minibatches
MNIST dataset
Simple hidden layer
61
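The actual code samples are in the DL-Meetup-intro repo linked above; as a quick stand-in (not the repo's code), here is a hedged scikit-learn sketch of the same recipe: one simple hidden layer, trained with minibatch SGD, on sklearn's small 8x8 digits set in place of full MNIST.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64,),      # one simple hidden layer
                    solver="sgd",                  # stochastic gradient descent
                    batch_size=32,                 # minibatches
                    learning_rate_init=0.1,
                    max_iter=200,
                    random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))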
2. Convolutional Neural Network
62
from: Krizhevsky, Sutskever, Hinton. (2012). ImageNet Classification with Deep Convolutional Neural Networks
[breakthrough in object recognition, Imagenet 2012]
Convolutional Neural Network
http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
movie time:
http://www.cs.toronto.edu/~hinton/adi/index.htm
63
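For a feel of the convolution operation behind the “Feature extraction using convolution” link above, a minimal sketch (my own; SciPy assumed installed): slide a small filter over an image and record its response at every position.

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)                  # toy grayscale "image"
edge_filter = np.array([[1.0, 0.0, -1.0],     # a simple vertical-edge detector
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])

feature_map = convolve2d(image, edge_filter, mode="valid")
print(feature_map.shape)                      # (6, 6): one response per valid filter position
# A convolutional layer learns many such filters and stacks their feature maps.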
That’s it, no more code! (for now)
64
Deep Learning: Future Developments
Currently an explosion of developments
• Hessian-Free networks (2010)
• Long Short Term Memory (2011)
• Large Convolutional nets, max-pooling (2011)
• Nesterov’s Gradient Descent (2013)
Currently state of the art but...
• No way of doing logical inference (extrapolation)
• No easy integration of abstract knowledge
• Hypothesis space bias might not conform with reality
65
Deep Learning: Future Challenges
66
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2013). Intriguing properties of neural networks
Left: correctly identified, Center: added noise x10, Right: “Ostrich”
• cuda-convnet2 (Alex Krizhevsky, Toronto) (C++/CUDA, optimized for GTX 580)
https://code.google.com/p/cuda-convnet2/
• Caffe (Berkeley) (Cuda/OpenCL, Theano, Python)
http://caffe.berkeleyvision.org/
• OverFeat (NYU)
http://cilvr.nyu.edu/doku.php?id=code:start
Wanna Play ?
• Theano - CPU/GPU symbolic expression compiler in Python (from the LISA lab at University of Montreal); see the small sketch after this list. http://deeplearning.net/software/theano/
• Pylearn2 - library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/
• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/
• more info: http://deeplearning.net/software_links/
Wanna Play ?
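The small sketch promised above: Theano's symbolic style in a few lines (assuming Theano is installed). You define a symbolic expression, let Theano derive the gradient, and compile both into a callable function.

import theano
import theano.tensor as T

x = T.dscalar('x')                    # symbolic scalar variable
y = x ** 2                            # symbolic expression
grad = T.grad(y, x)                   # symbolic gradient dy/dx

f = theano.function([x], [y, grad])   # compile (for CPU or GPU)
print(f(3.0))                         # [array(9.0), array(6.0)]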
as PhD candidate KTH/CSC:
“Always interested in discussing
Machine Learning, Deep
Architectures, Graphs, and
Language Technology”
In touch!
roelof@kth.se
www.csc.kth.se/~roelof/
Academic/Research
Internship / Entrepreneurship
as CIO/CTO Feeda:
“Always looking for additions to our 

brand new R&D team”



[Internships upcoming on 

KTH exjobb website…]
roelof@feeda.com
www.feeda.com
Feeda
69
We’re Hiring!
roelof@feeda.com
www.feeda.com
Feeda
• Dev Ops
• Software Developers
• Data Scientists
70
Thanks for listening
Mingling time!
71
72
Can’t get enough?
Come to my talk tomorrow (Friday)
Description on KTH website
Visual-Semantic Embeddings: 

some thoughts on Language
Roelof Pieters TCS/CSC
Friday Jan 23, 13:30
Room 304, Teknikringen 14, level 3
Addendum
Some of the exciting recent developments in NLP, especially Distributed Semantics
73
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
74
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
75
Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch
76
Multi-embeddings: Stanford (2012)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng 

Improving Word Representations via Global Context and Multiple Word Prototypes
77
Linguistic Regularities: Mikolov (2013)
code & info: https://code.google.com/p/word2vec/
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
78
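The analogy trick from this slide (king - man + woman ≈ queen) can be poked at with gensim rather than the original word2vec C tool; a sketch only, and on this toy corpus the output is not meaningful, it just shows the API (gensim 4.x assumed).

from gensim.models import Word2Vec

# Tiny made-up corpus to keep the example self-contained; real linguistic
# regularities need very large corpora (e.g. the Google News vectors).
sentences = [["king", "queen", "man", "woman", "royal", "palace"],
             ["man", "woman", "boy", "girl", "person"]] * 100
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))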
Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, Q. V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation
79
Recursive Deep Models & Sentiment: Socher (2013)
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. 2013. Recursive
Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013
code & demo: http://nlp.stanford.edu/sentiment/index.html
80
Paragraph Vectors: Le & Mikolov (2014)
Le, Q., Mikolov, T. (2014). Distributed Representations of Sentences and Documents
81
• add context (sentence, paragraph, document) to word vectors during training
Results on Stanford Sentiment Treebank dataset:
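Paragraph vectors are available in gensim as Doc2Vec; a minimal sketch (my own toy example, not the paper's setup) of learning a vector per document alongside the word vectors, then inferring one for unseen text.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["deep", "learning", "for", "nlp"], tags=[0]),
        TaggedDocument(words=["shallow", "models", "for", "nlp"], tags=[1])]

model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=50, seed=1)
print(model.infer_vector(["deep", "models", "for", "nlp"]))   # vector for an unseen paragraph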
Global Vectors, GloVe: Stanford (2014)
Pennington, J., Socher, R., Manning, C. D. (2014). GloVe: Global Vectors for Word Representation
code & demo: http://nlp.stanford.edu/projects/glove/
vs
results on the word analogy task
“similar accuracy”
82
Dependency-based Embeddings: Levy & Goldberg (2014)
Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings
code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
- Syntactic Dependency Context
Australian scientist discovers star with telescope
- Bag of Words (BoW) Context
[Precision-recall plot comparing Bag-of-Words and dependency-based contexts]
“Dependency-based embeddings have more functional similarities”
83