3. Can we understand Language?
Some of the challenges in Language Understanding:
1. Language is ambiguous: every sentence has many possible interpretations.
2. Language is productive: we will always encounter new words or new constructions.
3. Language is culturally specific.
3
4. Can we understand Language?
Some of the challenges in Language Understanding:
1. Language is ambiguous: every sentence has many possible interpretations.
2. Language is productive: we will always encounter new words or new constructions.
Part-of-speech ambiguity, for example:
• plays well with others → VB ADV P NN or NN NN P DT
• fruit flies like a banana → NN NN VB DT NN, NN VB P DT NN, NN NN P DT NN, or NN VB VB DT NN
• the students went to class → DT NN VB P NN
4
8. ML: Traditional Approach
For each new problem/question:
1. Gather as much LABELED data as you can get
2. Throw some algorithms at it (mainly put in an SVM and keep it at that)
3. If you actually have tried more algos: pick the best
4. Spend hours hand-engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc.)
5. Repeat…
8
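A minimal sketch of that traditional pipeline, here with scikit-learn (the library choice, toy texts and labels are mine, not the slides'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hand-engineered features (here just tf-idf over word n-grams) + an SVM, redone per problem.
texts = ["great movie", "terrible film", "loved it", "awful acting"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["what a great film"]))  # -> [1]
```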
9. Machine Learning for NLP
Classic approach: data is fed into a learning algorithm:
[Diagram: Data → Learning Algorithm]
9
10. Machine Learning for NLP
Some of the (many) treebank datasets
source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks
10
12. Penn Treebank
With a lot of issues:
• the students went to class → DT NN VB P NN
• plays well with others → VB ADV P NN or NN NN P DT
• fruit flies like a banana → NN NN VB DT NN, NN VB P DT NN, NN NN P DT NN, or NN VB VB DT NN
12
13. Machine Learning for NLP
[Diagram: Data (train set) → "Features" → Learning Algorithm; (test set) → "Features" → Prediction/Classifier → Prediction]
13
15. Deep Learning: Why for NLP? One Model rules them all?
DL approaches have been successfully applied to:
automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition (NER), natural language generation, natural language understanding, optical character recognition (OCR), part-of-speech tagging, parsing, question answering, relationship extraction, sentence boundary disambiguation, sentiment analysis, speech recognition, speech segmentation, topic segmentation and recognition, word segmentation, word sense disambiguation, information retrieval (IR), information extraction (IE), and speech processing.
15
18. Semantics: Meaning
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
18
19. Word Representation
• NLP (rule-based/statistical approaches at least) treats words mainly as atomic symbols: Love, Candy, Store
• or in vector space, also known as a "one hot" representation:
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]
• Its problem? Any two one-hot vectors are orthogonal, so no similarity between words is captured:
Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND
Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
19
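A minimal numpy sketch of this problem (the toy vocabulary and the random "dense" vectors below are made up for illustration):

```python
import numpy as np

# One-hot vectors over a toy vocabulary: every word gets its own dimension.
vocab = ["love", "candy", "store"]
candy = np.array([0, 1, 0])
store = np.array([0, 0, 1])

print(np.dot(candy, store))  # 0 -> one-hot vectors carry no notion of similarity

# Dense embeddings (random here, normally learned) give graded similarity instead.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=5) for w in vocab}
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(emb["candy"], emb["store"]))  # some non-trivial value in [-1, 1]
```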
23. Language Model
• Language models define probability distributions over (natural language) strings or sentences
• Joint and Conditional Probability (see the chain-rule reminder below)
23
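As a reminder of the underlying math (the standard chain-rule factorization of a sentence probability; the formula is not from the slides as extracted):

```latex
P(w_1, w_2, \ldots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```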
26. Word senses
What is the meaning of words?
• Most words have many different senses: dog = animal or sausage?
How are the meanings of different words related?
• Specific relations between senses: animal is more general than dog.
• Semantic fields: money is related to bank.
26
27. Word senses
Polysemy:
• A lexeme is polysemous if it has different related
senses
• bank = financial institution or building
Homonyms:
• Two lexemes are homonyms if their senses are
unrelated, but they happen to have the same spelling
and pronunciation
• bank = (financial) bank or (river) bank
27
28. Word senses: relations
Symmetric relations:
• Synonyms: couch/sofa
Two lemmas with the same sense
• Antonyms: cold/hot, rise/fall, in/out
Two lemmas with the opposite sense
Hierarchical relations:
• Hypernyms and hyponyms: pet/dog
The hyponym (dog) is more specific than the hypernym
(pet)
• Holonyms and meronyms: car/wheel
The meronym (wheel) is a part of the holonym (car)
28
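These relations (senses, hypernyms, meronyms, …) are exactly what WordNet encodes; a small NLTK sketch, assuming the wordnet corpus has already been downloaded (WordNet/NLTK are not mentioned on the slides, they are just used here to illustrate):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

dog = wn.synsets("dog")[0]    # first sense of "dog"
print(dog.definition())
print(dog.hypernyms())        # more general senses, e.g. canine / domestic_animal
print(wn.synsets("bank"))     # many senses: financial institution, river bank, ...
```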
29. Distributional representations
“You shall know a word by the company it keeps” (J. R. Firth 1957)
One of the most successful ideas of modern statistical NLP!
[Figure: the words occurring in context around "banking" represent its meaning]
• Hard (class-based) clustering models
• Soft clustering models
29
30. Distributional hypothesis
"He filled the wampimuk, passed it around and we all drank some."
"We found a little, hairy wampimuk sleeping behind the tree."
(McDonald & Ramscar 2001)
30
33. Distributional representations
• Taking it further: continuous word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector:
Candy = [dense vector of real values]
33
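A minimal sketch of training such dense vectors with gensim's word2vec (linked at the end of the deck); the toy corpus and parameters are made up, and the parameter names follow gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy corpus; real embeddings are trained on millions of sentences.
sentences = [
    ["i", "love", "the", "candy", "store"],
    ["the", "candy", "store", "sells", "sweet", "candy"],
    ["i", "love", "sweet", "things"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50, seed=1)

print(model.wv["candy"])                      # a dense 50-dimensional real-valued vector
print(model.wv.similarity("candy", "store"))  # graded similarity instead of 0/1
```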
34. Word Embeddings: Socher / Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA
In a perfect world:
34
35. Word Embeddings: Socher / Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born
35
36. Why Neural Networks for NLP?
• Can theoretically (given enough units) approximate "any" function
• and fit to "any" kind of data
• Efficient for NLP: hidden layers can be used as word lookup tables (see the sketch below)
• Dense distributed word vectors + efficient NN training algorithms can scale to billions of words!
36
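A quick numpy illustration of the "word lookup table" idea (the tiny vocabulary and embedding matrix are invented for the example):

```python
import numpy as np

vocab = {"the": 0, "candy": 1, "store": 2}                  # word -> row index
E = np.random.default_rng(0).normal(size=(len(vocab), 4))   # embedding matrix: one dense row per word

# "Looking up" a word is just selecting its row; a first hidden layer fed with
# one-hot inputs computes exactly this product.
one_hot = np.eye(len(vocab))[vocab["candy"]]
print(one_hot @ E)           # same as E[vocab["candy"]]
print(E[vocab["candy"]])
```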
37. How?
• Representation of words as continuous vectors has a long history (Hinton et al. 1986; Rumelhart et al. 1986; Elman 1990)
• First neural network language model: NNLM (Bengio et al. 2001; Bengio et al. 2003), based on earlier ideas of distributed representations for symbols (Hinton 1986)
37
38. Word Embeddings: Socher / Vector Space Model
Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born ?
…
38
39. Compositionality
Principle of compositionality: the “meaning (vector) of a complex expression (sentence) is determined by:
- the meanings of its constituent expressions (words) and
- the rules (grammar) used to combine them”
(Gottlob Frege, 1848 - 1925)
39
41. Compositionality
• How do we handle the compositionality of language in our models?
• Recursion: the same operator (same parameters) is applied repeatedly on different components
41
42. RNN 1: Recurrent Neural Networks
• How do we handle the compositionality of language in our models?
• Option 1: Recurrent Neural Networks (RNN)
42
43. RNN 2: Recursive Neural Networks
• How do we handle the compositionality of language in our models?
• Option 2: Recursive Neural Networks (also sometimes called RNN); see the composition sketch below
43
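A minimal sketch of the recursive composition step used in Socher-style recursive nets: the same weights combine the two children of every parse-tree node (dimensions, weights and word vectors below are placeholders, not the original implementation):

```python
import numpy as np

d = 4                                    # embedding size (placeholder)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d)) * 0.1    # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Parent vector from two child vectors: p = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# e.g. ((the movie) (was great)), with random vectors standing in for word embeddings
the, movie, was, great = (rng.normal(size=d) for _ in range(4))
np_phrase = compose(the, movie)
vp_phrase = compose(was, great)
sentence = compose(np_phrase, vp_phrase)  # the same W, b are reused at every node
print(sentence)
```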
44. Recurrent Neural Networks
• achieved SOTA in 2011 on Language Modeling (WSJ ASR task) (Mikolov et al., INTERSPEECH 2011):
"Comparison to other LMs shows that RNN LMs are state of the art by a large margin. Improvements increase with more training data."
• and again at ASRU 2011:
"[RNN LM trained on a] single core on 400M words in a few days, with 1% absolute improvement in WER on state of the art setup"
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011)
Recurrent neural network based language model
44
45. Recurrent Neural Networks
Simple recurrent neural network for LM:
[Diagram: input → hidden layer(s) with sigmoid activation function → output layer with softmax function]
(a sketch of this forward pass follows below)
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011)
Recurrent neural network based language model
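A minimal numpy sketch of that forward pass (vocabulary size, hidden size and initialization are placeholders, not Mikolov's actual code):

```python
import numpy as np

V, H = 10, 8                          # vocabulary size, hidden size (placeholders)
rng = np.random.default_rng(0)
U = rng.normal(size=(H, V)) * 0.1     # input (one-hot word) -> hidden
W = rng.normal(size=(H, H)) * 0.1     # previous hidden state -> hidden
O = rng.normal(size=(V, H)) * 0.1     # hidden -> output scores

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def step(word_id, h_prev):
    x = np.eye(V)[word_id]            # one-hot encoding of the current word
    h = sigmoid(U @ x + W @ h_prev)   # recurrent hidden state
    p = softmax(O @ h)                # distribution over the next word
    return h, p

h = np.zeros(H)
for w in [3, 1, 4]:                   # a toy "sentence" of word ids
    h, p_next = step(w, h)
print(p_next.sum())                   # 1.0 -> a proper probability distribution
```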
48. Recursive Neural Network
• Recursive Neural Network for LM (Socher et al. 2011; Socher 2014)
• achieved SOTA on the new Stanford Sentiment Treebank dataset (but comparing it to many other models):
48
Socher, R., Perelygin,, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
49. Recursive Neural Tensor Network
49
code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks
Socher, R., Liu, C.C., NG, A.Y., Manning, C.D. (2011)
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
51. Recursive Neural Network
• RNN (Socher et al. 2011a)
51
Socher, R., Perelygin,, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
52. Recursive Neural Network
• RNN (Socher et al. 2011a)
• Matrix-Vector RNN (MV-RNN) (Socher et al., 2012)
52
Socher, R., Perelygin,, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
53. Recursive Neural Network
• RNN (Socher et al. 2011a)
• Matrix-Vector RNN (MV-RNN) (Socher et al., 2012)
• Recursive Neural Tensor Network (RNTN) (Socher et al. 2013)
53
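For the RNTN, the composition at each node also includes a tensor term; a hedged numpy sketch with placeholder dimensions (following the formulation in Socher et al. 2013, not their code):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
V = rng.normal(size=(d, 2 * d, 2 * d)) * 0.01  # tensor: one 2d x 2d slice per output dimension
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)

def rntn_compose(c1, c2):
    """RNTN node: p = tanh( [c1;c2]^T V [c1;c2] + W [c1;c2] + b )."""
    c = np.concatenate([c1, c2])
    tensor_term = np.array([c @ V[k] @ c for k in range(d)])
    return np.tanh(tensor_term + W @ c + b)

left, right = rng.normal(size=d), rng.normal(size=d)
print(rntn_compose(left, right))
```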
54. Recursive Neural Network
• negation detection:
54
Socher, R., Perelygin,, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
58. Recurrent NN for Vector Space: Compositionality
[Figure: parse tree (S/ROOT with NP, PP, NP constituents; tags DT, NN, PRP, IN) annotated with "rules" and "meanings"]
58
59. Vector Space + Word Embeddings: Socher
59
Recurrent NN: Compositionality / Recurrent NN for Vector Space
60. Vector Space + Word Embeddings: Socher
60
Recurrent NN for Vector Space
61. Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
61
62. Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
62
63. Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011).
Natural Language Processing (almost) from Scratch
63
64. Multi-embeddings: Stanford (2012)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012)
Improving Word Representations via Global Context and Multiple Word Prototypes
64
65. Linguistic Regularities: Mikolov (2013)
code & info: https://code.google.com/p/word2vec/
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
65
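The "linguistic regularities" result is the famous vector arithmetic on word embeddings; a hedged sketch using gensim's downloader (the model name is one of the standard gensim-data downloads, used here only as an example, and is large, about 1.6 GB):

```python
import gensim.downloader as api

# Downloads a pretrained embedding model on first use (example model name).
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```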
66. Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, Q.V., Sutskever, I. (2013).
Exploiting Similarities among Languages for Machine Translation
66
68. Recursive Deep Models & Sentiment: Socher (2013)
Socher, R., Perelygin, A., Wu, J., Chuang, J.,Manning, C., Ng, A., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
code & demo: http://nlp.stanford.edu/sentiment/index.html
68
69. Paragraph Vectors: Le & Mikolov (2014)
• add context (sentence, paragraph, document) to word vectors during training (sketched below)
Results on the Stanford Sentiment Treebank dataset:
Le, Q., Mikolov, T. (2014) Distributed Representations of Sentences and Documents
69
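A minimal gensim Doc2Vec sketch of the paragraph-vector idea (toy documents, gensim 4.x parameter names; not the paper's original implementation):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag; the model learns a dense vector per tag alongside the word vectors.
docs = [
    TaggedDocument(words=["this", "movie", "was", "great"], tags=["doc_0"]),
    TaggedDocument(words=["a", "truly", "terrible", "film"], tags=["doc_1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

print(model.dv["doc_0"])                       # paragraph vector for doc_0
print(model.infer_vector(["great", "movie"]))  # vector for an unseen piece of text
```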
73. Global Vectors, GloVe: Stanford (2014)
Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: Global Vectors for Word Representation
code & demo: http://nlp.stanford.edu/projects/glove/
GloVe vs word2vec: "similar accuracy" in results on the word analogy task
73
74. Dependency-based Embeddings: Levy & Goldberg (2014)
Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings
code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
Example: "Australian scientist discovers star with telescope"
- Syntactic Dependency Context
- Bag of Words (BoW) Context
[Figure: precision vs. recall plot] "Dependency-based embeddings have more functional similarities"
74
79. Attention
Gregor et al (2015) DRAW: A Recurrent Neural Network For Image
Generation (arxiv) (code)
80. Wanna Play? Recent breakthroughs
• Question-Answering Systems (& Memory)
• Summarization
• Text Generation
• Dialogue Systems
• Image Captioning & other multimodal tasks
80
82. Wanna Play? QA & Memory
• Memory Networks (Weston et al 2015), Facebook
• Dynamic Memory Network (Kumar et al 2015), MetaMind
• Neural Turing Machine (Graves et al 2014), DeepMind
82
Weston et al (2015) Memory Networks (arxiv)
83. QA & Memory
83
Iyyer et al. (2014) A Neural Network for Factoid Question Answering over Paragraphs (paper)
84. Wanna Play? QA & Memory
• Memory Networks (Weston et al 2015), Facebook
• Dynamic Memory Network (Kumar et al 2015), MetaMind
• Neural Turing Machine (Graves et al 2014), DeepMind
84
Zaremba & Sutskever (2015) Learning to Execute (arxiv)
93. Image-Captioning
• Andrej Karpathy, Li Fei-Fei, 2015.
Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf) (info) (code)
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan , 2015. Show and Tell: A
Neural Image Caption Generator (arxiv)
• Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhutdinov, Richard Zemel, Yoshua Bengio, Show, Attend and Tell: Neural Image
Caption Generation with Visual Attention (arxiv) (info) (code)
94. Image-Captioning
“A person riding a motorcycle on a dirt road.” ???
96. Image-Captioning
“A stop sign is flying in blue skies.”
“A herd of elephants flying in the blue skies.”
Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, 2015. Generating Images from Captions with Attention (arxiv) (examples)
97. Wanna Play? General Deep Learning
• TensorFlow: recently released library by Google. http://tensorflow.org
• Theano - CPU/GPU symbolic expression compiler in Python (from the LISA lab at University of Montreal). http://deeplearning.net/software/theano/
• Caffe - computer vision oriented deep learning framework: caffe.berkeleyvision.org
• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu). http://torch.ch/
• more info: http://deeplearning.net/software_links/
97
98. Wanna Play? NLP
• RNNLM (Mikolov)
http://rnnlm.org
• NB-SVM
https://github.com/mesnilgr/nbsvm
• Word2Vec (skipgrams/cbow)
https://code.google.com/p/word2vec/ (original)
http://radimrehurek.com/gensim/models/word2vec.html (python)
• GloVe
http://nlp.stanford.edu/projects/glove/ (original)
https://github.com/maciejkula/glove-python (python)
• Socher et al / Stanford RNN Sentiment code:
http://nlp.stanford.edu/sentiment/code.html
• Deep Learning without Magic Tutorial:
http://nlp.stanford.edu/courses/NAACL2013/
98