@graphific
Roelof Pieters
Guest Lecture: Deep Learning for Information Retrieval
28 April 2015
www.csc.kth.se/~roelof/
roelof@kth.se
roelof@graph-technologies.com
Gve Systems
Graph Technologies R&D
DD2476 Search Engines and Information Retrieval Systems
https://www.kth.se/social/course/DD2476/
Slides online at http://www.slideshare.net/roelofp/deep-learning-for-information-retrieval
2
About Me
• (-10y) CS dropout (Amsterdam Technical Univ.)
• (2y) MSc Social Anthropology, Stockholm
University
• Current: PhD candidate at KTH/CSC with focus
on:
• Deep Learning for Natural Language
Processing (Distributed Semantics)
• Graph-based approaches for Knowledge
Representation
• Multi-modal models
• Current: Data Science Consultant at Graph
Technologies R&D & Gve Systems
• Recommender Systems
• Deep Learning
• Realtime Graph-based Search Engines
3
Information Retrieval (IR)
- Hedvig Kjellström, lecture 1
4
Data landscape is changing
1. The amount of digital data is growing at an increasing rate (IoT, digitalization, wearables, phones/tablets)
2. Data types are shifting as well:
   1. from text to audio-visual
   2. from professional to personal/social (social media)
   3. from semi-structured to unstructured
[Jussi Karlgren, NLP Sthlm Meetup 2014]
6
Data landscape is changing
Triple V’s of Big Data:
1. Volume
2. Velocity
3. Variety
7
Making sense of Data
Typical ML Regression
8
Making sense of Data
Neural Net
Typical ML Regression
Degrees of Complexity
9
perceptron demo
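The live perceptron demo is not reproduced in the deck; as a stand-in, here is a minimal, hedged sketch of a single perceptron in numpy learning the logical AND function (the data, learning rate and epoch count are my own toy choices, not the lecture's demo code):

import numpy as np

# Toy data: learn logical AND with the classic perceptron update rule.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                       # a few passes suffice for AND
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)        # step activation
        w += lr * (yi - pred) * xi        # perceptron weight update
        b += lr * (yi - pred)

print([int(w @ xi + b > 0) for xi in X])  # expected: [0, 0, 0, 1]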
Neural Net
10
(figure from Lior Rokach, Ben-Gurion University)
Neural Net
11
(figure from Lior Rokach, Ben-Gurion University)
Neural Net
12
(figure from Lior Rokach, Ben-Gurion University)
Neural Net
13
(figure from Lior Rokach, Ben-Gurion University)
Neural Net
14
(figure from Lior Rokach, Ben-Gurion University)
Neural Net
15
multilayer nn demo
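Again, the live demo itself is not included here. This hedged sketch (hand-picked illustrative weights, not learned ones) shows why the extra layer matters: a two-layer network can compute XOR, which a single perceptron cannot:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Hidden layer computes OR and NAND; output layer ANDs them together -> XOR.
W1 = np.array([[20.0, 20.0], [-20.0, -20.0]]); b1 = np.array([-10.0, 30.0])
W2 = np.array([20.0, 20.0]); b2 = -30.0

hidden = sigmoid(X @ W1.T + b1)
out = sigmoid(hidden @ W2 + b2)
print(np.round(out, 2))   # close to [0, 1, 1, 0], i.e. XOR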
Deep Learning ??
16
Deep Learning ??
17
• Learning multiple layers
• “Back propagation”
• Can “theoretically” learn any function!
Prior to 2006:
• Very slow and inefficient to train
• SVMs, random forests, etc. were state of the art (SOTA)
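A hedged illustration of the bullets above (not from the lecture): a one-hidden-layer network trained with backpropagation in plain numpy; the hidden size, learning rate and iteration count are arbitrary toy choices.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)      # backward pass: output layer error
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated back to hidden layer
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())   # usually close to [0, 1, 1, 0]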
18
2006+: the 3 Deep Learning Conspirators
19
20
“I’ve worked all my life in Machine Learning, and I’ve never seen one algorithm knock over benchmarks like Deep Learning”
— Andrew Ng
Deep Learning: Why?
21
Different Levels of Abstraction
22
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
Different Levels of Abstraction
Feature Representation
23
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
Different Levels of Abstraction
Feature Representation
24
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
• A good lower level
representation can be
used for many distinct
tasks
Different Levels of Abstraction
Feature Representation
25
Hierarchical Learning
• Natural progression
from low level to high
level structure as seen
in natural complexity
Different Levels of Abstraction
Feature Representation
26
• Easier to monitor what
is being learnt and to
guide the machine to
better subspaces
• A good lower level
representation can be
used for many distinct
tasks
Classic Deep Architecture
Input layer
Hidden layers
Output layer
27
Modern Deep Architecture
Input layer
Hidden layers
Output layer
movie time:
http://www.cs.toronto.edu/~hinton/adi/index.htm
28
[Kudos to Richard Socher, for this eloquent summary :) ]
• Manually designed features are often over-specified, incomplete
and take a long time to design and validate
• Learned Features are easy to adapt, fast to learn
• Deep learning provides a very flexible, (almost?) universal,
learnable framework for representing world, visual and
linguistic information.
• Deep learning can learn unsupervised (from raw text/audio/
images/whatever content) and supervised (with specific labels
like positive/negative)
Why Deep Learning ?
29
Word Embeddings
30
31
What about NLP ?
Some of the challenges in Language Understanding:
1. Language is ambiguous: every sentence has many possible interpretations.
2. Language is productive: we will always encounter new words or new constructions.
3. Language is culturally specific.
Language Representation
• NLP (rule-based and statistical approaches, at least) mainly treats words as atomic symbols:
Love  Candy  Store
• or in vector space:
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]
• also known as a “one hot” representation.
• Its problem?
Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND
Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
32
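A minimal sketch of the problem stated on this slide (toy vocabulary of my own): any two different one-hot vectors are orthogonal, so the representation encodes no similarity at all.

import numpy as np

vocab = ["love", "candy", "store"]            # toy vocabulary
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("candy") @ one_hot("store"))    # 0.0 -> no notion of relatedness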
Language Representation
33
- Johan Boye, lecture 2
Term-document matrix = Sparse!
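For illustration (not part of the lecture), a term-document matrix built with scikit-learn on a three-document toy corpus; it comes back as a scipy sparse matrix precisely because most counts are zero:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the candy store sells candy",
        "I love the candy store",
        "deep learning for information retrieval"]
# fit_transform returns a document-term matrix (the transpose of a
# term-document matrix), stored as a scipy sparse matrix.
X = CountVectorizer().fit_transform(docs)
print(X.shape, "non-zero entries:", X.nnz)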
Distributional representations
“You shall know a word by the company it keeps”

(J. R. Firth 1957)
One of the most successful ideas of modern
statistical NLP!
these words represent banking
• Hard (class based) clustering models
• Soft clustering models
34
Distributional hypothesis
He filled the wampimuk, passed it
around and we all drank some
We found a little, hairy wampimuk
sleeping behind the tree
(McDonald & Ramscar 2001)
35
Distributional semantics
Landauer and Dumais (1997), Turney and Pantel (2010), …
36
Distributional semantics
Distributional meaning as co-occurrence vector:
37
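A minimal sketch of such a co-occurrence vector, using token lists loosely based on the wampimuk sentences from the previous slide and a window of one word on each side (window size and tokenization are my own choices):

from collections import Counter, defaultdict

corpus = [["he", "filled", "the", "wampimuk", "and", "passed", "it"],
          ["a", "hairy", "wampimuk", "slept", "behind", "the", "tree"]]

cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # +/- 1 word context window
            if 0 <= j < len(sent):
                cooc[w][sent[j]] += 1

# The observed contexts stand in for the meaning of the unknown word.
print(cooc["wampimuk"])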
Distributional representations
• Taking it further:
• Continuous word embeddings
• Combine vector space semantics with the
prediction of probabilistic models
• Words are represented as a dense vector:
Candy =
38
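In contrast with the one-hot case, dense vectors support graded similarity. The numbers below are invented purely for illustration (real embeddings come from training):

import numpy as np

candy = np.array([0.62, -0.11, 0.35, 0.08])
store = np.array([0.55, -0.02, 0.30, 0.12])
king  = np.array([-0.40, 0.71, -0.05, 0.66])

# Cosine similarity between dense word vectors.
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(candy, store), cos(candy, king))   # the related pair scores higher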
Word Embeddings: Socher
Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA
In a perfect world:
39
Word Embeddings: Socher
Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born
40
• Can theoretically (given enough units) approximate
“any” function
• and fit to “any” kind of data
• Efficient for NLP: hidden layers can be used as word
lookup tables
• Dense distributed word vectors + efficient NN
training algorithms:
• Can scale to billions of words !
Why Neural Networks for NLP?
41
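A hedged sketch of the "word lookup table" point above: the first layer of such a network is just an embedding matrix, and feeding a sentence in amounts to indexing its rows (vocabulary, dimensionality and random values here are illustrative only):

import numpy as np

vocab = {"the": 0, "country": 1, "of": 2, "my": 3, "birth": 4}
E = np.random.default_rng(0).normal(size=(len(vocab), 50))   # embedding matrix

sentence = ["the", "country", "of", "my", "birth"]
inputs = E[[vocab[w] for w in sentence]]    # lookup = cheap row indexing
print(inputs.shape)                          # (5, 50), fed to the layers above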
Word Embeddings: Socher
Vector Space Model
Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born ?
…
42
Compositionality
Principle of compositionality:
the “meaning (vector) of a complex expression (sentence) is determined by:
- the meanings of its constituent expressions (words) and
- the rules (grammar) used to combine them”
— Gottlob Frege (1848 - 1925)
43
• How do we handle the compositionality of language in
our models?
44
Compositionality
• How do we handle the compositionality of language in
our models?
• Recursion :

the same operator (same parameters) is
applied repeatedly on different components
45
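A minimal sketch of this recursion (dimensions, weights and the tiny "parse" are illustrative, not Socher's actual trained model): one composition function with shared parameters is applied at every merge in the tree.

import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # toy embedding size
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)

def compose(left, right):
    # parent = tanh(W [left; right] + b), with the same W and b at every node
    return np.tanh(W @ np.concatenate([left, right]) + b)

the, candy, store = (rng.normal(size=d) for _ in range(3))
phrase = compose(the, compose(candy, store))   # (the (candy store))
print(phrase.shape)                             # (4,): same space as the word vectors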
Compositionality
• How do we handle the compositionality of language in
our models?
• Option 1: Recurrent Neural Networks (RNN)
46
RNN 1: Recurrent Neural Networks
(we ignore recurrent NN’s for this talk)
• How do we handle the compositionality of language in
our models?
• Option 2: Recursive Neural Networks (also
sometimes called RNN)
47
RNN 2: Recursive Neural Networks
Recursive Neural Tensor Network
48
Recursive Neural Tensor Network
49
code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks
Socher, R., Liu, C.C., Ng, A.Y., Manning, C.D. (2011)
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Recursive NN for Vector Space
[Figure: parse tree with phrase nodes NP, PP/IN, NP over POS tags DT, NN, PRP$, NN]
50
Recursive NN: Compositionality (Recursive NN for Vector Space)
[Figure: the same parse tree (NP, PP/IN, NP; DT, NN, PRP$, NN), now with word vectors at the leaves]
51
Recursive NN: Compositionality (Recursive NN for Vector Space)
[Figure: parse tree composition in progress; NP phrases built from DT NN and PRP NN]
52
Recursive NN: Compositionality (Recursive NN for Vector Space)
[Figure: the complete tree up to NP (S / ROOT); the tree supplies the “rules”, the composed vectors the “meanings”]
53
Vector Space + Word Embeddings: Socher
54
Recursive NN: Compositionality (Recursive NN for Vector Space)
Vector Space + Word Embeddings: Socher
55
Recursive NN for Vector Space
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
56
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
57
Word Embeddings: Demo
Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011).
Natural Language Processing (almost) from Scratch
59
Polysemous-embeddings: Stanford (2012)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012)

Improving Word Representations via Global Context and Multiple Word Prototypes
60
Linguistic Regularities: Mikolov (2013)
code & info: https://code.google.com/p/word2vec/
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
61
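To try the analogy arithmetic yourself, something like the following works with gensim (the vector file name is a placeholder for whichever pre-trained word2vec vectors you download; the API shown is that of recent gensim versions):

from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (file name is illustrative).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vec(king) - vec(man) + vec(woman) should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))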
Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, Q. V., Sutskever, I. (2013).

Exploiting Similarities among Languages for Machine Translation
62
Word Embeddings for MT: Kiros (2014)
Kiros, R., Zemel, R. S., Salakhutdinov, R. (2014).

A Multiplicative Model for Learning Distributed Text-Based Attribute Representations
63
Recursive Deep Models & Sentiment: Socher (2013)
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013)

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
code & demo: http://nlp.stanford.edu/sentiment/index.html
64
Paragraph Vectors: Le & Mikolov (2014)
Le, Q., Mikolov, T. (2014) Distributed Representations of Sentences and Documents
65
• add context (sentence, paragraph, document) to word vectors during training
• Results on the Stanford Sentiment Treebank dataset:
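(The results themselves appear as a table in the slide image.) For a feel of the method, a hedged toy sketch using gensim's Doc2Vec implementation of paragraph vectors; the corpus and hyperparameters are stand-ins, not the paper's setup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["this", "movie", "was", "great"], tags=["d0"]),
    TaggedDocument(words=["terrible", "plot", "and", "acting"], tags=["d1"]),
]
# Each document gets its own trainable "paragraph vector" alongside word vectors.
model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen paragraph and find the closest training document.
vec = model.infer_vector(["great", "movie"])
print(model.dv.most_similar([vec], topn=1))   # model.docvecs in older gensim versions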
Paragraph Vectors: Dai et al. (2014)
Dai, A., Olah, C., Le, Q., Corrado, G. (2014) Document Embedding with Paragraph Vectors
66
Paragraph Vectors: Dai et al. (2014)
Dai, A., Olah, C., Le, Q., Corrado, G. (2014) Document Embedding with Paragraph Vectors
67
Paragraph Vectors: Dai et al. (2014)
Dai, A., Olah, C., Le, Q., Corrado, G. (2014) Document Embedding with Paragraph Vectors
68
Nearest neighbours to the machine learning paper “Distributed
Representations of Sentences and Documents” in arXiv.
Joint Image-Word Embeddings
69
1. Multimodal representation learning
2. Generating descriptions of images
3. Ranking images and captions (“image-sentence
ranking”)
Some Current Approaches
70
Bags of Visual Words
71
Source credit : K. Grauman, B. Leibe
Bags of Visual Words (Sivic & Zisserman 2003)
Standard BoW has issues, however.
What we get:
But we want:
• visual word order/relations
• location
• scale/viewpoint invariance
• …
72
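As a hedged sketch of the pipeline behind this slide (random descriptors stand in for real SIFT features; the cluster count is arbitrary): quantize local descriptors against a learned visual vocabulary and histogram the resulting word ids. The histogram indeed discards order, location and geometry, which is exactly the limitation listed above.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 128))          # stand-ins for SIFT-like 128-D features
codebook = KMeans(n_clusters=20, n_init=10, random_state=0).fit(descriptors)

image_descriptors = rng.normal(size=(60, 128))     # local features from one image
words = codebook.predict(image_descriptors)        # assign each to a visual word
hist = np.bincount(words, minlength=20)            # the bag-of-visual-words vector
print(hist)                                        # all spatial layout is lost here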
Zero-shot Learning
• skip-gram text model on wikipedia corpus of 5.7 million
documents (5.4 billion words) - approach from (Mikolov
et al. ICLR 2013)
73
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013) 

DeViSE: A Deep Visual-Semantic Embedding Model
DeViSE model
Encoder: A deep convolutional network (CNN) and long short-
term memory recurrent network (LSTM) for learning a joint
image-sentence embedding.
Decoder: A new neural language model that combines structure
and content vectors for generating words one at a time in
sequence.
Encoder-Decoder pipeline
74
Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014)

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
(Kiros et al 2014)
• captures Multimodal linguistic regularities
Encoder-Decoder pipeline
75
• captures Multimodal linguistic regularities
Encoder-Decoder pipeline
76
(PCA projection of (300-dimensional) word and image representations)
77
Vinyals, O., Toshev, A., Bengio, S., Erhan, D. (2015)

Show and Tell: A Neural Image Caption Generator
Joint Visual-Semantic embedding
Karpathy, A., Fei-Fei, L. (2015)

Deep Visual-Semantic Alignments for Generating Image Descriptions
CNN+LSTM
CNN+RNN
78
Karpathy, A., Fei-Fei, L. (2015)

Deep Visual-Semantic Alignments for Generating Image Descriptions
Joint Visual-Semantic embedding
79
Joint Visual-Semantic embedding
Karpathy, A., Fei-Fei, L. (2015)

Deep Visual-Semantic Alignments for Generating Image Descriptions
80
Joint Visual-Semantic embedding
Karpathy, A., Fei-Fei, L. (2015)

Deep Visual-Semantic Alignments for Generating Image Descriptions
Joint Visual-Semantic embedding
81
Karpathy, A., Fei-Fei, L. (2015)

Deep Visual-Semantic Alignments for Generating Image Descriptions
demo
Any Questions?
Download example code samples from
https://github.com/graphific/DL-Meetup-intro
83
git clone --recursive https://github.com/graphific/DL-Meetup-intro.git
Wanna Play ? Code!
(more at http://deeplearning.net/ )
• Theano - CPU/GPU symbolic expression compiler in Python (from the LISA lab at the University of Montreal).
http://deeplearning.net/software/theano/
• Pylearn2 - library designed to make machine learning research easy.
http://deeplearning.net/software/pylearn2/
• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu).
http://torch.ch/
• more info: http://deeplearning.net/software_links/
Wanna Play ?
Wanna Play ? General Deep Learning
84
• RNNLM (Mikolov)

http://rnnlm.org
• NB-SVM

https://github.com/mesnilgr/nbsvm
• Word2Vec (skipgrams/cbow)

https://code.google.com/p/word2vec/ (original)

http://radimrehurek.com/gensim/models/word2vec.html (python)
• GloVe

http://nlp.stanford.edu/projects/glove/ (original)

https://github.com/maciejkula/glove-python (python)
• Socher et al / Stanford RNN Sentiment code:

http://nlp.stanford.edu/sentiment/code.html
• Deep Learning without Magic Tutorial:

http://nlp.stanford.edu/courses/NAACL2013/
Wanna Play ? NLP
85
• cuda-convnet2 (Alex Krizhevsky, Toronto) (C++/CUDA, optimized for GTX 580)

https://code.google.com/p/cuda-convnet2/
• Caffe (Berkeley) (Cuda/OpenCL, Theano, Python)

http://caffe.berkeleyvision.org/
• OverFeat (NYU) 

http://cilvr.nyu.edu/doku.php?id=code:start
Wanna Play ? Computer Vision
86
87
Impact on Computer Vision
88
Impact on Computer Vision
(from Clarifai)
89
Impact on Audio Processing
Speech Recognition
90
Impact on Audio Processing
TIMIT Speech Recognition
(from: Clarifai)
91
Impact on Natural Language Processing
[Table: POS tagging (Toutanova et al. 2003) and NER (Ando & Zhang 2005) benchmarks compared with C&W 2011]
92
Impact on Natural Language Processing
Named Entity Recognition:
93
