Deep Learning for Information Retrieval:
Models, Progress, & Opportunities
Matt Lease
School of Information (“iSchool”) @mattlease
University of Texas at Austin ml@utexas.edu
Slides:
slideshare.net/mattlease
What’s an Information School (iSchool)?
“The place where people & technology meet”
~ Wobbrock et al., 2009
iSchools now exist at many universities around the world
www.ischools.org
2
Why Is That Relevant? Collecting Annotator
Rationales for Relevance Judgments
T. McDonnell, M. Lease, M. Kutlu, & T. Elsayed
Best Paper Award, HCOMP 2016
3
Deep (a.k.a. Neural) IR
@mattlease
Growing Interest in “Deep” IR
• Success of Deep Learning (DL) in other fields
– Speech recognition, computer vision, & NLP
• Growing presence of DL in IR research
– e.g., SIGIR 2016 Keynote, Tutorial, & Workshop
• Adoption by industry
– Bloomberg: Google Turning Its Lucrative Web Search
Over to AI Machines. October, 2015
– WIRED: AI Is Transforming Google Search.
The Rest of the Web Is Next. February, 2016.
https://en.wikipedia.org/wiki/RankBrain
5
But Does IR Need Deep Learning?
• Chris Manning (Stanford)’s SIGIR Keynote:
“I’m certain that deep learning will come
to dominate SIGIR over the next couple of
years... just like speech, vision, and NLP before it.”
• Despite great successes on short texts, longer texts
typical of ad-hoc search remain more problematic,
with only recent success (e.g., Guo et al., 2016)
• As Hang Li eloquently put it, “Does IR (Really) Need
Deep Learning?” (SIGIR 2016 Neu-IR workshop)
6
Neural Information Retrieval:
A Literature Review
Ye Zhang et al.
https://arxiv.org/abs/1611.06792
Posted 18 November, 2016
7
A Few Notes
• Scope of our Literature Review
– We focus on the recent “third wave” of NN research,
excluding earlier NN studies
– We surveyed papers up through CIKM 2016
– We welcome pointers to any missed studies! 
• Terminology: “Neural” IR (much work is not ‘deep’!)
• Not all neural networks are ‘deep’
• Not all ‘deep’ models are neural
• In practice, “deep learning” & “neural” often used interchangeably
8
Roadmap for Talk
• Word Embeddings
• Extending IR Models via Word Embeddings
• Discussion
• Toward End-to-End Neural IR Architectures
• Future Outlook
• Resources
9
Slides:
slideshare.net/mattlease
Word Embeddings
@mattlease
Traditional “one-hot” word encoding
Leads to famous term mismatch problem in IR
slide courtesy of Richard Socher (Stanford)’s NAACL Tutorial
11
Distributional Representations
Define words by their co-occurrence signatures
slide courtesy of Richard Socher (Stanford)’s NAACL Tutorial
12
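To make the contrast concrete, here is a tiny numpy sketch (vocabulary, vectors, and values are invented for illustration): one-hot vectors of distinct terms always have zero cosine similarity, so "car" never matches "automobile" (the term mismatch problem), whereas dense distributional vectors can assign related terms high similarity.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot encoding: each term gets its own dimension.
vocab = ["car", "automobile", "banana"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["car"], one_hot["automobile"]))  # 0.0 -> term mismatch

# Toy dense (distributional) vectors; values are made up for illustration.
dense = {
    "car":        np.array([0.9, 0.8, 0.1]),
    "automobile": np.array([0.85, 0.75, 0.15]),
    "banana":     np.array([0.05, 0.1, 0.9]),
}
print(cosine(dense["car"], dense["automobile"]))  # high (~0.99)
print(cosine(dense["car"], dense["banana"]))      # low
```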
Popular Word Embeddings Today
• word2vec (Mikolov et al., 2013) – sliding window
– CBOW: predict center word given window context
– Skip-gram: predict context given center word
• See also: GloVe (Pennington et al., 2014)
– Matrix factorization
deeplearning4j.org/word2vec
13
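As a rough illustration of the CBOW vs. skip-gram distinction, a minimal gensim sketch (assuming gensim 4.x; the toy corpus is invented):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized "documents" (invented for illustration).
corpus = [
    ["neural", "networks", "learn", "word", "embeddings"],
    ["word", "embeddings", "help", "information", "retrieval"],
    ["retrieval", "models", "rank", "documents", "for", "queries"],
]

# sg=0 -> CBOW: predict the center word from its window context.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=0, min_count=1, epochs=50)

# sg=1 -> skip-gram: predict context words from the center word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

print(skipgram.wv["retrieval"][:5])              # a learned 50-d vector (first 5 dims)
print(skipgram.wv.most_similar("word", topn=3))  # nearest neighbors in embedding space
```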
Longer History, Other Alternatives
• Clinchant and Perronnin (2013) use classic LSI
(Deerwester et al., 1990), then convert to fixed-
length Fisher Vectors (FVs)
• Lioma et al. (2015) build on Kiela and Clark (2013)’s
prior work in distributional semantics
• Hyperspace Analogue to Language (HAL) (Lund and
Burgess, 1996); see also (Bruza and Song, 2002)
– Probabilistic HAL (Azzopardi et al., 2005)
– Zuccon et al. (2015) compare HAL vs. word2vec
14
Active Discriminative Text
Representation Learning
Joint work:
Zhang, Lease, & Wallace, AAAI 2017
https://arxiv.org/abs/1606.04212
15
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
• Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
• Approach: Expected Gradient Length (EGL), sentences vs. documents
– EGL-word: Take expected gradient wrt. embeddings only
– EGL-sm: Take expected gradient wrt. softmax layer parameters only
– EGL-word-doc: normalize each word’s gradient by its DF & sum
over the gradients for the top-k words instead of using max only
– EGL-Entropy-Beta: Balance
expected updates to word
gradients (i.e. EGL-word-doc) vs.
instance uncertainty (entropy)
• First focus on embeddings, then
later shift emphasis to entropy
16
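A minimal numpy sketch of the EGL-word idea (not the authors' implementation; a toy softmax classifier over averaged embeddings, with invented data): score each unlabeled item by the expected norm of the loss gradient with respect to the word embeddings, then query the label of the highest-scoring item.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, C = 100, 16, 2                     # vocab size, embedding dim, #classes (toy)
E = rng.normal(size=(V, d)) * 0.1        # word embeddings (randomly initialized)
W = rng.normal(size=(C, d)) * 0.1        # softmax layer weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def egl_word(doc):
    """Expected gradient length w.r.t. the embeddings of the words in `doc`."""
    x = E[doc].mean(axis=0)              # document = average of its word embeddings
    p = softmax(W @ x)                   # predicted class distribution
    score = 0.0
    for y in range(C):                   # expectation over possible labels y
        delta = p.copy()
        delta[y] -= 1.0                  # d(cross-entropy)/d(logits) for label y
        g_x = W.T @ delta                # gradient w.r.t. the averaged doc vector
        g_emb = np.stack([g_x / len(doc) for _ in doc])  # per-word embedding grads
        score += p[y] * np.linalg.norm(g_emb)
    return score

# Unlabeled pool: toy documents as lists of word ids (invented).
pool = [rng.integers(0, V, size=rng.integers(5, 15)) for _ in range(20)]
best = max(range(len(pool)), key=lambda i: egl_word(pool[i]))
print("request a label for document", best)
```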
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
Results: Sentence Classification
17
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
18
Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
Results: Document Classification
Extending IR Models
with Word Embeddings
@mattlease
Recent IR Work with Word Embeddings
20
Clinchant and Perronnin (2013)
• Precedes word2vec
– Uses classic LSI to induce dist. term representations
– Reduces to fixed-length vectors via Fisher Kernel
– Compares word vectors via cosine
• Consistently worse than DFR baseline
21
Ponte & Croft (2001): LM for IR
P(D|Q) = [ P(Q|D) P(D) ] / P(Q)
∝ P(Q|D) P(D) for fixed query
∝ P(Q|D) assume uniform P(D)
P(Q|D) = ∏_{q ∈ Q} [ α · P(q|D) + (1 − α) · P(q|C) ]
22
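A minimal sketch of query-likelihood scoring with the linear interpolation above (toy collection; α is a free smoothing parameter):

```python
import math
from collections import Counter

docs = {                                   # toy collection (invented)
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
collection = Counter(w for d in docs.values() for w in d)
coll_len = sum(collection.values())

def score(query, doc, alpha=0.7):
    """log P(Q|D) with linear interpolation of P(q|D) and P(q|C)."""
    tf = Counter(doc)
    s = 0.0
    for q in query:
        p_qd = tf[q] / len(doc)            # P(q|D): maximum likelihood in the document
        p_qc = collection[q] / coll_len    # P(q|C): collection (background) model
        s += math.log(alpha * p_qd + (1 - alpha) * p_qc)
    return s

query = "cat mat".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)   # documents ranked by query likelihood
```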
Berger & Lafferty (1999)
• IR as Statistical Translation
– Document d contains word w
– w is translated to observed query word q
23
GLM: Ganguly et al., SIGIR 2015
24
GLM: Ganguly et al., SIGIR 2015
25
NTLM: Zuccon et al., ADCS 2015
26
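The NTLM-style idea in a nutshell: use cosine similarity between word embeddings, normalized into a probability, as the translation probability P(q|w) in the Berger & Lafferty translation model. A minimal sketch under those assumptions (toy embeddings; not the authors' code):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
vocab = ["cat", "feline", "dog", "mat"]
emb = {w: rng.normal(size=8) for w in vocab}        # toy embeddings (invented)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def p_translate(q, w):
    """P(q|w): cosine(q, w), clipped at 0 and normalized over candidate query words."""
    sims = {u: max(cos(emb[u], emb[w]), 0.0) for u in vocab}
    return sims[q] / sum(sims.values())

def p_q_given_doc(q, doc):
    """Translation LM (Berger & Lafferty): P(q|D) = sum_w P(q|w) * P(w|D)."""
    tf = Counter(doc)
    return sum(p_translate(q, w) * tf[w] / len(doc) for w in tf)

print(p_q_given_doc("feline", ["cat", "mat", "mat"]))  # non-zero although "feline" is absent
```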
DeepTR: Zheng & Callan, SIGIR 2015
• Supervised learning of effective term weights
– Like RegressionRank (Lease et al., ECIR 2009),
(Lease, SIGIR 2009) but without feature engineering
• Represent each query term in context by
avg. query embedding - term embedding
27
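A minimal sketch of the DeepTR-style term feature: represent each query term by the difference between the average query embedding and that term's own embedding, then feed the feature vector to a learned regressor that predicts its weight (the embeddings and the regressor below are untrained placeholders, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=8) for w in ["cheap", "flights", "to", "austin"]}  # toy embeddings

def term_features(query):
    """DeepTR-style feature: mean(query embeddings) - embedding(term), per query term."""
    q_vec = np.mean([emb[t] for t in query], axis=0)
    return {t: q_vec - emb[t] for t in query}

# A stand-in linear regressor (weights would normally be learned from relevance data).
w_reg = rng.normal(size=8)
weights = {t: float(w_reg @ f)
           for t, f in term_features(["cheap", "flights", "to", "austin"]).items()}
print(weights)   # per-term weights to plug into a weighted query
```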
DESM: Mitra et al., arXiv 2016
• "A crucial detail often overlooked… Word2Vec [produces] two
different sets of vectors (…IN and OUT embedding spaces)… By
default, Word2Vec discards OUT ... and outputs only IN... “
• “…IN-IN and OUT-OUT cosine similarities are high for words
…similar by function or type (typical) …IN-OUT cosine similarities
are high between words that often co-occur in the same query
or document (topical).”
• Compute query & document embeddings by avg. over terms
• They map query words to IN and document words to OUT
• Compare training on Bing query log vs. Web corpus
28
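A minimal sketch of the DESM scoring idea under the assumptions above: map query terms to the IN space and document terms to the OUT space, represent the document by the centroid of its (normalized) OUT vectors, and average the per-query-term cosine similarities (toy matrices; not the released DESM code):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = {w: i for i, w in enumerate(["cambridge", "university", "river", "oxford"])}
IN  = rng.normal(size=(len(vocab), 8))    # word2vec input (IN) vectors, toy values
OUT = rng.normal(size=(len(vocab), 8))    # word2vec output (OUT) vectors, toy values

def desm(query, doc):
    """DESM (IN-OUT): mean over query terms of cosine(q_IN, centroid of doc OUT vectors)."""
    d_centroid = np.mean(
        [OUT[vocab[w]] / np.linalg.norm(OUT[vocab[w]]) for w in doc], axis=0)
    score = 0.0
    for q in query:
        q_vec = IN[vocab[q]]
        score += q_vec @ d_centroid / (np.linalg.norm(q_vec) * np.linalg.norm(d_centroid))
    return score / len(query)

print(desm(["cambridge"], ["university", "river", "oxford"]))
```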
BWESG: Vulic & Moens, SIGIR 2015
• As is typical, estimate query/document vectors
by simple average of constituent term vectors
• Alternative: use weighted average by each
term’s information-theoretic self-information
– Like IDF, expected to indicate term importance
• More on BWESG later…
29
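A small sketch of the weighted-average idea: weight each term's vector by its self-information, −log P(w), estimated from collection frequencies, so rarer (more informative) terms dominate the document vector (toy data; an assumption-laden illustration, not the BWESG code):

```python
import math
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "the", "cat"]]   # toy collection
emb = {w: rng.normal(size=8) for w in {t for d in docs for t in d}}            # toy embeddings

cf = Counter(t for d in docs for t in d)
total = sum(cf.values())

def self_information(w):
    return -math.log(cf[w] / total)      # rare terms -> large weight (IDF-like)

def doc_vector(doc):
    """Self-information-weighted average of the document's term embeddings."""
    weights = np.array([self_information(w) for w in doc])
    vecs = np.stack([emb[w] for w in doc])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(doc_vector(["the", "cat", "sat"])[:4])
```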
Diaz, Mitra, & Craswell, ACL 2016
• Learn topical word embeddings at query-time
– New flavor of classic IR global vs. local tradeoff
– Compare use of collection vs. external corpora
• No comparison to pseudo-relevance feedback
30
Zamani & Croft, ICTIR 2016a (Est.)
• Provide theoretical justification for estimating
phrasal vectors by averaging term vectors
• Transform cosine vector similarity scores by
softmax vs. sigmoid (consistently better)
– No regular cosine results reported 
• PQV: weighted average of (expanded) query
word vectors based on PRF
– No regular PRF results reported 
31
Zamani & Croft, ICTIR 2016a (Emb.)
• Propose 2 methods for word embedding-
based query expansion (EQE)
– Vary in independence assumptions, akin to RM1
and RM2 in (Lavrenko & Croft, 2001)
• Propose embedding-based rel. model (ERM)
– Can linearly mix (ML, EQE1, or EQE2) + ERM
• Strong evaluation vs. ML and RM3 baselines
32
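In the spirit of embedding-based query expansion (a generic sketch, not a faithful reimplementation of EQE1/EQE2): score each vocabulary term by its average embedding similarity to the query terms and add the top-scoring terms to the query.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["car", "automobile", "vehicle", "banana", "engine"]
emb = {w: rng.normal(size=8) for w in vocab}       # toy embeddings (invented)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(query, k=2):
    """Add the k vocabulary terms most similar (on average) to the query terms."""
    scores = {}
    for w in vocab:
        if w in query:
            continue
        scores[w] = np.mean([cos(emb[w], emb[q]) for q in query])
    expansion = sorted(scores, key=scores.get, reverse=True)[:k]
    return list(query) + expansion

print(expand(["car", "engine"]))   # original query plus its top embedding neighbors
```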
Ordentlich et al., CIKM 2016
• Word2vec at Scale in Industry
33
Cross-Lingual IR with
Bilingual Word Embeddings
@mattlease
35
36
BWESG: Vulic & Moens, SIGIR 2015
37
Ye et al., ICSE 2016: Finding Bugs
• Given textual bug report (query), find
software files needing to be fixed (documents)
– Saha, Lease, Khurshid, Perry (ASE, 2013)
• Augment the Skip-gram model to predict all
code tokens from each text word, and all text
words from each code token
38
Discussion
@mattlease
Word Embeddings: Many Details
• Use word2vec (CBOW or skipgram), GloVe, or something else?
• How to set hyper-parameters and select training data/corpora?
• Can multiple embeddings be selected dynamically or combined?
• Blending BIG out-of-domain data with small in-domain data?
• Tradeoff of off-the-shelf embeddings vs. re-training (fine-tuning
or from scratch) for a target domain?
• How much does task or downstream architecture matter?
• How to handle out-of-vocabulary (OOV) query terms?
40
CBOW, SG, GloVe, or …?
• Not clear that any single neural embedding or set
of embeddings performs best in all cases
• Neural vs. Traditional distributed representations?
– Zuccon et al. (2015): “it is not clear yet whether neural
inspired models are generally better than traditional
distributional semantic methods.”
• Models that jointly exploit multiple sets of
embeddings may be worth further pursuing…
– Zhang et al., 2016b
– Neelakantan et al., 2014
41
Which Training Data to Use?
• Zuccon et al. (2015): “the choice of corpus used to construct
word embeddings had little effect on retrieval results.”
• Zamani and Croft (2016b) train GloVe on three external
corpora and report, “there is no significant differences
between the values obtained by employing different corpora
for learning the embedding vectors.”
• Zheng and Callan (2015) : “[the system] performed equally
well with all three external corpora… although no external
corpus was best for all datasets... corpus-specific word
vectors were never best... given the wide range of training
data sizes… from 250 million words to 100 billion words – it is
striking how little correlation there is between search
accuracy and the amount of training data.”
42
Training Embeddings Across Genres
• Query logs (Mitra et al., Sordoni et al.)
• Community Q&A (Zhou et al., 2015)
• Venue comments (Manotumruksa et al., 2016)
• Medical texts (De Vine et al., 2014)
• Program. Lang & Comments (Ye et al., ICSE 2016)
• Knowledge Base (Nguyen et al., Neu-IR 2016)
43
Global vs. Local, revisited
• Global word embeddings, trained without reference
to queries, vs. local methods like PRF for exploiting
query-context, appear similarly limited as past
approaches such as topic modeling
– e.g., Yi & Allan (2009) compare topic modeling vs. PRF
• When Neural IR has helped ad-hoc search,
improvements seem modest compared to
known query expansion techniques (e.g. PRF)
• Diaz et al. (2016) learn topic-specific embeddings
44
Handling OOV Query Terms
• Easy option: ignore (some have done this)
– User might not be happy…
• Use unique random embedding for each OOV
– If the same term appears in query & document,
will match and contribute toward score
– Unlikely to yield close matches with other terms
• Misspellings and social spellings (e.g. “kool”)
– Standardize or use character-based model
45
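A small sketch of the "unique random embedding per OOV term" option: the same OOV string always maps to the same cached random vector, so an exact repeat in a document still matches the query, while accidental near-matches with other terms are unlikely (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
known = {"cool": rng.normal(size=8)}          # pretrained vocabulary (toy)
oov_cache = {}                                # one fixed random vector per unseen term

def lookup(term, dim=8):
    if term in known:
        return known[term]
    if term not in oov_cache:                 # cache so query & document share the vector
        oov_cache[term] = np.random.default_rng(hash(term) % (2**32)).normal(size=dim)
    return oov_cache[term]

v_q = lookup("kool")       # OOV query term
v_d = lookup("kool")       # same OOV term seen in a document
print(np.allclose(v_q, v_d))   # True: exact repeats still match and contribute to the score
```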
Going Beyond Bag-of-Words
• How to represent longer units of text?
– Simple answer: average word embedding vectors
• Ganguly et al. (2016):
– “... compositionality of the word vectors [only] works
well when applied over a relatively small number of
words... [and] does not scale well for a larger unit of
text, such as passages or full documents, because of
the broad context present within a whole document.”
• Common Future Work: embedding phrases…
46
Measuring Textual Similarity
• Simplest: average vectors & take cosine
• Ganguly et al. (2016): document is a mixture of Gaussians,
word embeddings are samples
• Zamani & Croft (2016): sigmoid and softmax
transformations of cosine similarity
• Kenter & de Rijke (2015): BM25 extension to incorporate
word embeddings
• Kusner et al. (2015): word mover’s distance (WMD)
– Kim et al. (2016): WMD for query-document distance
• Fisher Kernel approaches: Zhang et al. (2014), Clinchant &
Perronnin (2013), Zhou et al. (2015)
47
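A sketch of the simplest option from the list above, plus a sigmoid transformation of the cosine similarity in the spirit of Zamani & Croft (the sigmoid parameters here are arbitrary placeholders, as are the toy embeddings):

```python
import numpy as np

rng = np.random.default_rng(7)
emb = {w: rng.normal(size=8) for w in ["fast", "quick", "car", "auto", "loan"]}  # toy embeddings

def text_vector(tokens):
    return np.mean([emb[t] for t in tokens], axis=0)      # average of word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sigmoid_sim(a, b, c=10.0, d=0.5):
    """Sigmoid-transformed cosine (c, d are illustrative hyper-parameters)."""
    return 1.0 / (1.0 + np.exp(-c * (cosine(a, b) - d)))

q = text_vector(["fast", "car"])
doc = text_vector(["quick", "auto", "loan"])
print(cosine(q, doc), sigmoid_sim(q, doc))
```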
Toward End-to-End
Neural IR Architectures
@mattlease
49
End-to-End Representation Learning
vs. Feature Engineering
• e.g., CDNN: Severyn & Moschitti, SIGIR 2015
50
DSSM: Huang et al., CIKM 2013
51
CLSM: Shen et al., CIKM 2014
52
DRMM: Guo et al., CIKM 2016
• Supervised re-ranking of top 2K QL results
53
Gupta et al., SIGIR 2014
• Mixed-script IR (Hindi-English)
• Using FIRE 2013 data
54
Cohen et al., Neu-IR 2016
• Compare performance of deep and traditional
models across texts of varying lengths
• Deep often better when text is short
55
Future Outlook
@mattlease
Looking for Gains (in all the wrong places)?
• Much Neural IR work to date has investigated traditional document
retrieval (e.g. ad-hoc), seeking improved retrieval accuracy
• This framing may be too narrow
– e.g., Hang Li’s 2016 Neu-IR talk on other search scenarios
– IMHO: We have already invested decades in heavily optimizing vector
representations of queries & documents for matching, including
many approaches for addressing term mismatch – strong baselines!
• The “real” strength of Neural IR may lie elsewhere, in enabling a
new generation of search scenarios and modalities, such as
– Search via conversational agents (Yan et al., 2016)
– Multi-modal retrieval (Ma et al., 2015c,a)
– Knowledge-based IR (Nguyen et al., 2016)
– Synthesis of relevant documents (Lioma et al., 2016)
– Future search scenarios, yet to be identified & investigated
57
Industrial Research vs. Academic Research
• With efficacy driven by “big data”, perhaps
massive query logs will be needed to realize
Neural IR’s true potential?
• Will deep learning further divide industrial vs.
academic research?
58
Supervised vs. Unsupervised Deep Learning
• e.g. Supervised learning-to-rank (Liu, 2009) vs.
Unsupervised language or query modeling: Mitra &
Craswell (2015); Mitra (2015); Sordoni et al. (2015)
• LeCun et al. (Nature, 2015) wrote, “we expect
unsupervised learning to become far more
important in the longer term”
• The rise of the Web drove unsupervised and semi-supervised
approaches through the vast unlabeled data it made available
– Neural IR may best succeed where the biggest data is
naturally found, e.g., private commercial search logs
& public Web content 59
Going Deeper with Characters
“The dominant approach for many NLP tasks are recurrent neural networks,
in particular LSTMs, and convolutional neural networks. However, these
architectures are rather shallow in comparison to the deep convolutional
networks which are very successful in computer vision.
We present a new architecture for text processing which operates directly on
the character level and uses only small convolutions and pooling operations.
We are able to show that the performance of this model increases with the
depth: using up to 29 convolutional layers, we report significant
improvements over the state-of-the-art on several public text classification
tasks. To the best of our knowledge, this is the first time that very deep
convolutional nets have been applied to NLP.”
– Conneau et al., 2016
60
Resources
@mattlease
http://deeplearning.net
Neural Information Retrieval:
A Literature Review
Ye Zhang et al.
https://arxiv.org/abs/1611.06792
Posted 18 November, 2016
62
Neural IR Source Code Released
63
Word Embeddings Released
64
Matt Lease - ml@utexas.edu - @mattlease
Thank You!
UT Austin IR Lab: ir.ischool.utexas.edu
Slides: slideshare.net/mattlease
