Deep Learning for Information Retrieval:
Models, Progress, & Opportunities
Matt Lease
School of Information (“iSchool”) @mattlease
University of Texas at Austin ml@utexas.edu
Slides:
slideshare.net/mattlease
What’s an Information School (iSchool)?
“The place where people & technology meet”
~ Wobbrock et al., 2009
iSchools now exist at many universities around the world
www.ischools.org
2
Why Is That Relevant? Collecting Annotator
Rationales for Relevance Judgments
T. McDonnell, M. Lease, M. Kutlu, & T. Elsayed
Best Paper Award, HCOMP 2016
3
Deep (a.k.a. Neural) IR
@mattlease
Growing Interest in “Deep” IR
• Success of Deep Learning (DL) in other fields
– Speech recognition, computer vision, & NLP
• Growing presence of DL in IR research
– e.g., SIGIR 2016 Keynote, Tutorial, & Workshop
• Adoption by industry
– Bloomberg: Google Turning Its Lucrative Web Search
Over to AI Machines. October, 2015
– WIRED: AI Is Transforming Google Search.
The Rest of the Web Is Next. February, 2016.
https://en.wikipedia.org/wiki/RankBrain
5
But Does IR Need Deep Learning?
• Chris Manning (Stanford)’s SIGIR Keynote:
“I’m certain that deep learning will come
to dominate SIGIR over the next couple of
years... just like speech, vision, and NLP before it.”
• Despite great successes on short texts, longer texts
typical of ad-hoc search remain more problematic,
with only recent success (e.g., Guo et al., 2016)
• As Hang Li eloquently put it, “Does IR (Really) Need
Deep Learning?” (SIGIR 2016 Neu-IR workshop)
6
Neural Information Retrieval:
A Literature Review
Ye Zhang et al.
https://arxiv.org/abs/1611.06792
Posted 18 November, 2016
7
A Few Notes
• Scope of our Literature Review
– We focus on the recent “third wave” of NN research,
excluding earlier NN studies
– We surveyed papers up through CIKM 2016
– We welcome pointers to any missed studies! 
• Terminology: “Neural” IR (much work is not ‘deep’!)
• Not all neural networks are ‘deep’
• Not all ‘deep’ models are neural
• In practice, “deep learning” & “neural” often used interchangeably
8
Roadmap for Talk
• Word Embeddings
• Extending IR Models via Word Embeddings
• Discussion
• Toward End-to-End Neural IR Architectures
• Future Outlook
• Resources
9
Slides:
slideshare.net/mattlease
Word Embeddings
@mattlease
Traditional “one-hot” word encoding
Leads to famous term mismatch problem in IR
slide courtesy of Richard Socher (Stanford)’s NAACL Tutorial
11
Distributional Representations
Define words by their co-occurrence signatures
slide courtesy of Richard Socher (Stanford)’s NAACL Tutorial
12
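To make the contrast concrete, here is a tiny numpy sketch (vocabulary, vectors, and values are invented for illustration): one-hot vectors of distinct terms always have zero cosine similarity, so "car" never matches "automobile" (the term mismatch problem), whereas dense distributional vectors can assign related terms high similarity.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot encoding: each term gets its own dimension.
vocab = ["car", "automobile", "banana"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["car"], one_hot["automobile"]))  # 0.0 -> term mismatch

# Toy dense (distributional) vectors; values are made up for illustration.
dense = {
    "car":        np.array([0.9, 0.8, 0.1]),
    "automobile": np.array([0.85, 0.75, 0.15]),
    "banana":     np.array([0.05, 0.1, 0.9]),
}
print(cosine(dense["car"], dense["automobile"]))  # high (~0.99)
print(cosine(dense["car"], dense["banana"]))      # low
```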
Popular Word Embeddings Today
• word2vec (Mikolov et al., 2013) – sliding window
– CBOW: predict center word given window context
– Skip-gram: predict context given center word
• See also: GloVe (Pennington et al., 2014)
– Matrix factorization
deeplearning4j.org/word2vec
13
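As a rough illustration of the CBOW vs. skip-gram distinction, a minimal gensim sketch (assuming gensim 4.x; the toy corpus is invented):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized "documents" (invented for illustration).
corpus = [
    ["neural", "networks", "learn", "word", "embeddings"],
    ["word", "embeddings", "help", "information", "retrieval"],
    ["retrieval", "models", "rank", "documents", "for", "queries"],
]

# sg=0 -> CBOW: predict the center word from its window context.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=0, min_count=1, epochs=50)

# sg=1 -> skip-gram: predict context words from the center word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

print(skipgram.wv["retrieval"][:5])              # a learned 50-d vector (first 5 dims)
print(skipgram.wv.most_similar("word", topn=3))  # nearest neighbors in embedding space
```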
Longer History, Other Alternatives
• Clinchant and Perronnin (2013) use classic LSI
(Deerwester et al., 1990), then convert to fixed-
length Fisher Vectors (FVs)
• Lioma et al. (2015) build on Kiela and Clark (2013)’s
prior work in distributional semantics
• Hyperspace Analogue to Language (HAL) (Lund and
Burgess, 1996); see also (Bruza and Song, 2002)
– Probabilistic HAL (Azzopardi et al., 2005)
– Zuccon et al. (2015) compare HAL vs. word2vec
14
Active Discriminative Text
Representation Learning
Joint work:
Zhang, Lease, & Wallace, AAAI 2017
https://arxiv.org/abs/1606.04212
15
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
• Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
• Approach: Expected Gradient Length (EGL), sentences vs. documents
– EGL-word: Take expected gradient wrt. embeddings only
– EGL-sm: Take expected gradient wrt. softmax layer parameters only
– EGL-word-doc: normalize each word’s gradient by its DF & sum
over the gradients for the top-k words instead of using max only
– EGL-Entropy-Beta: Balance
expected updates to word
gradients (i.e. EGL-word-doc) vs.
instance uncertainty (entropy)
• First focus on embeddings, then
later shift emphasis to entropy
16
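A minimal numpy sketch of the EGL-word idea (not the authors' implementation; a toy softmax classifier over averaged embeddings, with invented data): score each unlabeled item by the expected norm of the loss gradient with respect to the word embeddings, then query the label of the highest-scoring item.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, C = 100, 16, 2                     # vocab size, embedding dim, #classes (toy)
E = rng.normal(size=(V, d)) * 0.1        # word embeddings (randomly initialized)
W = rng.normal(size=(C, d)) * 0.1        # softmax layer weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def egl_word(doc):
    """Expected gradient length w.r.t. the embeddings of the words in `doc`."""
    x = E[doc].mean(axis=0)              # document = average of its word embeddings
    p = softmax(W @ x)                   # predicted class distribution
    score = 0.0
    for y in range(C):                   # expectation over possible labels y
        delta = p.copy()
        delta[y] -= 1.0                  # d(cross-entropy)/d(logits) for label y
        g_x = W.T @ delta                # gradient w.r.t. the averaged doc vector
        g_emb = np.stack([g_x / len(doc) for _ in doc])  # per-word embedding grads
        score += p[y] * np.linalg.norm(g_emb)
    return score

# Unlabeled pool: toy documents as lists of word ids (invented).
pool = [rng.integers(0, V, size=rng.integers(5, 15)) for _ in range(20)]
best = max(range(len(pool)), key=lambda i: egl_word(pool[i]))
print("request a label for document", best)
```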
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
Results: Sentence Classification
17
Active Discriminative Text Representation Learning
Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
18
Idea: Select next item to label to first optimize feature representation
(i.e. word embeddings) before optimizing model to use these features
Results: Document Classification
Extending IR Models
with Word Embeddings
@mattlease
Recent IR Work with Word Embeddings
20
Clinchant and Perronnin (2013)
• Precedes word2vec
– Uses classic LSI to induce dist. term representations
– Reduces to fixed-length vectors via Fisher Kernel
– Compares word vectors via cosine
• Consistently worse than DFR baseline
21
Ponte & Croft (2001): LM for IR
P(D|Q) = [ P(Q|D) P(D) ] / P(Q)
∝ P(Q|D) P(D) for fixed query
∝ P(Q|D) assume uniform P(D)
P(Q|D) = ∏_{q ∈ Q} [ α · P(q|D) + (1 − α) · P(q|C) ]
22
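A minimal sketch of query-likelihood scoring with the linear interpolation above (toy collection; α is a free smoothing parameter):

```python
import math
from collections import Counter

docs = {                                   # toy collection (invented)
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
collection = Counter(w for d in docs.values() for w in d)
coll_len = sum(collection.values())

def score(query, doc, alpha=0.7):
    """log P(Q|D) with linear interpolation of P(q|D) and P(q|C)."""
    tf = Counter(doc)
    s = 0.0
    for q in query:
        p_qd = tf[q] / len(doc)            # P(q|D): maximum likelihood in the document
        p_qc = collection[q] / coll_len    # P(q|C): collection (background) model
        s += math.log(alpha * p_qd + (1 - alpha) * p_qc)
    return s

query = "cat mat".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)   # documents ranked by query likelihood
```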
Berger & Lafferty (1999)
• IR as Statistical Translation
– Document d contains word w
– w is translated to observed query word q
23
GLM: Ganguly et al., SIGIR 2015
24
GLM: Ganguly et al., SIGIR 2015
25
NTLM: Zuccon et al., ADCS 2015
26
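The NTLM-style idea in a nutshell: use cosine similarity between word embeddings, normalized into a probability, as the translation probability P(q|w) in the Berger & Lafferty translation model. A minimal sketch under those assumptions (toy embeddings; not the authors' code):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
vocab = ["cat", "feline", "dog", "mat"]
emb = {w: rng.normal(size=8) for w in vocab}        # toy embeddings (invented)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def p_translate(q, w):
    """P(q|w): cosine(q, w), clipped at 0 and normalized over candidate query words."""
    sims = {u: max(cos(emb[u], emb[w]), 0.0) for u in vocab}
    return sims[q] / sum(sims.values())

def p_q_given_doc(q, doc):
    """Translation LM (Berger & Lafferty): P(q|D) = sum_w P(q|w) * P(w|D)."""
    tf = Counter(doc)
    return sum(p_translate(q, w) * tf[w] / len(doc) for w in tf)

print(p_q_given_doc("feline", ["cat", "mat", "mat"]))  # non-zero although "feline" is absent
```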
DeepTR: Zheng & Callan, SIGIR 2015
• Supervised learning of effective term weights
– Like RegressionRank (Lease et al., ECIR 2009),
(Lease, SIGIR 2009) but without feature engineering
• Represent each query term in context by
avg. query embedding - term embedding
27
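A minimal sketch of the DeepTR-style term feature: represent each query term by the difference between the average query embedding and that term's own embedding, then feed the feature vector to a learned regressor that predicts its weight (the embeddings and the regressor below are untrained placeholders, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=8) for w in ["cheap", "flights", "to", "austin"]}  # toy embeddings

def term_features(query):
    """DeepTR-style feature: mean(query embeddings) - embedding(term), per query term."""
    q_vec = np.mean([emb[t] for t in query], axis=0)
    return {t: q_vec - emb[t] for t in query}

# A stand-in linear regressor (weights would normally be learned from relevance data).
w_reg = rng.normal(size=8)
weights = {t: float(w_reg @ f)
           for t, f in term_features(["cheap", "flights", "to", "austin"]).items()}
print(weights)   # per-term weights to plug into a weighted query
```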
DESM: Mitra et al., arXiv 2016
• "A crucial detail often overlooked… Word2Vec [produces] two
different sets of vectors (…IN and OUT embedding spaces)… By
default, Word2Vec discards OUT ... and outputs only IN... “
• “…IN-IN and OUT-OUT cosine similarities are high for words
…similar by function or type (typical) …IN-OUT cosine similarities
are high between words that often co-occur in the same query
or document (topical).”
• Compute query & document embeddings by avg. over terms
• They map query words to IN and document words to OUT
• Compare training on Bing query log vs. Web corpus
28
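A minimal sketch of the DESM scoring idea under the assumptions above: map query terms to the IN space and document terms to the OUT space, represent the document by the centroid of its (normalized) OUT vectors, and average the per-query-term cosine similarities (toy matrices; not the released DESM code):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = {w: i for i, w in enumerate(["cambridge", "university", "river", "oxford"])}
IN  = rng.normal(size=(len(vocab), 8))    # word2vec input (IN) vectors, toy values
OUT = rng.normal(size=(len(vocab), 8))    # word2vec output (OUT) vectors, toy values

def desm(query, doc):
    """DESM (IN-OUT): mean over query terms of cosine(q_IN, centroid of doc OUT vectors)."""
    d_centroid = np.mean(
        [OUT[vocab[w]] / np.linalg.norm(OUT[vocab[w]]) for w in doc], axis=0)
    score = 0.0
    for q in query:
        q_vec = IN[vocab[q]]
        score += q_vec @ d_centroid / (np.linalg.norm(q_vec) * np.linalg.norm(d_centroid))
    return score / len(query)

print(desm(["cambridge"], ["university", "river", "oxford"]))
```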
BWESG: Vulic & Moens, SIGIR 2015
• As is typical, estimate query/document vectors
by simple average of constituent term vectors
• Alternative: use weighted average by each
term’s information-theoretic self-information
– Like IDF, expected to indicate term importance
• More on BWESG later…
29
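A small sketch of the weighted-average idea: weight each term's vector by its self-information, −log P(w), estimated from collection frequencies, so rarer (more informative) terms dominate the document vector (toy data; an assumption-laden illustration, not the BWESG code):

```python
import math
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "the", "cat"]]   # toy collection
emb = {w: rng.normal(size=8) for w in {t for d in docs for t in d}}            # toy embeddings

cf = Counter(t for d in docs for t in d)
total = sum(cf.values())

def self_information(w):
    return -math.log(cf[w] / total)      # rare terms -> large weight (IDF-like)

def doc_vector(doc):
    """Self-information-weighted average of the document's term embeddings."""
    weights = np.array([self_information(w) for w in doc])
    vecs = np.stack([emb[w] for w in doc])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(doc_vector(["the", "cat", "sat"])[:4])
```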
Diaz, Mitra, & Craswell, ACL 2016
• Learn topical word embeddings at query-time
– New flavor of classic IR global vs. local tradeoff
– Compare use of collection vs. external corpora
• No comparison to pseudo-relevance feedback
30
Zamani & Croft, ICTIR 2016a (Est.)
• Provide theoretical justification for estimating
phrasal vectors by averaging term vectors
• Transform cosine vector similarity scores by
softmax vs. sigmoid (consistently better)
– No regular cosine results reported 
• PQV: weighted average of (expanded) query
word vectors based on PRF
– No regular PRF results reported 
31
Zamani & Croft, ICTIR 2016a (Emb.)
• Propose 2 methods for word embedding-
based query expansion (EQE)
– Vary in independence assumptions, akin to RM1
and RM2 in (Lavrenko & Croft, 2001)
• Propose embedding-based rel. model (ERM)
– Can linearly mix (ML, EQE1, or EQE2) + ERM
• Strong evaluation vs. ML and RM3 baselines
32
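In the spirit of embedding-based query expansion (a generic sketch, not a faithful reimplementation of EQE1/EQE2): score each vocabulary term by its average embedding similarity to the query terms and add the top-scoring terms to the query.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["car", "automobile", "vehicle", "banana", "engine"]
emb = {w: rng.normal(size=8) for w in vocab}       # toy embeddings (invented)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(query, k=2):
    """Add the k vocabulary terms most similar (on average) to the query terms."""
    scores = {}
    for w in vocab:
        if w in query:
            continue
        scores[w] = np.mean([cos(emb[w], emb[q]) for q in query])
    expansion = sorted(scores, key=scores.get, reverse=True)[:k]
    return list(query) + expansion

print(expand(["car", "engine"]))   # original query plus its top embedding neighbors
```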
Ordentlich et al., CIKM 2016
• Word2vec at Scale in Industry
33
Cross-Lingual IR with
Bilingual Word Embeddings
@mattlease
35
36
BWESG: Vulic & Moens, SIGIR 2015
37
Ye et al., ICSE 2016: Finding Bugs
• Given textual bug report (query), find
software files needing to be fixed (documents)
– Saha, Lease, Khurshid, Perry (ASE, 2013)
• Augment the Skip-gram model to predict all
code tokens from each text word, and all text
words from each code token
38
Discussion
@mattlease
Word Embeddings: Many Details
• Use word2vec (CBOW or skipgram), GloVe, or something else?
• How to set hyper-parameters and select training data/corpora?
• Can multiple embeddings be selected dynamically or combined?
• Blending BIG out-of-domain data with small in-domain data?
• Tradeoff of off-the-shelf embeddings vs. re-training (fine-tuning
or from scratch) for a target domain?
• How much does task or downstream architecture matter?
• How to handle out-of-vocabulary (OOV) query terms?
40
CBOW, SG, GloVe, or …?
• Not clear that any single neural embedding or set
of embeddings performs best in all cases
• Neural vs. Traditional distributed representations?
– Zuccon et al. (2015): “it is not clear yet whether neural
inspired models are generally better than traditional
distributional semantic methods.”
• Models that jointly exploit multiple sets of
embeddings may be worth further pursuing…
– Zhang et al., 2016b
– Neelakantan et al., 2014
41
Which Training Data to Use?
• Zuccon et al. (2015): “the choice of corpus used to construct
word embeddings had little effect on retrieval results.”
• Zamani and Croft (2016b) train GloVe on three external
corpora and report, “there is no significant differences
between the values obtained by employing different corpora
for learning the embedding vectors.”
• Zheng and Callan (2015) : “[the system] performed equally
well with all three external corpora… although no external
corpus was best for all datasets... corpus-specific word
vectors were never best... given the wide range of training
data sizes… from 250 million words to 100 billion words – it is
striking how little correlation there is between search
accuracy and the amount of training data.”
42
Training Embeddings Across Genres
• Query logs (Mitra et al., Sordoni et al.)
• Community Q&A (Zhou et al., 2015)
• Venue comments (Manotumruksa et al., 2016)
• Medical texts (De Vine et al., 2014)
• Program. Lang & Comments (Ye et al., ICSE 2016)
• Knowledge Base (Nguyen et al., Neu-IR 2016)
43
Global vs. Local, revisited
• Global word embeddings, trained without reference
to queries, vs. local methods like PRF for exploiting
query-context, appear similarly limited as past
approaches such as topic modeling
– e.g., Yi & Allan (2009) compare topic modeling vs. PRF
• When Neural IR has helped ad-hoc search,
improvements seem modest compared to
known query expansion techniques (e.g. PRF)
• Diaz et al. (2016) learn topic-specific embeddings
44
Handling OOV Query Terms
• Easy option: ignore (some have done this)
– User might not be happy…
• Use unique random embedding for each OOV
– If the same term appears in query & document,
will match and contribute toward score
– Unlikely to yield close matches with other terms
• Misspellings and social spellings (e.g. “kool”)
– Standardize or use character-based model
45
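A small sketch of the "unique random embedding per OOV term" option: the same OOV string always maps to the same cached random vector, so an exact repeat in a document still matches the query, while accidental near-matches with other terms are unlikely (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
known = {"cool": rng.normal(size=8)}          # pretrained vocabulary (toy)
oov_cache = {}                                # one fixed random vector per unseen term

def lookup(term, dim=8):
    if term in known:
        return known[term]
    if term not in oov_cache:                 # cache so query & document share the vector
        oov_cache[term] = np.random.default_rng(hash(term) % (2**32)).normal(size=dim)
    return oov_cache[term]

v_q = lookup("kool")       # OOV query term
v_d = lookup("kool")       # same OOV term seen in a document
print(np.allclose(v_q, v_d))   # True: exact repeats still match and contribute to the score
```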
Going Beyond Bag-of-Words
• How to represent longer units of text?
– Simple answer: average word embedding vectors
• Ganguly et al. (2016):
– “... compositionality of the word vectors [only] works
well when applied over a relatively small number of
words... [and] does not scale well for a larger unit of
text, such as passages or full documents, because of
the broad context present within a whole document.”
• Common Future Work: embedding phrases…
46
Measuring Textual Similarity
• Simplest: average vectors & take cosine
• Ganguly et al. (2016): document is a mixture of Gaussians,
word embeddings are samples
• Zamani & Croft (2016): sigmoid and softmax
transformations of cosine similarity
• Kenter & de Rijke (2015): BM25 extension to incorporate
word embeddings
• Kusner et al. (2015): word mover’s distance (WMD)
– Kim et al. (2016): WMD for query-document distance
• Fisher Kernel approaches: Zhang et al. (2014), Clinchant &
Perronnin (2013), Zhou et al. (2015)
47
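A sketch of the simplest option from the list above, plus a sigmoid transformation of the cosine similarity in the spirit of Zamani & Croft (the sigmoid parameters here are arbitrary placeholders, as are the toy embeddings):

```python
import numpy as np

rng = np.random.default_rng(7)
emb = {w: rng.normal(size=8) for w in ["fast", "quick", "car", "auto", "loan"]}  # toy embeddings

def text_vector(tokens):
    return np.mean([emb[t] for t in tokens], axis=0)      # average of word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sigmoid_sim(a, b, c=10.0, d=0.5):
    """Sigmoid-transformed cosine (c, d are illustrative hyper-parameters)."""
    return 1.0 / (1.0 + np.exp(-c * (cosine(a, b) - d)))

q = text_vector(["fast", "car"])
doc = text_vector(["quick", "auto", "loan"])
print(cosine(q, doc), sigmoid_sim(q, doc))
```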
Toward End-to-End
Neural IR Architectures
@mattlease
49
End-to-End Representation Learning
vs. Feature Engineering
• e.g., CDNN: Severyn & Moschitti, SIGIR 2015
50
DSSM: Huang et al., CIKM 2013
51
CLSM: Shen et al., CIKM 2014
52
DRMM: Guo et al., CIKM 2016
• Supervised re-ranking of top 2K QL results
53
Gupta et al., SIGIR 2014
• Mixed-script IR (Hindi-English)
• Using FIRE 2013 data
54
Cohen et al., Neu-IR 2016
• Compare performance of deep and traditional
models across texts of varying lengths
• Deep often better when text is short
55
Future Outlook
@mattlease
Looking for Gains (in all the wrong places)?
• Much Neural IR work to date has investigated traditional document
retrieval (e.g. ad-hoc), seeking improved retrieval accuracy
• This framing may be too narrow
– e.g., Hang Li’s 2016 Neu-IR talk on other search scenarios
– IMHO: We have already invested decades in heavily optimizing vector
representations of queries & documents for matching, including
many approaches for addressing term mismatch – strong baselines!
• The “real” strength of Neural IR may lie elsewhere, in enabling a
new generation of search scenarios and modalities, such as
– Search via conversational agents (Yan et al., 2016)
– Multi-modal retrieval (Ma et al., 2015c,a)
– Knowledge-based IR (Nguyen et al., 2016)
– Synthesis of relevant documents (Lioma et al., 2016)
– Future search scenarios, yet to be identified & investigated
57
Industrial Research vs. Academic Research
• With efficacy driven by “big data”, perhaps
massive query logs will be needed to realize
Neural IR’s true potential?
• Will deep learning further divide industrial vs.
academic research?
58
Supervised vs. Unsupervised Deep Learning
• e.g. Supervised learning-to-rank (Liu, 2009) vs.
Unsupervised language or query modeling: Mitra &
Craswell (2015); Mitra (2015); Sordoni et al. (2015)
• LeCun et al. (Nature, 2015) wrote, “we expect
unsupervised learning to become far more
important in the longer term”
• The rise of the Web drove unsupervised and semi-supervised
approaches through the vast unlabeled data it made available
– Neural IR may best succeed where the biggest data is
naturally found, e.g., private commercial search logs
& public Web content 59
Going Deeper with Characters
“The dominant approach for many NLP tasks are recurrent neural networks,
in particular LSTMs, and convolutional neural networks. However, these
architectures are rather shallow in comparison to the deep convolutional
networks which are very successful in computer vision.
We present a new architecture for text processing which operates directly on
the character level and uses only small convolutions and pooling operations.
We are able to show that the performance of this model increases with the
depth: using up to 29 convolutional layers, we report significant
improvements over the state-of-the-art on several public text classification
tasks. To the best of our knowledge, this is the first time that very deep
convolutional nets have been applied to NLP.”
– Conneau et al., 2016
60
Resources
@mattlease
http://deeplearning.net
Neural Information Retrieval:
A Literature Review
Ye Zhang et al.
https://arxiv.org/abs/1611.06792
Posted 18 November, 2016
62
Neural IR Source Code Released
63
Word Embeddings Released
64
Matt Lease - ml@utexas.edu - @mattlease
Thank You!
UT Austin IR Lab: ir.ischool.utexas.edu
Slides: slideshare.net/mattlease
