Deep Learning for Information Retrieval: Models, Progress, & Opportunities

Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.



  1. Deep Learning for Information Retrieval: Models, Progress, & Opportunities
     Matt Lease • School of Information (“iSchool”) • University of Texas at Austin
     @mattlease • ml@utexas.edu • Slides: slideshare.net/mattlease
  2. What’s an Information School (iSchool)?
     “The place where people & technology meet” ~ Wobbrock et al., 2009
     iSchools now exist at many universities around the world: www.ischools.org
  3. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments
     T. McDonnell, M. Lease, M. Kutlu, & T. Elsayed. Best Paper Award, HCOMP 2016
  4. Deep (a.k.a. Neural) IR
  5. Growing Interest in “Deep” IR
     • Success of Deep Learning (DL) in other fields
       – Speech recognition, computer vision, & NLP
     • Growing presence of DL in IR research
       – e.g., SIGIR 2016 Keynote, Tutorial, & Workshop
     • Adoption by industry
       – Bloomberg: “Google Turning Its Lucrative Web Search Over to AI Machines.” October 2015
       – WIRED: “AI is Transforming Google Search. The Rest of the Web is Next.” February 2016
     https://en.wikipedia.org/wiki/RankBrain
  6. But Does IR Need Deep Learning?
     • Chris Manning (Stanford)’s SIGIR Keynote: “I’m certain that deep learning will come to dominate SIGIR over the next couple of years... just like speech, vision, and NLP before it.”
     • Despite great successes on short texts, longer texts typical of ad-hoc search remain more problematic, with only recent success (e.g., Guo et al., 2016)
     • As Hang Li eloquently put it, “Does IR (Really) Need Deep Learning?” (SIGIR 2016 Neu-IR workshop)
  7. Neural Information Retrieval: A Literature Review
     Ye Zhang et al. • https://arxiv.org/abs/1611.06792 (posted 18 November 2016)
  8. A Few Notes
     • Scope of our Literature Review
       – We focus on the recent “third wave” of NN research, excluding earlier NN studies
       – We surveyed papers up through CIKM 2016
       – We welcome pointers to any missed studies!
     • Terminology: “Neural” IR (much work is not ‘deep’!)
       – Not all neural networks are ‘deep’
       – Not all ‘deep’ models are neural
       – In practice, “deep learning” & “neural” are often used interchangeably
  9. Roadmap for Talk
     • Word Embeddings
     • Extending IR Models via Word Embeddings
     • Discussion
     • Toward End-to-End Neural IR Architectures
     • Future Outlook
     • Resources
  10. Word Embeddings
  11. Traditional “one-hot” word encoding
     Leads to the famous term mismatch problem in IR
     (slide courtesy of Richard Socher (Stanford)’s NAACL tutorial)
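To see the mismatch problem concretely: under one-hot encoding, every pair of distinct terms is orthogonal, so a query term can never partially match a synonymous document term. A minimal sketch (toy vocabulary of my choosing, not from the talk):

```python
import numpy as np

# Hypothetical 5-word vocabulary; each word is a one-hot vector.
vocab = {"car": 0, "automobile": 1, "vehicle": 2, "dog": 3, "cat": 4}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" are synonyms, yet their one-hot similarity is 0:
print(cosine(one_hot("car"), one_hot("automobile")))  # 0.0 -- term mismatch
```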
  12. Distributional Representations
     Define words by their co-occurrence signatures
     (slide courtesy of Richard Socher (Stanford)’s NAACL tutorial)
  13. Popular Word Embeddings Today
     • word2vec (Mikolov et al., 2013) – sliding window
       – CBOW: predict center word given window context
       – Skip-gram: predict context given center word
     • See also: GloVe (Pennington et al., 2014)
       – Matrix factorization
     deeplearning4j.org/word2vec
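For readers who want to experiment, a minimal training sketch for both word2vec variants, assuming the gensim 4.x API and a toy stand-in corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice, use millions of tokenized sentences.
sentences = [
    ["neural", "networks", "learn", "word", "embeddings"],
    ["word", "embeddings", "help", "information", "retrieval"],
]

# CBOW (sg=0): predict the center word from its window context.
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)

# Skip-gram (sg=1): predict the context words from the center word.
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

print(cbow.wv["word"].shape)                      # (100,)
print(skipgram.wv.similarity("word", "embeddings"))
```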
  14. Longer History, Other Alternatives
     • Clinchant and Perronnin (2013) use classic LSI (Deerwester et al., 1990), then convert to fixed-length Fisher Vectors (FVs)
     • Lioma et al. (2015) build on Kiela and Clark (2013)’s prior work in distributional semantics
     • Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996); see also (Bruza and Song, 2002)
       – Probabilistic HAL (Azzopardi et al., 2005)
       – Zuccon et al. (2015) compare HAL vs. word2vec
  15. Active Discriminative Text Representation Learning
     Joint work: Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
  16. Active Discriminative Text Representation Learning
     Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212
     • Idea: Select the next item to label so as to first optimize the feature representation (i.e., word embeddings) before optimizing the model that uses these features
     • Approach: Expected Gradient Length (EGL), sentences vs. documents
       – EGL-word: take the expected gradient w.r.t. the embeddings only
       – EGL-sm: take the expected gradient w.r.t. the softmax-layer parameters only
       – EGL-word-doc: normalize each word’s gradient by its DF & sum over the gradients for the top-k words instead of using the max only
       – EGL-Entropy-Beta: balance expected updates to word gradients (i.e., EGL-word-doc) vs. instance uncertainty (entropy); first focus on embeddings, then later shift emphasis to entropy
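A rough sketch of the core EGL idea (not the authors' implementation): score each unlabeled example by the expected norm of the gradient its label would induce, taking the expectation under the model's current predictive distribution. The two function arguments are hypothetical hooks into whatever classifier is being trained:

```python
import numpy as np

def expected_gradient_length(unlabeled, predict_proba, grad_norm):
    """Score unlabeled items by expected gradient norm; label the argmax.

    predict_proba(x) -> array of class probabilities under the current model.
    grad_norm(x, y)  -> norm of the loss gradient (e.g., w.r.t. the word
                        embeddings, as in EGL-word) if x were labeled y.
    """
    scores = []
    for x in unlabeled:
        p = predict_proba(x)
        # Expectation over possible labels, weighted by predicted probability.
        scores.append(sum(p[y] * grad_norm(x, y) for y in range(len(p))))
    return int(np.argmax(scores))  # index of the most informative item
```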
  17. Active Discriminative Text Representation Learning: Results on Sentence Classification
  18. Active Discriminative Text Representation Learning: Results on Document Classification
     (both: Zhang, Lease, & Wallace, AAAI 2017 • https://arxiv.org/abs/1606.04212)
  19. Extending IR Models with Word Embeddings
  20. Recent IR Work with Word Embeddings
  21. Clinchant and Perronnin (2013)
     • Precedes word2vec
       – Uses classic LSI to induce distributional term representations
       – Reduces to fixed-length vectors via Fisher Kernel
       – Compares word vectors via cosine
     • Consistently worse than DFR baseline
  22. Ponte & Croft (1998): LM for IR
     P(D|Q) = [ P(Q|D) P(D) ] / P(Q)
            ∝ P(Q|D) P(D)   (for a fixed query)
            ∝ P(Q|D)        (assuming uniform P(D))
     P(Q|D) = ∏_{q∈Q} [ α P(q|D) + (1 − α) P(q|C) ]
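A minimal sketch of scoring with this linearly smoothed query-likelihood model, using toy term-count dictionaries (function and variable names are mine, not from the talk):

```python
import math

def ql_score(query, doc_tf, doc_len, coll_tf, coll_len, alpha=0.5):
    """log P(Q|D) under linearly (Jelinek-Mercer) smoothed query likelihood."""
    score = 0.0
    for q in query:
        p_doc  = doc_tf.get(q, 0) / doc_len    # P(q|D), maximum likelihood
        p_coll = coll_tf.get(q, 0) / coll_len  # P(q|C), background model
        score += math.log(alpha * p_doc + (1 - alpha) * p_coll)
    return score

# Toy example: one document plus a small background collection.
doc_tf  = {"neural": 2, "retrieval": 1}
coll_tf = {"neural": 5, "retrieval": 9, "ranking": 4}
print(ql_score(["neural", "ranking"], doc_tf, 3, coll_tf, 18))
```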
  23. Berger & Lafferty (1999)
     • IR as Statistical Translation
       – Document d contains word w
       – w is “translated” to observed query word q
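The ranking idea behind the translation model can be sketched as marginalizing over document words, P(q|d) = Σ_w P(q|w) P(w|d); here is a toy version with a hand-made translation table (all probabilities hypothetical):

```python
# P(q|w): probability that document word w "translates" to query word q.
translate = {
    ("car", "automobile"): 0.3,  # synonyms get nonzero translation mass
    ("car", "car"): 0.6,
}

def p_query_word(q, doc_words):
    """P(q|d) = sum over document words w of P(q|w) * P(w|d)."""
    p_w_d = 1.0 / len(doc_words)  # uniform P(w|d) for this toy sketch
    return sum(translate.get((q, w), 0.0) * p_w_d for w in doc_words)

# The query word "car" matches a document containing only "automobile":
print(p_query_word("car", ["automobile", "engine"]))  # 0.15
```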
  24. GLM: Ganguly et al., SIGIR 2015
  25. GLM: Ganguly et al., SIGIR 2015 (cont.)
  26. NTLM: Zuccon et al., ADCS 2015
  27. DeepTR: Zheng & Callan, SIGIR 2015
     • Supervised learning of effective term weights
       – Like RegressionRank (Lease et al., ECIR 2009), (Lease, SIGIR 2009) but without feature engineering
     • Represent each query term in context by the average query embedding minus the term embedding
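A sketch of that contextual term representation, assuming a dict of pretrained embeddings (helper names are mine):

```python
import numpy as np

def deeptr_features(query_terms, emb):
    """For each query term: (mean of all query-term vectors) - (term vector)."""
    vecs = np.array([emb[t] for t in query_terms])
    query_mean = vecs.mean(axis=0)
    return {t: query_mean - emb[t] for t in query_terms}

# These per-term feature vectors then feed a supervised regressor
# that predicts an effective weight for each query term.
```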
  28. DESM: Mitra et al., arXiv 2016
     • "A crucial detail often overlooked… Word2Vec [produces] two different sets of vectors (…IN and OUT embedding spaces)… By default, Word2Vec discards OUT ... and outputs only IN..."
     • “…IN-IN and OUT-OUT cosine similarities are high for words …similar by function or type (typical) …IN-OUT cosine similarities are high between words that often co-occur in the same query or document (topical).”
     • Compute query & document embeddings by averaging over terms
     • Map query words to the IN space and document words to the OUT space
     • Compare training on a Bing query log vs. a Web corpus
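A minimal sketch of the resulting dual-embedding scoring: each query term's IN vector is compared against the centroid of the document's OUT vectors (a simplification of the paper's formulation; names are mine):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def desm_score(query_terms, doc_terms, emb_in, emb_out):
    """Average cosine between each query IN vector and the doc OUT centroid."""
    doc_centroid = np.mean([emb_out[t] for t in doc_terms], axis=0)
    return float(np.mean([cosine(emb_in[q], doc_centroid) for q in query_terms]))
```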
  29. BWESG: Vulic & Moens, SIGIR 2015
     • As typical, estimate query/document vectors by a simple average of constituent term vectors
     • Alternative: use a weighted average, weighting each term by its information-theoretic self-information (sketch below)
       – Like IDF, expected to indicate term importance
     • More on BWESG later…
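Self-information here is -log P(w|C) under a unigram collection model, so rarer terms contribute more to the average; a small sketch assuming pretrained vectors and collection term counts:

```python
import math
import numpy as np

def weighted_text_vector(terms, emb, coll_tf, coll_len):
    """Average term vectors weighted by self-information, -log P(w|C)."""
    weights = [-math.log(coll_tf[t] / coll_len) for t in terms]
    vecs = np.array([emb[t] for t in terms])
    return np.average(vecs, axis=0, weights=weights)
```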
  30. Diaz, Mitra, & Craswell, ACL 2016
     • Learn topical word embeddings at query-time
       – New flavor of the classic IR global vs. local tradeoff
       – Compare use of the collection vs. external corpora
     • No comparison to pseudo-relevance feedback
  31. Zamani & Croft, ICTIR 2016a (Est.)
     • Provide theoretical justification for estimating phrasal vectors by averaging term vectors
     • Transform cosine vector-similarity scores by softmax vs. sigmoid (consistently better; sketch below)
       – No regular cosine results reported
     • PQV: weighted average of (expanded) query word vectors based on PRF
       – No regular PRF results reported
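The score transformations themselves are simple to sketch; this is my reading of the idea, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid_transform(cos_scores, a=10.0, c=0.5):
    """Sharpen cosine scores; a (slope) and c (shift) are free parameters."""
    return 1.0 / (1.0 + np.exp(-a * (np.asarray(cos_scores) - c)))

def softmax_transform(cos_scores):
    """Normalize cosine scores into a distribution over candidate terms."""
    e = np.exp(np.asarray(cos_scores))
    return e / e.sum()
```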
  32. Zamani & Croft, ICTIR 2016a (Emb.)
     • Propose 2 methods for word embedding-based query expansion (EQE)
       – Vary in independence assumptions, akin to RM1 and RM2 in (Lavrenko & Croft, 2001)
     • Propose an embedding-based relevance model (ERM)
       – Can linearly mix (ML, EQE1, or EQE2) + ERM
     • Strong evaluation vs. ML and RM3 baselines
  33. Ordentlich et al., CIKM 2016: Word2vec at Scale in Industry
  34. Cross-Lingual IR with Bilingual Word Embeddings
  35. (figure-only slide)
  36. (figure-only slide)
  37. BWESG: Vulic & Moens, SIGIR 2015
  38. Ye et al., ICSE 2016: Finding Bugs
     • Given a textual bug report (query), find the software files needing to be fixed (documents)
       – Saha, Lease, Khurshid, Perry (ASE 2013)
     • Augment the Skip-gram model to predict all code tokens from each text word, and all text words from each code token
  39. Discussion
  40. Word Embeddings: Many Details
     • Use word2vec (CBOW or skip-gram), GloVe, or something else?
     • How to set hyper-parameters and select training data/corpora?
     • Can multiple embeddings be selected dynamically or combined?
     • Blending BIG out-of-domain data with small in-domain data?
     • Tradeoff of off-the-shelf embeddings vs. re-training (fine-tuning or from scratch) for a target domain?
     • How much does the task or downstream architecture matter?
     • How to handle out-of-vocabulary (OOV) query terms?
  41. CBOW, SG, GloVe, or …?
     • Not clear that any single neural embedding or set of embeddings performs best in all cases
     • Neural vs. traditional distributed representations?
       – Zuccon et al. (2015): “it is not clear yet whether neural inspired models are generally better than traditional distributional semantic methods.”
     • Models that jointly exploit multiple sets of embeddings may be worth further pursuing…
       – Zhang et al., 2016b
       – Neelakantan et al., 2014
  42. Which Training Data to Use?
     • Zuccon et al. (2015): “the choice of corpus used to construct word embeddings had little effect on retrieval results.”
     • Zamani and Croft (2016b) train GloVe on three external corpora and report, “there is no significant differences between the values obtained by employing different corpora for learning the embedding vectors.”
     • Zheng and Callan (2015): “[the system] performed equally well with all three external corpora… although no external corpus was best for all datasets... corpus-specific word vectors were never best... given the wide range of training data sizes… from 250 million words to 100 billion words – it is striking how little correlation there is between search accuracy and the amount of training data.”
  43. Training Embeddings Across Genres
     • Query logs (Mitra et al.; Sordoni et al.)
     • Community Q&A (Zhou et al., 2015)
     • Venue comments (Manotumruksa et al., 2016)
     • Medical texts (De Vine et al., 2014)
     • Programming language & comments (Ye et al., ICSE 2016)
     • Knowledge Base (Nguyen et al., Neu-IR 2016)
  44. Global vs. Local, Revisited
     • Global word embeddings, trained without reference to queries, vs. local methods like PRF for exploiting query context, appear similarly limited as past approaches such as topic modeling
       – e.g., Yi & Allan (2009) compare topic modeling vs. PRF
     • When Neural IR has helped ad-hoc search, improvements seem modest compared to known query expansion techniques (e.g., PRF)
     • Diaz et al. (2016) learn topic-specific embeddings
  45. Handling OOV Query Terms
     • Easy option: ignore them (some have done this)
       – The user might not be happy…
     • Use a unique random embedding for each OOV term (see the sketch below)
       – If the same term appears in query & document, it will match and contribute toward the score
       – Unlikely to yield close matches with other terms
     • Misspellings and social spellings (e.g., “kool”)
       – Standardize, or use a character-based model
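One way to realize the unique-random-embedding option is to seed each OOV vector with a hash of the term itself, so query-side and document-side occurrences of the same unseen string map to the identical vector; a sketch (helper names are mine):

```python
import hashlib
import numpy as np

def lookup(term, emb, dim=100):
    """Return a pretrained vector, or a stable random one for OOV terms."""
    if term in emb:
        return emb[term]
    # Seed from the term itself: the same OOV string always yields the
    # same vector, so it can match itself but rarely anything else.
    seed = int(hashlib.md5(term.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)
```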
  46. Going Beyond Bag-of-Words
     • How to represent longer units of text?
       – Simple answer: average word embedding vectors
     • Ganguly et al. (2016): “... compositionality of the word vectors [only] works well when applied over a relatively small number of words... [and] does not scale well for a larger unit of text, such as passages or full documents, because of the broad context present within a whole document.”
     • Common future work: embedding phrases…
  47. Measuring Textual Similarity
     • Simplest: average vectors & take the cosine
     • Ganguly et al. (2016): document is a mixture of Gaussians; word embeddings are samples
     • Zamani & Croft (2016): sigmoid and softmax transformations of cosine similarity
     • Kenter & de Rijke (2015): BM25 extension to incorporate word embeddings
     • Kusner et al. (2015): Word Mover’s Distance (WMD); example below
       – Kim et al. (2016): WMD for query-document distance
     • Fisher Kernel approaches: Zhang et al. (2014), Clinchant & Perronnin (2013), Zhou et al. (2015)
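Several of these measures ship in standard toolkits; for example, gensim exposes Word Mover's Distance over trained vectors (assuming its optional pyemd/POT dependency is installed; the corpus here is a toy placeholder):

```python
from gensim.models import Word2Vec

sentences = [["obama", "speaks", "media", "illinois"],
             ["president", "greets", "press", "chicago"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

# WMD: the minimum cumulative distance the words of one text must "travel"
# in embedding space to reach the words of the other.
d = model.wv.wmdistance(sentences[0], sentences[1])
print(d)
```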
  48. Toward End-to-End Neural IR Architectures
  49. (figure-only slide)
  50. End-to-End Representation Learning vs. Feature Engineering
     • e.g., CDNN: Severyn & Moschitti, SIGIR 2015
  51. DSSM: Huang et al., CIKM 2013
  52. CLSM: Shen et al., CIKM 2014
  53. DRMM: Guo et al., CIKM 2016
     • Supervised re-ranking of the top 2K query-likelihood (QL) results
  54. Gupta et al., SIGIR 2014
     • Mixed-script IR (Hindi-English)
     • Using FIRE 2013 data
  55. Cohen et al., Neu-IR 2016
     • Compare performance of deep and traditional models across texts of varying lengths
     • Deep models are often better when the text is short
  56. Future Outlook
  57. Looking for Gains (in all the wrong places)?
     • Much Neural IR work to date has investigated traditional document retrieval (e.g., ad-hoc), seeking improved retrieval accuracy
     • This framing may be too narrow
       – e.g., Hang Li’s 2016 Neu-IR talk on other search scenarios
       – IMHO: We have already invested decades in heavily optimizing vector representations of queries & documents for matching, including many approaches for addressing term mismatch: strong baselines!
     • The “real” strength of Neural IR may lie elsewhere, in enabling a new generation of search scenarios and modalities, such as:
       – Search via conversational agents (Yan et al., 2016)
       – Multi-modal retrieval (Ma et al., 2015c,a)
       – Knowledge-based search (Nguyen et al., 2016)
       – Synthesis of relevant documents (Lioma et al., 2016)
       – Future search scenarios, yet to be identified & investigated
  58. Industrial Research vs. Academic Research
     • With efficacy driven by “big data”, perhaps massive query logs will be needed to realize Neural IR’s true potential?
     • Will deep learning further divide industrial vs. academic research?
  59. Supervised vs. Unsupervised Deep Learning
     • e.g., supervised learning-to-rank (Liu, 2009) vs. unsupervised language or query modeling: Mitra & Craswell (2015); Mitra (2015); Sordoni et al. (2015)
     • LeCun et al. (Nature, 2015) wrote, “we expect unsupervised learning to become far more important in the longer term”
     • The rise of the Web drove unsupervised and semi-supervised approaches via the vast unlabeled data it made available
       – Neural IR may best succeed where the biggest data is naturally found, e.g., private commercial search logs & public Web content
  60. Going Deeper with Characters
     “The dominant approach for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which are very successful in computer vision. We present a new architecture for text processing which operates directly on the character level and uses only small convolutions and pooling operations. We are able to show that the performance of this model increases with the depth: using up to 29 convolutional layers, we report significant improvements over the state-of-the-art on several public text classification tasks. To the best of our knowledge, this is the first time that very deep convolutional nets have been applied to NLP.” (Conneau et al., 2016)
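To make the quoted approach concrete, a miniature character-level CNN in PyTorch (a toy sketch of my own, far shallower than the 29-layer model the quote describes):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Tiny char-level ConvNet: embed characters, convolve, pool, classify."""
    def __init__(self, n_chars=70, n_classes=4, dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        self.convs = nn.Sequential(
            nn.Conv1d(dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over the character sequence
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, char_ids):                  # (batch, seq_len) int ids
        x = self.embed(char_ids).transpose(1, 2)  # (batch, dim, seq_len)
        return self.fc(self.convs(x).squeeze(-1))

logits = CharCNN()(torch.randint(0, 70, (2, 100)))
print(logits.shape)  # torch.Size([2, 4])
```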
  61. Resources
     • http://deeplearning.net
  62. Neural Information Retrieval: A Literature Review
     Ye Zhang et al. • https://arxiv.org/abs/1611.06792 (posted 18 November 2016)
  63. Neural IR Source Code Released
  64. Word Embeddings Released
  65. Thank You!
     Matt Lease • ml@utexas.edu • @mattlease
     UT Austin IR Lab: ir.ischool.utexas.edu
     Slides: slideshare.net/mattlease
