- 1. Deep Learning for Search (Alternative title: “Neural Information Retrieval”) Bhaskar Mitra Principal Applied Scientist, Microsoft PhD Student, University College London @UnderdogGeek bmitra@microsoft.com Background image modified from source: https://commons.wikimedia.org/wiki/File:Howrah_Bridge_from_the_western_bank_of_the_Ganges.jpg
- 2. I am an applied researcher at Bing, based at Microsoft Research Montreal. I previously worked for Microsoft in Hyderabad, Seattle, and Cambridge. I am a part-time PhD candidate at University College London, and my research is on neural methods for information retrieval. I was born and grew up in Kolkata.
- 3. Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks.
- 4. THE STATE OF NEURAL INFORMATION RETRIEVAL GROWING PUBLICATION POPULARITY AT TOP IR CONFERENCES STRONG PERFORMANCE AGAINST TRADITIONAL METHODS IN TREC 2019
- 5. RESOURCES Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com
- 6. AGENDA Let’s focus on the fundamentals! Please feel free to interrupt and ask lots of questions!
- 7. THE SEARCH TASK 10 MINS (10:05 AM - 10:15 AM)
- 8. INFORMATION RETRIEVAL (IR) The user has an information need. There exists a collection of information resources. IR is the activity of retrieving the information resources relevant to the information need.
- 9. EXAMPLE OF AN IR TASK (WEB SEARCH) The user expresses an information need as a short textual query. The search engine retrieves the top relevant web documents as information resources. We will use web search as the main example of an IR task in the rest of this lecture. [Figure: an information need is expressed as a query; the retrieval system indexes a document corpus and returns a ranked document list; relevant documents satisfy the information need.]
- 10. DESIDERATA Decades of IR research have identified some key factors that text retrieval models should consider. Traditional IR models typically incorporate one or more of these: term frequency, term weighting, term saturation, document length, term proximity, term position, vocabulary mismatch, and term aboutness.
- 11. DESIDERATA: Term frequency. A document that contains more occurrences of the query term(s) is more likely to be relevant. Tip: consider term frequency (TF).
- 12. DESIDERATA: Term weighting. A rare term (e.g., “msmarco”) is likely to be more informative than a common term (e.g., “and”). Tip: consider inverse document frequency (IDF).
- 13. DESIDERATA: Term saturation. A term should not contribute disproportionately. An increase in TF should have a larger impact for smaller TFs. Tip: put a saturation function over the TF.
- 14. DESIDERATA: Document length. A document containing more non-relevant terms is likely to be less relevant. Tip: perform document length normalization.
- 15. DESIDERATA: Term proximity. A document containing query terms in close proximity is likely to be more relevant than one where the terms occur far away from each other. Tip: consider proximity features.
- 16. DESIDERATA: Term position. Term matches earlier in the document may indicate a higher likelihood of the document being relevant. Tip: consider the position of term matches.
- 17. DESIDERATA: Vocabulary mismatch. The query and the document may refer to the same concept using different vocabularies (e.g., the query “uk prime minister” and a document that mentions “theresa may”). Tip: consider expanding the query or document, or matching query and document terms in a latent space.
- 18. DESIDERATA: Term aboutness. By inspecting other terms in the document we may infer if the document is about the query term (e.g., for the query “albuquerque”, a passage about Albuquerque versus a passage not about Albuquerque). Tip: consider expanding the query or matching the query terms with the document terms in a latent space.
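To make the first four desiderata concrete, here is a minimal sketch of a BM25-style scoring function (BM25 is named later in the deck as a query-dependent feature). The function name, the particular IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration, not something the slides specify.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document for a query with a BM25-style function (sketch)."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)  # number of documents containing the term
        # Term weighting: rare terms get a larger IDF.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Document length normalization: longer documents are penalized.
        norm = 1.0 - b + b * doc_len / avg_doc_len
        # Term saturation: the TF component stays below (k1 + 1).
        score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
    return score
```

With equal-length documents, going from one to two matches raises the score more than going from five to six, which is the saturation behaviour described above.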
- 19. EXAMPLES OF RANKING METRICS Discounted Cumulative Gain (DCG): $DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$ Reciprocal Rank (RR): $RR@k = \max_{1 \le i \le k} \frac{rel_i}{i}$
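The two metrics can be sketched in a few lines. Here `rels` is assumed to be the list of relevance labels in ranked order (graded for DCG, binary for RR); the function names are illustrative.

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum over ranks i = 1..k of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """RR@k = max over ranks i = 1..k of rel_i / i, assuming binary labels."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)),
               default=0.0)
```

For example, `dcg_at_k([3, 0, 1], 3)` adds 7/1 for the first result and 1/2 for the third, and `rr_at_k([0, 0, 1, 1], 4)` picks out the first relevant result at rank 3.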
- 20. FUNDAMENTALS OF NEURAL NETWORKS 30 MINS (10:15 AM - 10:45 AM)
- 21. NEURAL NETWORKS Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ). Popular choices for σ: Tanh, ReLU. Parameters are trained using backpropagation. E2E training over millions of samples in batched mode. Many choices of architecture and hyper-parameters. [Figure: input → linear transform → non-linearity → linear transform → non-linearity → predicted output (forward pass); the loss between predicted and expected output is propagated back (backward pass).]
- 22. FUNDAMENTAL MACHINE LEARNING TASKS
- 23. SQUARED LOSS The squared loss is a popular loss function for regression tasks
- 24. THE SOFTMAX FUNCTION In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
- 25. CROSS ENTROPY The cross entropy between two probability distributions $p$ and $q$ over a discrete set of events is given by $CE(p, q) = -\sum_i p_i \log q_i$. If $p_{correct} = 1$ and $p_i = 0$ for all other values of $i$, then $CE(p, q) = -\log q_{correct}$.
- 26. CROSS ENTROPY WITH SOFTMAX LOSS Cross entropy with softmax is a popular loss function for classification
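The loss functions from the last few slides can be sketched with the standard library alone. The function names are illustrative, and the max-subtraction inside `softmax` is a common numerical-stability trick that the slides do not mention.

```python
import math

def squared_loss(y, y_hat):
    """Squared error between a target and a prediction (regression)."""
    return (y - y_hat) ** 2

def softmax(scores):
    """Normalize raw class scores into a probability distribution."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """CE(p, q) = -sum_i p_i * log(q_i); terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def softmax_cross_entropy(scores, correct):
    """Cross entropy against a one-hot target: -log softmax(scores)[correct]."""
    return -math.log(softmax(scores)[correct])
```

With a one-hot `p`, `cross_entropy(p, softmax(scores))` and `softmax_cross_entropy(scores, correct)` give the same value.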
- 27. STOCHASTIC GRADIENT DESCENT (SGD) Task: regression. Training data: $(x, y)$ pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node). Learnable parameters: $w_1, b_1, w_2, b_2$. Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, $l = (y - y_2)^2$. Goal: iteratively update the learnable parameters such that the loss $l$ is minimized. Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$): $\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$. Update the parameter value based on the gradient with $\eta$ as the learning rate: $w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$ …and repeat.
- 28. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 1 (substitute the loss): $\frac{\partial l}{\partial w_1} = \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
- 29. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 2 (differentiate the loss): $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
- 30. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 3 (substitute $y_2$): $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
- 31. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 4 (differentiate the output layer): $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial y_1}{\partial w_1}$
- 32. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 5 (substitute $y_1$): $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1}$
- 33. STOCHASTIC GRADIENT DESCENT (SGD) Gradient derivation, step 6 (differentiate the hidden layer): $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \left(1 - \tanh^2(w_1 \cdot x + b_1)\right) \times x$
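The chain-rule result for this one-hidden-node network can be checked numerically. The sketch below implements the forward pass, the analytic gradient w.r.t. $w_1$, and one SGD update; the function names and the default learning rate are assumptions for illustration.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Forward pass: y1 = tanh(w1*x + b1), y2 = tanh(w2*y1 + b2)."""
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

def loss(x, y, w1, b1, w2, b2):
    """Squared loss l = (y - y2)^2."""
    _, y2 = forward(x, w1, b1, w2, b2)
    return (y - y2) ** 2

def grad_w1(x, y, w1, b1, w2, b2):
    """dl/dw1 via the chain rule, using tanh'(z) = 1 - tanh(z)^2."""
    y1, y2 = forward(x, w1, b1, w2, b2)
    dl_dy2 = -2.0 * (y - y2)
    dy2_dy1 = (1.0 - y2 ** 2) * w2   # y2 is tanh(w2*y1 + b2)
    dy1_dw1 = (1.0 - y1 ** 2) * x    # y1 is tanh(w1*x + b1)
    return dl_dy2 * dy2_dy1 * dy1_dw1

def sgd_step_w1(x, y, w1, b1, w2, b2, eta=0.1):
    """One SGD update: w1_new = w1_old - eta * dl/dw1."""
    return w1 - eta * grad_w1(x, y, w1, b1, w2, b2)
```

Comparing `grad_w1` against a finite-difference estimate of `loss` is a cheap way to catch a mistake in a hand-derived gradient.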
- 34. COMPUTATION NETWORKS The “Lego” approach to specifying DNN architectures Library of computation nodes, each node defines logic for: 1. Forward pass: compute output given input 2. Backward pass: compute gradient of loss w.r.t. inputs, given gradient of loss w.r.t. outputs 3. Parameter gradient: compute gradient of loss w.r.t. parameters, given gradient of loss w.r.t. outputs Chain nodes to create bigger and more complex networks
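A minimal sketch of the “Lego” idea for scalar inputs: each node stores what it needs in the forward pass and returns the gradient w.r.t. its input in the backward pass. The class names and the `run_chain` helper are hypothetical and not taken from any particular toolkit.

```python
import math

class Linear:
    """Node computing y = w * x + b for scalars."""
    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return self.w * x + self.b

    def backward(self, grad_out):
        # Parameter gradients, given the gradient of the loss w.r.t. the output.
        self.grad_w = grad_out * self.x
        self.grad_b = grad_out
        # Gradient of the loss w.r.t. the input.
        return grad_out * self.w

class Tanh:
    """Node computing y = tanh(x); it has no parameters."""
    def forward(self, x):
        self.y = math.tanh(x)
        return self.y

    def backward(self, grad_out):
        return grad_out * (1.0 - self.y ** 2)

def run_chain(nodes, x, y):
    """Forward through the chain, then backward from a squared loss."""
    out = x
    for node in nodes:
        out = node.forward(out)
    loss = (y - out) ** 2
    grad = -2.0 * (y - out)  # gradient of the loss w.r.t. the final output
    for node in reversed(nodes):
        grad = node.backward(grad)
    return loss, grad
```

Chaining `[Linear, Tanh, Linear, Tanh]` reproduces the small regression network from the SGD slides, and the first node's `grad_w` then holds the gradient w.r.t. $w_1$.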
- 35. TOOLKITS A diverse set of options to choose from! Figure from https://towardsdatascience.com/battle-of- the-deep-learning-frameworks-part-i-cff0e3841750
- 36. TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH First, we define the model architecture Next, we specify loss function and optimization algorithm Finally, loop over training data to optimize model parameters https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
- 37. REALLY DEEP NEURAL NETWORKS (Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
- 38. WHY ADDING DEPTH HELPS http://playground.tensorflow.org
- 39. VISUAL MOTIVATION FOR HIDDEN UNITS Consider the following “toy” challenge for classifying tech queries. Vocab: {surface, kerberos, book, library}. Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗. Input features (surface, kerberos, book, library) → label: (1, 0, 1, 0) ✓; (1, 1, 0, 0) ✗; (0, 1, 0, 1) ✓; (0, 0, 1, 1) ✗. We can’t separate these using a linear model! But let’s consider a tiny neural network with one hidden layer… [Figure: the four inputs connect to hidden units H1 and H2; the edge weights shown are +0.5, -1, and +1.]
- 40. VISUAL MOTIVATION FOR HIDDEN UNITS Consider the following “toy” challenge for classifying tech queries. Vocab: {surface, kerberos, book, library}. Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗. But let’s consider a tiny neural network with one hidden layer… Or more succinctly, input features (surface, kerberos, book, library) → hidden layer (H1, H2) → label: (1, 0, 1, 0) → (1, 0) ✓; (1, 1, 0, 0) → (0, 0) ✗; (0, 1, 0, 1) → (0, 1) ✓; (0, 0, 1, 1) → (0, 0) ✗. The hidden layer outputs can be separated using a linear model! [Figure: the four inputs connect to hidden units H1 and H2; the edge weights shown are +0.5, -1, and +1.]
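The toy network can be run directly. The assignment of weights to edges below is an assumed reading of the figure (+0.5 from each query's own two terms, -1 from the other two, +1 from each hidden unit to the output); the ReLU activation and the 0.5 decision threshold are also assumptions, chosen because they reproduce the H1 and H2 values in the table.

```python
def relu(z):
    return max(0.0, z)

# Input order: (surface, kerberos, book, library).
# Assumed edge weights; the slide only shows the values +0.5, -1, and +1.
W_H1 = (0.5, -1.0, 0.5, -1.0)   # fires for "surface book"
W_H2 = (-1.0, 0.5, -1.0, 0.5)   # fires for "kerberos library"
W_OUT = (1.0, 1.0)

def hidden(x):
    """Compute the two hidden unit activations for a 4-dim binary input."""
    h1 = relu(sum(w * xi for w, xi in zip(W_H1, x)))
    h2 = relu(sum(w * xi for w, xi in zip(W_H2, x)))
    return h1, h2

def classify(x):
    """Linear model over the hidden layer, with an assumed threshold of 0.5."""
    h1, h2 = hidden(x)
    return W_OUT[0] * h1 + W_OUT[1] * h2 > 0.5
```

The four inputs map to hidden activations (1, 0), (0, 0), (0, 1), and (0, 0), so a single threshold on H1 + H2 separates the two classes.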
- 41. WHY ADDING DEPTH HELPS Deeper networks can split the input space into many more (non-independent) linear regions than shallow networks. Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks. NIPS 2014.
- 42. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019. THE LOTTERY TICKET HYPOTHESIS
- 43. BIAS-VARIANCE TRADE-OFF IN THE DEEP LEARNING ERA Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
- 44. LEARNING TO RANK 35 MINS (10:45 AM - 11:20 AM)
- 45. MOST IR SYSTEMS PRESENT RANKED LISTS OF RETRIEVED INFORMATION ARTIFACTS
- 46. THE UNREASONABLE EFFECTIVENESS OF SIMPLE LTR BASED APPROACHES
- 47. LEARNING TO RANK (LTR) ”... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
- 48. LEARNING TO RANK (LTR) L2R models represent a rankable item (e.g., a document) given some context (e.g., a user-issued query) as a numerical vector $x \in \mathbb{R}^n$. The ranking model $f: x \to \mathbb{R}$ is trained to map the vector to a real-valued score such that relevant items are scored higher. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
- 49. WHY IS RANKING CHALLENGING? Rank-based metrics, such as DCG or MRR, are non-smooth and non-differentiable.
- 50. APPROACHES Liu [2009] categorizes different LTR approaches based on training objectives. Pointwise approach: the relevance label $y_{q,d}$ is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict $y_{q,d}$ given $x_{q,d}$. Pairwise approach: the label is a pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$). This reduces to binary classification to predict the more relevant document. Listwise approach: directly optimize for a rank-based metric, such as NDCG. This is difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
- 51. FEATURES Traditional L2R models employ hand-crafted features that encode IR insights. They can often be categorized as: query-independent or static features (e.g., incoming link count and document length); query-dependent or dynamic features (e.g., BM25); query-level features (e.g., query length).
- 52. FEATURES Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
- 53. POINTWISE OBJECTIVES Regression loss: given $\langle q, d \rangle$, predict the value of $y_{q,d}$, e.g., square loss for binary or categorical labels, $\mathcal{L}_{squared} = \lVert y_{q,d} - f(x_{q,d}) \rVert^2$, where $y_{q,d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
- 54. POINTWISE OBJECTIVES Classification loss: given $\langle q, d \rangle$, predict the class $y_{q,d}$, e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008], $\mathcal{L}_{CE}(q, d, y_{q,d}) = -\log p(y_{q,d} \mid q, d) = -\log \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}}$, where $s_{y_{q,d}}$ is the model’s score for label $y_{q,d}$. Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
- 55. PAIRWISE OBJECTIVES Given $\langle q, d_i, d_j \rangle$, predict the more relevant document. For $\langle q, d_i \rangle$ and $\langle q, d_j \rangle$: feature vectors $x_i$ and $x_j$; model scores $s_i = f(x_i)$ and $s_j = f(x_j)$. Pairwise loss minimizes the average number of inversions in ranking, i.e., $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is ranked higher than $d_i$. Pairwise loss generally has the following form [Chen et al., 2009]: $\mathcal{L}_{pairwise} = \phi(s_i - s_j)$, where $\phi$ can be: • Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000] • Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003] • Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005] • Others… Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
- 56. PAIRWISE OBJECTIVES RankNet loss: pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]. Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma (s_i - s_j)}}$. Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$. Computing the cross-entropy between $\bar{p}$ and $p$: $\mathcal{L}_{RankNet} = -\bar{p}_{ij} \log p_{ij} - \bar{p}_{ji} \log p_{ji} = -\log p_{ij} = \log\left(1 + e^{-\gamma (s_i - s_j)}\right)$. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
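The pairwise losses above differ only in the choice of $\phi$, which a short sketch makes explicit. The function names are illustrative; `ranknet_loss` is written via the predicted probability so that it can be compared against the logistic form.

```python
import math

def hinge(z):
    return max(0.0, 1.0 - z)

def exponential(z):
    return math.exp(-z)

def logistic(z):
    return math.log(1.0 + math.exp(-z))

def pairwise_loss(s_i, s_j, phi=logistic):
    """Generic pairwise loss phi(s_i - s_j), where d_i should rank above d_j."""
    return phi(s_i - s_j)

def ranknet_loss(s_i, s_j, gamma=1.0):
    """Cross-entropy between the desired (1, 0) and predicted pair probabilities."""
    p_ij = 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))
    return -math.log(p_ij)
```

With `gamma=1.0`, `ranknet_loss(s_i, s_j)` and `pairwise_loss(s_i, s_j, logistic)` agree, and all three choices of $\phi$ shrink as the margin $s_i - s_j$ grows.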
- 57. A GENERALIZED CROSS-ENTROPY LOSS An alternative loss function assumes a single relevant document $d^+$ and compares it against the full collection $D$. Predicted probabilities: $p(d^+ \mid q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. The cross-entropy loss is then given by $\mathcal{L}_{CE}(q, d^+, D) = -\log p(d^+ \mid q) = -\log \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider only a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
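A sketch of this loss with a handful of sampled negatives in place of the full collection. The function name and argument layout are assumptions; the log-sum-exp form is used for numerical stability.

```python
import math

def sampled_cross_entropy(s_pos, s_negs, gamma=1.0):
    """-log softmax probability of the relevant document among the candidates.

    s_pos is the model score for the relevant document d+, and s_negs holds
    the scores of the sampled non-relevant candidates.
    """
    scores = [gamma * s_pos] + [gamma * s for s in s_negs]
    m = max(scores)
    # log of the softmax denominator, computed stably
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]
```

With exactly one negative this reduces to the RankNet loss $\log(1 + e^{-\gamma (s^+ - s^-)})$, and adding more negatives can only increase the loss.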
- 58. LISTWISE OBJECTIVES [Figure from Burges, 2010: two rankings of the same documents; blue: relevant, gray: non-relevant.] NDCG and ERR are higher for the left ranking, but pairwise errors are fewer for the right. Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. But listwise metrics are non-continuous and non-differentiable. Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
- 59. LISTWISE OBJECTIVES Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t. model scores). 2. It is desired that the gradient be bigger for pairs of documents that produce a bigger impact in NDCG by swapping positions. LambdaRank loss: multiply actual gradients with the change in NDCG by swapping the rank positions of the two documents. Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
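A rough sketch of the LambdaRank idea under stated assumptions: `rels` and `scores` are listed in the current ranked order, the “actual gradient” is taken to be the RankNet gradient w.r.t. $s_i$, and the helper names are hypothetical.

```python
import math

def dcg(rels):
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels, start=1))

def delta_ndcg(rels, i, j):
    """Absolute change in NDCG from swapping the items at 0-based ranks i and j."""
    ideal = dcg(sorted(rels, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = list(rels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(rels)) / ideal

def lambda_ij(scores, rels, i, j, gamma=1.0):
    """RankNet gradient w.r.t. s_i (item i preferred over j), scaled by |delta NDCG|."""
    ranknet_grad = -gamma / (1.0 + math.exp(gamma * (scores[i] - scores[j])))
    return ranknet_grad * delta_ndcg(rels, i, j)
```

A swap near the top of the list changes NDCG more than the same swap further down, so the corresponding pair receives a larger gradient.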
- 60. LISTWISE OBJECTIVES According to the Luce model [Luce, 1959], given four items $\{d_1, d_2, d_3, d_4\}$ the probability of observing a particular rank-order, say $[d_2, d_1, d_4, d_3]$, is given by: $p(\pi \mid s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)}$ where $\pi$ is a particular permutation and $\phi$ is a transformation (e.g., linear, exponential, or sigmoid) over the score $s_i$ corresponding to item $d_i$. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model score and ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible. R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
- 61. LISTWISE OBJECTIVES Smooth DCG: Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores. This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
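The Luce model probability of a permutation can be sketched as follows. The function names are illustrative, the exponential is assumed as the default $\phi$, and items are indexed from 0 in the code.

```python
import itertools
import math

def luce_permutation_probability(scores, permutation, phi=math.exp):
    """Probability of observing `permutation` (item indices, best first)."""
    remaining = [phi(scores[i]) for i in permutation]
    prob = 1.0
    for pos in range(len(remaining)):
        # Pick the item at this position among all items not yet placed.
        prob *= remaining[pos] / sum(remaining[pos:])
    return prob

def all_permutation_probabilities(scores, phi=math.exp):
    """Distribution over every permutation; the cost grows factorially."""
    items = range(len(scores))
    return {perm: luce_permutation_probability(scores, perm, phi)
            for perm in itertools.permutations(items)}
```

Enumerating all permutations, as ListNet's full formulation requires, already means 24 terms for four items, which illustrates why restricting to the top-K items is attractive.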
- 62. BREAK 10 MINS (11:20 AM - 11:30 AM)
- 63. EMBEDDINGS 45 MINS (11:30 AM - 12:15 PM)
- 64. TYPES OF VECTOR REPRESENTATIONS Local (or one-hot) representation Every term in vocabulary T is represented by a binary vector of length |T|, where one position in the vector is set to one and the rest to zero Distributed representation Every term in vocabulary T is represented by a real-valued vector of length k. The vector can be sparse or dense. The vector dimensions may be observed (e.g., hand-crafted features) or latent (e.g., embedding dimensions).
- 65. Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
- 66. OBSERVED (OR EXPLICIT) DISTRIBUTED REPRESENTATIONS The choice of features is a key consideration. The distributional hypothesis states that terms that are used (or occur) in similar contexts tend to be semantically similar [Harris, 1954]. Firth [1957] famously popularized this idea of distributional semantics by stating “a word is characterized by the company it keeps”. Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954. Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford. Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
- 67. MINOR NOTE: SPOT THE DIFFERENCE! DISTRIBUTED REPRESENTATION Vector representations of items as combinations of different features or dimensions (as opposed to one-hot) DISTRIBUTIONAL SEMANTICS Linguistic items with similar distributions (e.g. context words) have similar meanings http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
- 68. EXAMPLE: TERM-CONTEXT VECTOR SPACE T: vocabulary, C: set of contexts, S: sparse matrix of size |T| x |C| with one row per term $t_i$, one column per context $c_j$, and cells $S_{ij}$ (PPMI: Positive Pointwise Mutual Information). Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
- 69. EXAMPLE: SALTON’S VECTOR SPACE D: collection, T: vocabulary, S: sparse matrix of size |D| x |T| with one row per document $d_i$, one column per term $t_j$, and cells $S_{ij}$ (weighted by IDF). G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
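A minimal sketch of how the cells of a term-context matrix might be filled with PPMI values from raw co-occurrence counts. The dense list-of-lists layout and the natural logarithm are assumptions for readability; a real implementation would use a sparse structure.

```python
import math

def ppmi_matrix(counts):
    """counts[i][j] is the co-occurrence count of term i with context j."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    ppmi = []
    for i, row in enumerate(counts):
        out = []
        for j, n in enumerate(row):
            if n == 0:
                out.append(0.0)  # unseen pairs stay at zero, keeping S sparse
                continue
            # PMI = log( p(t, c) / (p(t) * p(c)) )
            pmi = math.log((n * total) / (row_sums[i] * col_sums[j]))
            out.append(max(0.0, pmi))  # "positive" PMI clips negatives to zero
        ppmi.append(out)
    return ppmi
```

Terms and contexts that co-occur exactly as often as chance predicts get a PPMI of zero, so only informative associations survive in the matrix.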
- 70. NOTIONS OF SIMILARITY Two terms are similar if their feature vectors are close But different feature spaces may capture different notions of similarity Is Seattle more similar to… Sydney (similar type) or Seahawks (similar topic) Depends on your choice of features
- 71. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
- 72. NOTIONS OF SIMILARITY Topical or Syntagmatic similarity
- 73. NOTIONS OF SIMILARITY Typical or Paradigmatic similarity
- 74. NOTIONS OF SIMILARITY A mix of Topical and Typical similarity
- 75. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
- 76. RETRIEVAL USING VECTOR REPRESENTATIONS Map both query and candidate documents into the same vector space. Retrieve documents closest to the query, e.g., using Salton’s vector space model: $sim(q, d) = \frac{\vec{v}_q \cdot \vec{v}_d}{\lVert \vec{v}_q \rVert \, \lVert \vec{v}_d \rVert}$, where $\vec{v}_q$ and $\vec{v}_d$ are vectors of TF-IDF scores over all terms in the vocabulary. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
- 77. REGULARITIES IN OBSERVED FEATURE SPACES Some feature spaces capture interesting linguistic regularities, e.g., simple vector algebra in the term-neighboring term space may be useful for word analogy tasks. Omer Levy and Yoav Goldberg. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014.
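A small sketch of cosine similarity over sparse TF-IDF vectors, stored as dictionaries from term to weight. The helper names and the raw-count TF are assumptions; the `idf` table is taken as given.

```python
import math
from collections import Counter

def tfidf_vector(terms, idf):
    """Sparse TF-IDF vector: term -> (raw term frequency * IDF)."""
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(v_q, v_d):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v_d.get(t, 0.0) for t, w in v_q.items())
    norm_q = math.sqrt(sum(w * w for w in v_q.values()))
    norm_d = math.sqrt(sum(w * w for w in v_d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_q * norm_d)
```

Documents that share no terms with the query score exactly zero here, which is the vocabulary mismatch problem that latent representations aim to soften.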
- 78. EMBEDDINGS An embedding is a representation of items in a new space such that the properties of, and the relationships between, the items are preserved from the original representation. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
- 79. EMBEDDINGS e.g., 200-dimensional term embedding for “banana”
- 80. EMBEDDINGS Compared to observed feature spaces: • Embeddings typically have fewer dimensions • The space may have more disentangled principal components • The dimensions may be less interpretable • The latent representations may generalize better
- 81. What’s the advantage of latent vector spaces over observed feature spaces?
- 82. LET’S TAKE AN IR EXAMPLE In Salton’s vector space, both these passages are equidistant from the query “Albuquerque” A latent feature representation may put the first passage closer to the query because of terms like “population” and “area” Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
- 83. HOW TO LEARN TERM EMBEDDINGS? Multiple approaches have been proposed for learning embeddings from <term, context, count> data, arranged as a |T| x |C| matrix X where $X_{ij}$ is the count for term $t_i$ and context $c_j$. Popular approaches include matrix factorization or stochastic gradient descent (SGD).
- 84. LATENT SEMANTIC ANALYSIS (LSA) Perform SVD on X to obtain its low-rank approximation. This involves finding a solution to $X = U \Sigma V^{\top}$. The embedding for the $i$-th term is given by $\Sigma_k t_i$. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990.
- 85. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990. LATENT SEMANTIC ANALYSIS (LSA)
- 86. WORD2VEC Goal: a simple (shallow) neural model that learns from a billion-word-scale corpus. Predict the middle word from its neighbors within a fixed-size context window (CBOW), or the neighbors from the middle word (skip-gram). Two different architectures: 1. Skip-gram 2. CBOW. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
- 87. SKIP-GRAM Predict neighbor 𝑡𝑖+𝑗 given term 𝑡𝑖
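- 88. THE SKIP-GRAM LOSS $\mathcal{L}_{skip\text{-}gram} = -\frac{1}{|S|} \sum_{i=1}^{|S|} \sum_{-c \le j \le +c,\, j \ne 0} \log p(t_{i+j} \mid t_i)$, where $S$ is the set of all windows over the training text and $c$ is the number of neighbours we need to predict on either side of the term $t_i$. Full softmax is computationally impractical, so hierarchical softmax or negative sampling is employed instead.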
- 89. CONTINUOUS BAG-OF-WORDS (CBOW) Predict the middle term 𝑡𝑖 given {𝑡𝑖−𝑐, … , 𝑡𝑖−1, 𝑡𝑖+1, … , 𝑡𝑖+𝑐}
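- 90. THE CBOW LOSS $\mathcal{L}_{CBOW} = -\frac{1}{|S|} \sum_{i=1}^{|S|} \log p(t_i \mid t_{i-c}, \ldots, t_{i-1}, t_{i+1}, \ldots, t_{i+c})$ Note: from every window of text skip-gram generates 2 x c training samples whereas CBOW generates one, which is why CBOW trains faster than skip-gram.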
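The difference in the number of training samples is easy to see by generating them. This sketch only builds the (input, target) pairs for a window centred at position `i`; it does not train anything, and the function names are illustrative.

```python
def skipgram_samples(terms, i, c):
    """One (middle term, neighbour) pair per neighbour within the window."""
    return [(terms[i], terms[i + j])
            for j in range(-c, c + 1)
            if j != 0 and 0 <= i + j < len(terms)]

def cbow_sample(terms, i, c):
    """A single (list of neighbours, middle term) pair for the window."""
    context = [terms[i + j]
               for j in range(-c, c + 1)
               if j != 0 and 0 <= i + j < len(terms)]
    return (context, terms[i])
```

For a full window with `c = 2`, `skipgram_samples` returns four pairs while `cbow_sample` returns one; windows at the edges of the text contain fewer neighbours.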
- 91. WORD ANALOGIES WITH WORD2VEC W2v is popular for word analogy tasks But remember the same relationships also exist in the observed feature space, as we saw earlier
- 92. A MATRIX INTERPRETATION OF WORD2VEC Let $x_{ij}$ be the frequency of the pair $\langle t_i, t_j \rangle$ in the training data, arranged as a |T| x |T| matrix X. The loss can then be read as a cross-entropy error between the actual co-occurrence probability and the predicted co-occurrence probability.
- 93. GLOVE Replace the cross-entropy error with a squared error and apply a saturation function $f(\cdot)$ over $x_{ij}$: $\mathcal{L}_{GloVe} = \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} f(x_{ij}) \left( \log x_{ij} - w_i^{\top} w_j \right)^2$, a squared error between the actual co-occurrence ($\log x_{ij}$) and the predicted co-occurrence ($w_i^{\top} w_j$), weighted by the saturation function. Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
- 94. PARAGRAPH2VEC W2v style model where context is document, not neighboring term Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
- 95. RECAP: HOW TO LEARN TERM EMBEDDINGS? Learn from <term, context, count> data Choice of context (e.g., neighboring term or container document) defines what relationship you are modeling Choice of learning algorithm (e.g., matrix factorization or SGD) defines how well you model the relationship Choice of context and learning algorithm are independent – you can use matrix factorization with neighboring term context, or a w2v-style neural network with document context (e.g., paragraph2vec)
- 96. EXAMPLES OF TEXT EMBEDDINGS

  | Model | Embedding for | Source item | Target item | Learning model |
  | --- | --- | --- | --- | --- |
  | Latent Semantic Analysis, Deerwester et al. (1990) | Single word | Word (one-hot) | Document (one-hot) | Matrix factorization |
  | Word2vec, Mikolov et al. (2013) | Single word | Word (one-hot) | Neighboring word (one-hot) | Neural network (shallow) |
  | GloVe, Pennington et al. (2014) | Single word | Word (one-hot) | Neighboring word (one-hot) | Matrix factorization |
  | Semantic Hashing (auto-encoder), Salakhutdinov and Hinton (2007) | Multi-word text | Document (bag-of-words) | Same as source (bag-of-words) | Neural network (deep) |
  | DSSM, Huang et al. (2013), Shen et al. (2014) | Multi-word text | Query text (bag-of-trigrams) | Document title (bag-of-trigrams) | Neural network (deep) |
  | Session DSSM, Mitra (2015) | Multi-word text | Query text (bag-of-trigrams) | Next query in session (bag-of-trigrams) | Neural network (deep) |
  | Language Model DSSM, Mitra and Craswell (2015) | Multi-word text | Query prefix (bag-of-trigrams) | Query suffix (bag-of-trigrams) | Neural network (deep) |

- 97. DEEP NEURAL NETWORKS 45 MINS (12:15 PM - 1:00 PM)
- 98. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
- 99. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
- 100. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
- 101. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
- 102. LET’S TALK (BRIEFLY) ABOUT SUPERVISION FOR LEARNING TEXT EMBEDDINGS WITH DNNS Supervised approach: ideal if sufficient labeled training data is available for the target retrieval task. Unsupervised approach: e.g., training an auto-encoder or a language model on an unlabeled corpus. Hybrid approach: current state-of-the-art results have employed large-scale unsupervised pretraining, followed by sufficiently large-scale supervised fine-tuning towards the target task.
- 103. SIAMESE NETWORK Supervised model trained on $\langle q, d_1, d_2 \rangle$ where $d_1$ is relevant to $q$, but $d_2$ is non-relevant. Logistic loss is popularly used (think RankNet), where $sim(\vec{v}_q, \vec{v}_d)$ is the model score. Typically, the left and right models share similar architectures, and they may also share the learnable parameters.
- 104. AUTOENCODER Unsupervised models trained to minimize reconstruction errors Information Bottleneck method (Tishby et al., 1999) The bottleneck layer 𝑥 captures “minimal sufficient statistics” of 𝑣 and is a compressed representation of the same
- 105. LANGUAGE MODELING A family of language modeling tasks has been explored in the literature, including: • Predict next word in a sequence • Predict masked word in a sequence • Predict next sentence. Fundamentally the same idea as word2vec and older neural LMs, but with deeper models and considering dependencies across longer distances between terms. [Figure: the model receives w1, w2, [MASK], w4 and is trained with a loss to predict the masked word w3]
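As a rough illustration of the masked-word task, the sketch below builds training inputs and labels by randomly replacing tokens with a mask symbol. The function name `mask_tokens`, the masking rate `p`, and the `[MASK]` string are assumptions for illustration; real pretraining pipelines use more elaborate masking rules.

```python
import random

def mask_tokens(tokens, rng, p=0.15, mask_token="[MASK]"):
    """Return (inputs, labels); labels are None where no prediction is needed."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            inputs.append(mask_token)
            labels.append(tok)  # the model must recover this word
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

inputs, labels = mask_tokens(["w1", "w2", "w3", "w4"], random.Random(0), p=0.5)
```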
- 106. SHIFT-INVARIANT NEURAL OPERATIONS Detecting a pattern in one part of the input space is similar to detecting it in another Leverage redundancy by moving a window over the whole input space and then aggregate On each instance of the window a kernel—also known as a filter or a cell—is applied Different aggregation strategies lead to different architectures
- 107. CONVOLUTION Move the window over the input space, each time applying the same cell over the window. A typical cell operation can be $h = \sigma(WX + b)$. Full Input [words x in_channels] Cell Input [window x in_channels] Cell Output [1 x out_channels] Full Output [1 + (words - window) / stride x out_channels]
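A minimal sketch of the windowed cell above, assuming a sigmoid for $\sigma$ and plain Python lists for tensors. The name `conv1d` and the flattened weight layout are illustrative choices, not taken from the slides.

```python
import math

def conv1d(x, weights, bias, window, stride=1):
    """x: [words][in_channels]; weights: [out_channels][window * in_channels]."""
    out = []
    for start in range(0, len(x) - window + 1, stride):
        # Flatten the current window into one vector of length window * in_channels.
        flat = [v for row in x[start:start + window] for v in row]
        out.append([
            1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(w_row, flat)) + b)))
            for w_row, b in zip(weights, bias)
        ])
    return out  # [1 + (words - window) // stride][out_channels]
```

The same weights are reused at every window position, which is what makes the operation shift-invariant.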
- 108. POOLING Move the window over the input space, each time applying an aggregate function over each dimension within the window: $h_j = \max_{i \in win} X_{i,j}$ (max-pooling) or $h_j = \mathrm{avg}_{i \in win} X_{i,j}$ (average-pooling). Full Input [words x channels] Cell Input [window x channels] Cell Output [1 x channels] Full Output [1 + (words - window) / stride x channels]
- 109. CONVOLUTION W/ GLOBAL POOLING Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed length embedding for a variable length text Full Input [words x in_channels] Full Output [1 x out_channels]
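The pooling operations, and the global pooling variant, can be sketched as follows. The function name `pool1d` is hypothetical; global pooling is obtained here simply by setting the window to the full input length.

```python
def pool1d(x, window, stride=1, mode="max"):
    """x: [words][channels]; aggregates each channel within every window."""
    out = []
    for start in range(0, len(x) - window + 1, stride):
        cols = list(zip(*x[start:start + window]))  # one tuple per channel
        if mode == "max":
            out.append([max(c) for c in cols])
        else:
            out.append([sum(c) / len(c) for c in cols])
    return out

x = [[1, 4], [3, 2], [5, 0]]
local_max = pool1d(x, window=2)               # one row per window position
global_max = pool1d(x, window=len(x))         # single fixed-length row
```

Because the global variant always returns one row, the output size no longer depends on the number of words in the input.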
- 110. RECURRENCE Similar to a convolution layer but with an additional dependency on the previous hidden state. A simple cell operation is shown below, but others like LSTMs and GRUs are more popular in practice: $h_i = \sigma(WX_i + Uh_{i-1} + b)$. Full Input [words x in_channels] Cell Input [window x in_channels] + [1 x out_channels] Cell Output [1 x out_channels] Full Output [1 x out_channels]
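A minimal sketch of the simple recurrent cell, assuming a window of one word, `tanh` as the nonlinearity, and a zero initial hidden state. These choices and the name `rnn_encode` are assumptions for illustration.

```python
import math

def rnn_encode(xs, W, U, b):
    """xs: [words][in_channels]; W: [out][in]; U: [out][out]; b: [out]."""
    h = [0.0] * len(b)  # initial hidden state
    for x in xs:
        # The comprehension reads the previous h before h is rebound.
        h = [
            math.tanh(
                sum(w * v for w, v in zip(W[k], x))
                + sum(u * p for u, p in zip(U[k], h))
                + b[k]
            )
            for k in range(len(b))
        ]
    return h  # final state: [out_channels]
```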
- 111. RECURSIVE OR TREE-RNN Shared weights among all the levels of the tree. Cell can be an LSTM or as simple as $h = \sigma(WX + b)$. Full Input [words x channels] Cell Input [window x channels] Cell Output [1 x channels] Full Output [1 x channels]
- 112. ATTENTION Given a set of n items and an input context, produce a probability distribution $\{a_1, \ldots, a_i, \ldots, a_n\}$ of attending to each item as a function of similarity between a learned representation ($q$) of the context and learned representations ($k_i$) of the items: $a_i = \frac{\varphi(q, k_i)}{\sum_{j}^{n} \varphi(q, k_j)}$. The aggregated output is given by $\sum_{i}^{n} a_i \cdot v_i$. Full Input [words x in_channels], [1 x ctx_channels] Full Output [1 x out_channels] * When attending over a sequence (and not a set), the key $k$ and value $v$ are typically a function of the item and some encoding of the position
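The attention computation can be sketched as below, assuming the similarity function is the exponentiated dot product, so that normalizing yields a softmax. That choice, and the name `attend`, are assumptions; other similarity functions fit the same template.

```python
import math

def attend(q, keys, values):
    """Return (attention weights, weighted sum of values)."""
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, out
```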
- 113. SELF ATTENTION Given a sequence (or set) of n items, treat each item as the context at a time and attend over the whole sequence (or set), and repeat for all n items Full Input [words x in_channels] Full Output [words x out_channels]
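A minimal self-attention sketch: each item in turn acts as the context and attends over all items. For brevity it uses the raw item vectors as queries, keys, and values, which is a simplifying assumption; practical models apply learned projections first.

```python
import math

def self_attention(items):
    """items: [words][channels] -> [words][channels]."""
    out = []
    for q in items:  # each item is the context once
        logits = [sum(a * b for a, b in zip(q, k)) for k in items]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        out.append([
            sum((e / z) * v[j] for e, v in zip(exps, items))
            for j in range(len(q))
        ])
    return out
```

The output keeps one row per input item, matching the [words x out_channels] shape on the slide.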
- 116. RESIDUAL NETWORKS Enabled training of really deep architectures (up to 152 layers) Each layer learns the residual functions with reference to the layer inputs Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
- 117. TRANSFORMERS A transformer layer consists of a self-attention layer combined with multiple fully-connected or convolutional layers, with residual connections. A transformer-based encoder can consist of multiple transformer layers stacked in sequence. Full Input [words x in_channels] Full Output [words x out_channels] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- 118. NORMALIZATION Internal covariate shift refers to the changing distribution of each layer’s inputs during training, as the parameters of the previous layers change BatchNorm and other normalization techniques achieve better training effectiveness by addressing this problem Sergey Ioffe, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Image source: https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/
- 119. REGULARIZATION The process of adding information in order to prevent overfitting. Popular approaches: • Dropout • Regularization loss • Early stopping
- 120. CONTEXTUALIZED DEEP WORD EMBEDDINGS http://jalammar.github.io/illustrated-bert/ Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
- 121. BERT Stacked transformer layers. Pretrained on two tasks: • Masked language modeling • Next sentence prediction. Input: WordPiece embedding + position embedding + segment embedding. Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
- 122. LUNCH 60 MINS (1:00 PM - 2:00 PM)
- 123. SHALLOW NEURAL METHODS FOR RANKING 50 MINS (2:00 PM - 2:50 PM)
- 124. THERE IS A LONG HISTORY OF VECTOR SPACE MODELS (BOTH DENSE AND SPARSE) IN IR Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. In CACM, 1975. Scott Deerwester, et al. Indexing by latent semantic analysis. In JASIST, 1990. Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
- 125. RETRIEVAL USING VECTOR REPRESENTATIONS Generate vector representation of query Generate vector representation of document Estimate relevance from q-d vectors
- 126. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
- 127. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Compare query and document directly in the embedding space estimate relevance
- 128. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance as $p(q|d) = \prod_{t_q \in q} p(t_q|d)$, where $p(t_q|d)$ is the probability of generating term $t_q$ from document $d$.
- 129. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance with a smoothed estimate such as $p(q|d) = \prod_{t_q \in q} \big(\lambda\, p(t_q|d) + (1 - \lambda)\, p(t_q|D)\big)$, where $p(t_q|d)$ and $p(t_q|D)$ are the probabilities of randomly sampling term $t_q$ from document $d$ and the full collection $D$, respectively. $p(t_q|D)$ has a smoothing effect on the $p(t_q|d)$ estimation.
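A minimal sketch of a smoothed query likelihood scorer, assuming linear interpolation between the document and collection language models with a weight `lam`. The interpolation form and the default `lam=0.5` are assumptions for illustration; documents and the collection are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log p(q|d) with the document model smoothed by the collection model."""
    d_tf, c_tf = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p_d = d_tf[t] / len(doc)          # p(t|d)
        p_c = c_tf[t] / len(collection)   # p(t|D)
        p = lam * p_d + (1.0 - lam) * p_c
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score
```

Without the collection term, any query term missing from the document would drive the whole product to zero; smoothing keeps such documents rankable.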
- 130. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
- 133. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Probability of generating the term from the document based on similarity in the embedding space Probability of generating the term from the full collection based on similarity in the embedding space
- 134. NEURAL TRANSLATION LANGUAGE MODEL Translation Language Model (TLM): estimates $p(t_q|t_d)$ from q-d paired data, similar to statistical machine translation. Neural Translation Language Model (NTLM): uses term-term similarity in the embedding space to estimate $p(t_q|t_d)$. Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
- 135. AVERAGE TERM EMBEDDINGS Q-D relevance estimated by computing cosine similarity between centroid of q and d term embeddings Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
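The centroid comparison can be sketched as follows. The toy embedding table `emb` is hypothetical, and the sketch assumes each text contains at least one term present in the table.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def avg_embedding_score(query_terms, doc_terms, emb):
    """Cosine between the centroids of query and document term embeddings."""
    q = centroid([emb[t] for t in query_terms if t in emb])
    d = centroid([emb[t] for t in doc_terms if t in emb])
    return cosine(q, d)

# Hypothetical 2-d embeddings for illustration only.
emb = {"albuquerque": [1.0, 0.0], "population": [0.8, 0.6], "giraffe": [0.0, 1.0]}
on_topic = avg_embedding_score(["albuquerque"], ["population", "albuquerque"], emb)
off_topic = avg_embedding_score(["albuquerque"], ["giraffe"], emb)
```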
- 136. WORD MOVER’S DISTANCE Based on the Earth Mover’s Distance (EMD) [Rubner et al., 1998]. Originally proposed by Wan et al. [2005, 2007], but used WordNet and topic categories. Kusner et al. [2015] incorporated term embeddings. Adapted for q-d matching by Guo et al. [2016]. Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In ICCV, 1998. Xiaojun Wan and Yuxin Peng. The earth mover’s distance as a semantic measure for document similarity. In CIKM, 2005. Xiaojun Wan. A novel document similarity measure based on earth mover’s distance. Information Sciences, 2007. Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, 2016.
- 138. CHOICE OF TERM EMBEDDINGS FOR DOCUMENT RANKING RECAP: for the query “Albuquerque” the relevant document may contain terms like “population” and “area” Documents about “Santa Fe” not relevant for this query “Albuquerque” ↔ “population” (Topically similar) ✓ “Albuquerque” ↔ “Santa Fe” (Typically similar) ✗ Standard LSA and para2vec capture topical similarity, whereas w2v and GloVe capture a mix of both Top/Typ-ical Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
- 139. DUAL EMBEDDING SPACE MODEL What if I told you that everyone using word2vec is throwing half the model away? Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 140. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. IN-OUT captures a more Topical notion of similarity than IN-IN and OUT-OUT Effect is exaggerated when embeddings are trained on short text (e.g., queries)
- 141. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. Average term embeddings model, but use IN embeddings for query terms and OUT embeddings for document terms
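A minimal sketch of the IN-OUT scoring idea, assuming the score is the mean cosine between each query term's IN vector and the centroid of the unit-normalized OUT vectors of the document terms. The function name `desm_in_out` and the toy tables are hypothetical, and details may differ from the cited papers.

```python
import math

def _unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n else list(v)

def desm_in_out(query_terms, doc_terms, emb_in, emb_out):
    """Mean cosine between query IN vectors and the document OUT centroid."""
    doc_units = [_unit(emb_out[t]) for t in doc_terms if t in emb_out]
    d_bar = _unit([sum(col) / len(doc_units) for col in zip(*doc_units)])
    sims = [
        sum(a * b for a, b in zip(_unit(emb_in[t]), d_bar))
        for t in query_terms if t in emb_in
    ]
    return sum(sims) / len(sims)  # assumes at least one known query term

# Hypothetical IN and OUT tables for illustration only.
emb_in = {"albuquerque": [1.0, 0.0]}
emb_out = {"population": [2.0, 0.0], "giraffe": [0.0, 3.0]}
score = desm_in_out(["albuquerque"], ["population", "giraffe"], emb_in, emb_out)
```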
- 142. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 143. IN+OUT Embeddings for 2.7M words trained on 600M+ Bing queries https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e Download
- 144. RELEVANCE-BASED WORD EMBEDDING Goal: learn a word embedding that directly models a topical notion of similarity Given query q, predict term t sampled from a smoothed language model (estimated using PRF) for the query Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In SIGIR, 2017.
- 145. A TALE OF TWO QUERIES “PEKAROVIC LAND COMPANY” Hard to learn good representation for the rare term pekarovic But easy to estimate relevance based on count of exact term matches of pekarovic in the document “WHAT CHANNEL ARE THE SEAHAWKS ON TODAY” Target document likely contains ESPN or sky sports instead of channel The terms ESPN and channel can be compared in a term embedding space Matching in the term space is necessary to handle rare terms. Matching in the latent embedding space can provide additional evidence of relevance. Best performance is often achieved by combining matching in both vector spaces.
- 146. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) Besides the term “Cambridge”, other related terms (e.g., “university”, “town”, “population”, and “England”) contribute to the relevance of the passage Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 147. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) However, the same terms may also make a passage about Oxford look somewhat relevant to the query “Cambridge” Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 148. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) A passage about giraffes, however, obviously looks non-relevant in the embedding space… Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 149. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) But the embedding based matching model is more robust to the same passage when “giraffe” is replaced by “Cambridge”, a trick that would fool exact term based IR models. In a sense, the embedding based model ranks this passage low because Cambridge is not “an African even-toed ungulate mammal”. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
- 150. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014. Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Hamed Zamani and W Bruce Croft. Estimating embedding vectors for queries. In ICTIR, 2016. Compare query and document directly in the embedding space estimate relevance
- 151. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
- 152. QUERY EXPANSION USING TERM EMBEDDINGS Use embeddings to generate suitable query expansions estimate relevance Find good expansion terms based on nearness in the embedding space Better retrieval performance when combined with pseudo-relevance feedback (PRF) [Zamani and Croft, 2016] and if we learn query specific term embeddings [Diaz et al., 2016] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016. Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016. Hamed Zamani and W Bruce Croft. Embedding-based query language models. In ICTIR, 2016.
- 153. QUERY EXPANSION Language model based IR: $score(d, q) = KL(\theta_q, \theta_d)$, where $\theta_q$ is the query language model and $\theta_d$ is the document language model. The expanded query model is $\tilde{\theta}_q = U U^{\top} \theta_q$. Query expansion using PRF: $U$ is the $|V| \times |D|$ term-document matrix. Query expansion using word embeddings: $U$ is the $|V| \times k$ term embedding matrix.
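The projection through the embedding matrix can be sketched as two steps: map the query model into the k-dimensional space, then map back onto the vocabulary. The dictionary representation of `U` (one row per term) and the toy vectors are assumptions for illustration.

```python
def expand_query(theta_q, U):
    """theta_q: term -> weight; U: term -> k-dim embedding (one row of U)."""
    k = len(next(iter(U.values())))
    # Step 1: U^T theta_q, a k-dimensional summary of the query.
    projected = [
        sum(theta_q.get(t, 0.0) * vec[j] for t, vec in U.items())
        for j in range(k)
    ]
    # Step 2: U (U^T theta_q), one weight for every vocabulary term.
    return {t: sum(a * b for a, b in zip(vec, projected)) for t, vec in U.items()}

# Hypothetical embeddings for illustration only.
U = {"cut": [1.0, 0.0], "reduce": [0.9, 0.1], "giraffe": [0.0, 1.0]}
expanded = expand_query({"cut": 1.0}, U)
```

Terms whose embeddings lie near the query terms receive high weights and become candidate expansion terms.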
- 154. “Word2vec is the sriracha sauce of deep learning!”
- 155. BUT A GOOD CHEF… Would prepare the appropriate sauce for each dish.
- 156. GLOBAL VS. LOCAL EMBEDDINGS Nearest neighbors of the word “cut” (as in “gas cut”) in the embedding space

  | Local | Global |
  |---|---|
  | tax | cutting |
  | deficit | squeeze |
  | vote | reduce |
  | budget | slash |
  | reduction | reduction |
  | house | spend |
  | bill | lower |
  | plan | halve |
  | spend | soften |
  | billion | freeze |

  $U_{global}$: embedding trained with $P(w|C)$. $U_{local}$: embedding trained with $P(w|R)$. Here $C$ is the whole document corpus and $R$ is the set of relevant documents only.
- 157. QUERY EXPANSION USING GLOBAL AND LOCAL WORD EMBEDDINGS Each point represents a candidate expansion term Red points have high frequency in the relevant set of documents White points have low or no frequency in the relevant set of documents The blue point represents the query. Contours indicate distance from the query global local Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
- 158. DEEP NEURAL METHODS FOR RANKING 90 MINS (2:50 PM - 4:20 PM)
- 159. SEMANTIC HASHING Document autoencoder minimizing reconstruction error Input: word counts (vocab size = 2K) Output: binary vector Stacked RBMs w/ layer-by-layer pre- training followed by E2E tuning Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
- 160. DEEP SEMANTIC SIMILARITY MODEL (DSSM) Siamese network trained E2E on query and document title pairs Relevance is estimated by cosine similarity between query and document embeddings Input: character trigraph counts (bag of words assumption) Minimizes cross-entropy loss against randomly sampled negative documents Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
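The character-trigraph input can be sketched as below, assuming each word is lowercased and wrapped in `#` boundary markers before extracting overlapping three-character windows. The function name `char_trigrams` is hypothetical.

```python
from collections import Counter

def char_trigrams(text):
    """Bag of character trigraphs over all words in the text."""
    counts = Counter()
    for word in text.lower().split():
        marked = f"#{word}#"  # mark word boundaries
        for i in range(len(marked) - 2):
            counts[marked[i:i + 3]] += 1
    return counts

features = char_trigrams("seattle")
```

Because the counts ignore word order and the trigraph vocabulary is far smaller than the word vocabulary, rare and misspelled words still share features with known ones.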
- 161. CONVOLUTIONAL DSSM (CDSSM) Replace bag-of-words assumption by concatenating term vectors in a sequence on the input Convolution followed by global max-pooling Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
- 162. REMEMBER… …how different embedding spaces capture different notions of similarity?
- 163. DSSM TRAINED ON DIFFERENT TYPES OF DATA

  | Trained on pairs of… | Sample training data | Useful for? | Paper |
  |---|---|---|---|
  | Query and document titles | <“things to do in seattle”, “seattle tourist attractions”> | Document ranking | (Shen et al., 2014) https://dl.acm.org/citation... |
  | Query prefix and suffix | <“things to do in”, “seattle”> | Query auto-completion | (Mitra and Craswell, 2015) https://dl.acm.org/citation... |
  | Consecutive queries in user sessions | <“things to do in seattle”, “space needle”> | Next query suggestion | (Mitra, 2015) https://dl.acm.org/citation... |

  Each model captures a different notion of similarity (or regularity) in the learnt embedding space. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015. Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
- 164. Nearest neighbors for “seattle” and “taylor swift” based on two DSSM models – one trained on query-document pairs and the other trained on query prefix-suffix pairs DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
- 165. DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Groups of similar search intent transitions from a query log The DSSM trained on session query pairs can capture regularities in the query space (similar to word2vec for terms) Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
- 166. DSSM TRAINED ON SESSION QUERY PAIRS ALLOWS FOR ANALOGIES OVER SHORT TEXT! Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
- 167. INTERACTION-BASED NETWORKS Typically a document is relevant if some part of the document contains information relevant to the query Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing the ith window over query terms with the jth window over the document terms—captures evidence of relevance from different parts of the document Additional neural network layers can inspect the interaction matrix and aggregate the evidence to estimate overall relevance Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
- 168. REMEMBER… …the importance of incorporating exact term matches as well as matches in the latent space for estimating relevance?
- 169. LEXICAL AND SEMANTIC MATCHING NETWORKS Mitra et al. [2016] argue that both lexical and semantic matching are important for document ranking. The Duet model is a linear combination of two DNNs, focusing on lexical and semantic matching respectively, jointly trained on labelled data. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
- 170. LEXICAL AND SEMANTIC MATCHING NETWORKS Lexical sub-model operates over input matrix $X$, where $x_{i,j} = \begin{cases} 1, & \text{if } t_{q,i} = t_{d,j} \\ 0, & \text{otherwise} \end{cases}$ In relevant documents: 1. Many matches, typically in clusters 2. Matches localized early in document 3. Matches for all query terms 4. In-order (phrasal) matches Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
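The binary interaction matrix used by the lexical sub-model can be sketched directly from its definition. The function name `exact_match_matrix` is hypothetical, and tokenization is assumed to have happened already.

```python
def exact_match_matrix(query_terms, doc_terms):
    """Rows are query terms, columns are document positions; 1 marks an exact match."""
    return [[1 if tq == td else 0 for td in doc_terms] for tq in query_terms]

X = exact_match_matrix(
    ["seattle", "weather"],
    ["weather", "in", "seattle", "seattle"],
)
```

Each row shows where one query term occurs in the document, so clustered, early, or in-order matches appear as visible patterns that later layers can pick up.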
- 171. LEXICAL AND SEMANTIC MATCHING NETWORKS Convolve using window of size $n_d \times 1$. Each window instance compares a query term w/ whole document. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
- 172. LEXICAL AND SEMANTIC MATCHING NETWORKS Semantic sub-model matches in the latent embedding space Match query with moving windows over document Learn text embeddings specifically for the task Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
- 173. BIG VS. SMALL DATA REGIMES Big data seems to be more crucial for models that focus on good representation learning for text Partial supervision strategies (e.g., unsupervised pre-training of word embeddings) can be effective but may be leaving the bigger gains on the table Learning to train on unlabeled data may be key to making progress on neural ad-hoc retrieval Which IR models are similar? Clustering based on query level retrieval performance. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
- 174. Duet implementation on PyTorch https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb GET THE CODE
- 175. WIDE AND DEEP MODEL Deep model for representation learning and wide model for memorization Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
- 176. KERNEL POOLING Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
- 177. WEB DOCUMENTS ARE MORE THAN JUST BODY TEXT… URL incoming anchor text title body clicked query
- 178. EXTENDING NEURAL RANKING MODELS TO MULTIPLE DOCUMENT FIELDS BM25 Neural ranking model → → BM25F ?
- 179. RANKING DOCUMENTS WITH MULTIPLE FIELDS Learn different embedding space for each document field Different fields may match different aspects of the query—learn different query embeddings for matching against different fields Represent per field match by a vector, not a score Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
- 180. Learn a different embedding space for each document field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
- 181. For multiple-instance fields, average pool the instance level embeddings Mask empty text instances, and average only among non-empty instances to avoid preferring documents with more instances Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
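The masked averaging step can be sketched as follows. The function name `masked_average` is hypothetical, and the instance embeddings are assumed to be computed elsewhere (for example, one vector per anchor text).

```python
def masked_average(instance_embeddings, instance_texts):
    """Average only the embeddings whose source text is non-empty."""
    kept = [e for e, t in zip(instance_embeddings, instance_texts) if t.strip()]
    if not kept:
        # No usable instance: fall back to a zero vector of the same size.
        return [0.0] * len(instance_embeddings[0])
    return [sum(col) / len(kept) for col in zip(*kept)]

field_vec = masked_average([[2.0, 4.0], [9.0, 9.0], [4.0, 0.0]], ["a", "", "b"])
```

Dividing by the number of non-empty instances, rather than the padded total, keeps the field vector on the same scale regardless of how many instances a document has.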
- 182. Learn different query embeddings for matching against different fields. Different fields may match different aspects of the query. The ideal query representation for matching against the URL is likely to differ from the one for matching against the title. Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
- 183. Represent per field match by a vector, not a score Allows the model to validate that across the different fields all aspects of the query intent have been covered (Similar intuition as BM25F) Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
- 184. Aggregate evidence of relevance across all document fields Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
- 185. High precision fields, such as clicked queries, can negatively impact the modeling of the other fields Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
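A minimal sketch of field-level dropout, assuming each field's match vector is independently zeroed with probability `p` during training. The function name, the per-field independence, and the field names are assumptions; the cited paper's exact procedure may differ.

```python
import random

def field_dropout(field_vectors, p, rng):
    """Zero out entire field vectors at random so no single field dominates."""
    out = {}
    for name, vec in field_vectors.items():
        out[name] = [0.0] * len(vec) if rng.random() < p else list(vec)
    return out

fields = {"title": [0.2, 0.7], "url": [0.1, 0.3], "clicked_query": [0.9, 0.8]}
dropped = field_dropout(fields, p=0.5, rng=random.Random(0))
```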
- 186. MANY OTHER NEURAL ARCHITECTURES (Palangi et al., 2015) (Kalchbrenner et al., 2014) (Denil et al., 2014) (Kim, 2014) (Severyn and Moschitti, 2015) (Zhao et al., 2015) (Hu et al., 2014) (Tai et al., 2015) (Guo et al., 2016) (Hui et al., 2017) (Pang et al., 2017) (Jaech et al., 2017) (Dehghani et al., 2017)
- 187. BERT FOR RANKING BERT (and other large-scale unsupervised language models) are demonstrating dramatic performance improvements on many IR tasks Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019. MS MARCO Query Passage Pair Query Passage score
- 188. Impact across both academia and industry BERT FOR RANKING
- 189. WHAT DID YOUR MODEL REALLY LEARN? While we celebrate the recent performance bumps on IR tasks from neural methods, it is also important to recognize when and how they fail
- 190. Clever Hans was a horse claimed to have been capable of performing arithmetic and other intellectual tasks. "If the eighth day of the month comes on a Tuesday, what is the date of the following Friday?" Hans would answer by tapping his hoof. In fact, the horse was purported to have been responding directly to involuntary cues in the body language of the human trainer, who had the faculties to solve each problem. The trainer was entirely unaware that he was providing such cues. (source: Wikipedia)
- 191. What corpus statistics does your model depend on? BM25 (inverse document frequency of terms) vs. BERT (language model of term co-occurrences)
- 192. WHAT CHANGED BETWEEN TRAIN AND TEST? Terms often change meaning across domains or over time. Robust retrieval performance is important (e.g., enterprise search across multiple tenants). Query: uk prime minister [Figure panels: Today; Recent; In older (1990s) TREC data]
- 193. domain A domain B domain C domain X training domains test domain OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
- 194. Baseline model projects query and document to latent space for matching Additional fully-connected layers to estimate relevance Hidden layers may encode domain specific statistics convolution and pooling layers convolution and pooling layers hadamard product dense layers 𝑦 query doc How do we encourage the model to only learn features that generalize across multiple domains? OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
- 195. OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Train model on multiple domains During training, an adversarial discriminator inspects the hidden states of the model and tries to predict the source corpus of the training sample convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc The duet model, in addition to optimizing for the ranking loss, also tries to “fool” the adversarial discriminator – and in the process learns more domain independent representations Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
- 196. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
- 197. query relevant document non-relevant document parameters of the adversarial discriminator parameters of the ranking model Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
- 198. Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
- 199. Reverse the gradient from the discriminator when back-propagating through the ranking model convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc ≈ ≈ Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. GRADIENT REVERSAL
- 200. Adversarial regularization may also be useful for mitigating bias
- 201. MARRYING THE OLD WITH THE NEW
- 202. (SIGIR ’94) (SIGIR ’04) (SIGIR ’05)
- 205. CONNECTION TO NEURAL RANKER TRAINING Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
- 206. AXIOMATIC REGULARIZATION FOR NEURAL RANKER Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
- 207. BREAK 10 MINS (4:20 PM - 4:30 PM)
- 208. BEYOND RERANKING 45 MINS (4:30 PM - 5:15 PM)
- 209. RETRIEVING, NOT JUST RERANKING, WITH DEEP NEURAL NETWORKS Deep ranking models are compute-intensive and in practice are employed only to rerank the top-k candidates retrieved by more efficient traditional IR methods. The impact on IR performance may be significantly larger if we can also use them for candidate generation.
- 210. OPTION 1: QUERY INDEPENDENT DOCUMENT REPRESENTATION Employ a Siamese network architecture. Compute document representations offline and the query representation at inference time. Efficient online, but with a large offline computation cost. Effectiveness degrades without interaction features and lexical term matching.
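The offline/online split can be sketched as follows. This is a toy illustration only: `encode` is a hypothetical stand-in for the shared (Siamese) encoder and simply returns normalized word counts over a made-up vocabulary, and the search is exact brute force rather than an approximate nearest-neighbour index.

```python
import heapq
import math

VOCAB = ["neural", "ranking", "bridge", "kolkata", "search"]  # toy vocabulary (assumption)


def encode(text):
    """Hypothetical stand-in for the shared encoder: L2-normalized word counts."""
    words = text.lower().split()
    vec = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def build_index(docs):
    """Offline step: encode every document once, independently of any query."""
    return {doc_id: encode(text) for doc_id, text in docs.items()}


def search(index, query, k=2):
    """Online step: encode the query, then score it against stored document vectors."""
    q = encode(query)
    scored = ((sum(a * b for a, b in zip(q, d)), doc_id) for doc_id, d in index.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


docs = {
    "d1": "neural ranking search",
    "d2": "kolkata bridge",
    "d3": "neural search",
}
index = build_index(docs)
results = search(index, "neural ranking")
print(results)
```

Only `search` runs at query time; `build_index` is the large offline cost. For large collections the linear scan in `search` would be replaced by an approximate k-NN structure.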
- 211. APPROXIMATE K-NN SEARCH Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let's replace term-based retrieval with k-nn search. In CIKM, 2016.
- 212. LEARNING SPARSE VECTOR REPRESENTATIONS Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009. Hamed Zamani, et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM, 2018.
- 213. FAST APPROX. K-NN SEARCH WITH ANNOY https://github.com/spotify/annoy
- 214. OPTION 2: ASSUME QUERY TERM INDEPENDENCE Efficient online, but with a large offline computation cost. Can scale to tail queries but at a higher computation cost; we can trade off the two experimentally. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 215. WE TYPICALLY LEARN THE PARAMETERS OF THE MODEL BY MINIMIZING SOME PAIRWISE LOSS… e.g., Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
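As an illustration of what a pairwise loss looks like, here is a sketch of the pairwise logistic (RankNet-style) loss, one widely used choice. It is shown as an assumption for illustration and is not necessarily the loss that appeared on this slide; `score_pos` and `score_neg` are hypothetical model scores for a relevant and a non-relevant document for the same query.

```python
import math


def pairwise_logistic_loss(score_pos, score_neg):
    """Small when the relevant document outscores the non-relevant one."""
    return math.log1p(math.exp(-(score_pos - score_neg)))


# The loss shrinks as the margin between the two scores grows.
for s_pos, s_neg in [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]:
    print(s_pos, s_neg, pairwise_logistic_loss(s_pos, s_neg))
```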
- 216. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 217. TERM-DOCUMENT IMPACT SCORES [Figure: matrix of impact scores over terms t1 to t5 and documents d1 to d5] If the IR model assumes query term independence, we can precompute all term-document impact scores. The matrix is generally sparse, either by definition or by enforcing additional sparsity constraints (e.g., assume only terms that appear in the document have non-zero impact). Precomputed scores can be used with an inverted index for fast retrieval from large collections. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
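A minimal sketch of retrieval from precomputed impact scores, assuming the document score is the sum of per-term impact scores over the query terms. The score values, document ids, and function names below are made up for illustration; in practice the scores would come from the trained model.

```python
from collections import defaultdict


def build_inverted_index(impact_scores):
    """impact_scores: {(term, doc_id): score}; zero entries are simply absent (sparse)."""
    index = defaultdict(list)
    for (term, doc_id), score in impact_scores.items():
        index[term].append((doc_id, score))
    return index


def retrieve(index, query, k=3):
    """Sum the precomputed impact scores of each query term, then rank documents."""
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, score in index.get(term, []):
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical precomputed term-document impact scores.
impact_scores = {
    ("halloween", "d1"): 1.5,
    ("costumes", "d1"): 1.0,
    ("halloween", "d2"): 0.5,
    ("costumes", "d3"): 2.0,
    ("facebook", "d4"): 3.0,
}
index = build_inverted_index(impact_scores)
ranked = retrieve(index, "halloween costumes")
print(ranked)
```

No model inference happens at query time here: the query only selects which posting lists to read and accumulate.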
- 218. NEURAL RANKING MODEL WITH QUERY TERM INDEPENDENCE ASSUMPTION score Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 219. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 220. THE EFFECTIVENESS-EFFICIENCY TRADE-OFF The model does not have the context of the full query, which may result in reduced effectiveness. However, now we can precompute everything and use the learned model in a full retrieval setting! Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 221. HOW SERIOUS IS THE LOSS IN EFFECTIVENESS FROM ASSUMING QUERY TERM INDEPENDENCE? [Tables: reranking evaluation; full retrieval evaluation (on a smaller set of queries than the reranking table)] Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
- 222. DOCUMENT EXPANSION Similar to the query term independence approach, in that both try to learn a better document language model. The training objective here, however, is to predict relevant queries, not to optimize the target ranking metric we care about. Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document Expansion by Query Prediction. In arXiv, 2019.
- 223. TRADING OFF SEARCH RESULT QUALITY AND QUERY RESPONSE TIME IN LARGE SCALE IR SYSTEMS In Bing, we have a candidate generation stage followed by multiple rank-and-prune stages. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 224. In Bing, the index is distributed over multiple machines. For candidate generation, on each machine the documents are linearly scanned using a match plan. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 225. When a query comes in, it is automatically categorized and a pre-defined match plan is selected. A match plan consists of a sequence of match rules and corresponding stopping criteria. A match rule defines the condition that a document should satisfy to be selected as a candidate. The stopping criterion decides when the index scan using a particular match rule should terminate, and whether the matching process should then continue with the next match rule, conclude, or reset to the beginning of the index. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 226. Match plans influence the trade-off between effectiveness and efficiency. E.g., long queries with rare intents may require expensive match plans that consider body text and search deeper into the index. In contrast, for popular navigational queries a shallow scan against URL and title metastreams may be sufficient. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 227. E.g., Query: halloween costumes Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T ) Query: facebook login Match rule: mrB → (facebook ∈ U|T ) Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 228. During execution, two accumulators are tracked: u, the number of blocks accessed from disk, and v, the cumulative number of term matches in all inspected documents. A stopping criterion sets thresholds for each; when either threshold is met, the scan using that particular match rule terminates. Matching may then continue with a new match rule, terminate, or restart from the beginning. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 229. TYPICALLY THESE MATCH PLANS ARE HAND-CRAFTED AND STATICALLY ASSIGNED TO DIFFERENT QUERY CATEGORIES. WE CAN CAST MATCH PLANNING AS A REINFORCEMENT LEARNING TASK! Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
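The scan with accumulators and thresholds can be sketched as below. This is a simplified toy, not Bing's implementation: the in-memory list standing in for the index, the fixed `block_size`, the single-letter field keys (`A`, `U`, `B`, `T`), and the choice to check thresholds once per block are all assumptions made for illustration.

```python
def term_matches(doc, rule):
    """Count how many (term, fields) clauses of the rule the document satisfies."""
    return sum(1 for term, fields in rule if any(term in doc[f] for f in fields))


def run_match_rule(index, rule, max_u, max_v, block_size=2, start=0):
    """Scan blocks linearly from `start`; stop once either accumulator reaches its threshold."""
    u = v = 0
    candidates = []
    pos = start
    while pos < len(index) and u < max_u and v < max_v:
        block = index[pos:pos + block_size]
        u += 1                              # one more block accessed
        for doc in block:
            matched = term_matches(doc, rule)
            v += matched                    # cumulative term matches so far
            if matched == len(rule):        # conjunctive rule: every clause holds
                candidates.append(doc["id"])
        pos += block_size
    return candidates, u, v


# Toy index: each document maps a field key to the set of terms in that field.
index = [
    {"id": "d1", "U": {"facebook"}, "T": {"facebook", "login"}, "B": set(), "A": set()},
    {"id": "d2", "U": set(), "T": {"halloween"}, "B": {"costumes"}, "A": set()},
    {"id": "d3", "U": set(), "T": set(), "B": {"halloween"}, "A": set()},
    {"id": "d4", "U": set(), "T": {"costumes"}, "B": {"halloween", "costumes"}, "A": set()},
]

mrA = [("halloween", "AUBT"), ("costumes", "AUBT")]  # both terms, in any field
mrB = [("facebook", "UT")]                           # one term, URL or title only

print(run_match_rule(index, mrA, max_u=10, max_v=10))
print(run_match_rule(index, mrB, max_u=1, max_v=10))
```

The second call shows the efficiency side of the trade-off: with a block budget of one, the scan for the navigational query stops after a single block.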
- 230. REINFORCEMENT LEARNING [Figure: agent-environment loop with labels environment, action, reward, agent, state] Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 231. REINFORCEMENT LEARNING (for Bing candidate generation) [Figure: agent-environment loop with environment = index, action = match rule, reward = relevance discounted by index blocks accessed, state = accumulators (u, v)] Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 232. REINFORCEMENT LEARNING (for Bing candidate generation) Learn a policy πθ : S → A which maximizes the cumulative discounted reward R, where γ is the discount rate. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 233. REINFORCEMENT LEARNING (for Bing candidate generation) We use table-based Q-learning. State space: discrete <u_t, v_t>. Action space: Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 234. REINFORCEMENT LEARNING (for Bing candidate generation) Reward function: g(d_i) is the relevance of the i-th document, estimated based on the subsequent L1 ranker score, considering only the top n documents. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 235. REINFORCEMENT LEARNING (for Bing candidate generation) Final reward: If no new documents are selected, we assign a small negative reward. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
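A toy sketch of table-based Q-learning over (u, v) states. Everything specific here is an assumption for illustration: the two rules and their costs and gains, the block budget `MAX_U`, the reward form `gain / u`, the `-0.1` penalty, and the learning rate. To keep the run deterministic, the loop sweeps over every state-action pair instead of using epsilon-greedy exploration; the update itself is the standard Q-learning rule.

```python
GAMMA, ALPHA, MAX_U = 0.9, 0.5, 3
# Hypothetical match rules: (blocks read, term matches found, relevance gained).
RULES = {"mrA": (2, 2, 4.0), "mrB": (1, 1, 1.0)}
ACTIONS = list(RULES) + ["stop"]


def step(state, action):
    """Toy deterministic environment: returns (next_state or None if terminal, reward)."""
    u, v = state
    if action == "stop":
        return None, 0.0
    cost, matches, gain = RULES[action]
    if u + cost > MAX_U:
        return None, -0.1             # nothing new selected: small negative reward
    u, v = u + cost, v + matches
    reward = gain / u                 # relevance discounted by blocks accessed (toy form)
    return (None if u == MAX_U else (u, v)), reward


def q_update(Q, state, action, reward, next_state):
    """Standard tabular Q-learning update."""
    best_next = 0.0
    if next_state is not None:
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)


def reachable_states(start=(0, 0)):
    """Enumerate the non-terminal states reachable from the start state."""
    seen, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for action in ACTIONS:
            nxt, _ = step(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return sorted(seen)


def greedy_plan(Q, state=(0, 0)):
    """Follow the highest-valued action from each state until the episode ends."""
    plan = []
    while state is not None:
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
        if action == "stop":
            break
        plan.append(action)
        state, _ = step(state, action)
    return plan


Q = {}
states = reachable_states()
for _ in range(100):                  # exhaustive sweeps replace random exploration
    for state in states:
        for action in ACTIONS:
            next_state, reward = step(state, action)
            q_update(Q, state, action, reward, next_state)

plan = greedy_plan(Q)
print(plan)
```

In this toy setting the learned greedy plan is a sequence of match rules, which plays the role of a match plan chosen by the policy rather than assigned by hand.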
- 236. RESULTS Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
- 237. DEEP LEARNING @ TREC 15 MINS (5:15 PM - 5:30 PM)
- 238. GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA Past: Proprietary data (200K queries, human-labeled, proprietary). Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017. Past: Weak supervision (1+M queries, weak supervision, open). Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017. Here: Two new datasets (300+K queries, human-labeled, open). TREC 2019 Deep Learning Track. [Figure labels: More data; Better search results]
- 239. GENERATING PUBLIC BENCHMARKS FOR NEURAL IR RESEARCH A public retrieval and ranking benchmark with large scale training data (~400K queries with manual relevance labels)
- 240. DERIVING OUR TREC 2019 DATASETS MS MARCO QnA Leaderboard: 1M real queries; 10 passages per query; human annotation says ~1 of 10 answers the query. MS MARCO Passage Retrieval Leaderboard: corpus is the union of the 10-passage sets; labels come from the ~1 positive passage. TREC 2019 Task: Passage Retrieval: same corpus and training queries + labels; new reusable NIST test set. TREC 2019 Task: Document Retrieval: corpus of documents (crawl passage URLs); labels transferred from passage to document; new reusable NIST test set. http://msmarco.org https://microsoft.github.io/TREC-2019-Deep-Learning/
- 241. SETUP OF THE 2019 DEEP LEARNING TRACK • Key question: What works best in a large-data regime? • “nnlm”: Runs that use a BERT-style language model • “nn”: Runs that do representation learning • “trad”: Runs using only traditional IR features (such as BM25 and RM3) • Subtasks: “fullrank”: end-to-end retrieval; “rerank”: top-k reranking (documents: k=100 from Indri QL; passages: k=1000 from BM25). Tasks (training data | test data | corpus): 1) Document retrieval: 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents. 2) Passage retrieval: 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages. * Mostly-overlapping query sets (41 shared)
- 242. DATASET AVAILABILITY • Corpus + train + dev data for both tasks available now from the DL Track site* • NIST test sets available to participants now • [Broader availability in Feb 2020] * https://microsoft.github.io/TREC-2019-Deep-Learning/
- 243. SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
- 245. Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com THANK YOU

- Local representation: one dimension for “banana”; brittle under noise; precise; add vocab → add dimensions; K dimensions → K items. Distributed representation: “banana” is a pattern; more robust to noise; near “mango”, “pineapple” (nuanced); add vocab → generate more vectors; K dimensions → 2^K “regions”.
- Clever Hans was a horse. It was claimed that he could do simple arithmetic. If you asked Hans a question he would respond by tapping his hoof. After a thorough investigation it was determined, however, that what Clever Hans was really good at was reading very subtle and, in fact, unintentional clues that his trainer was giving him via his body language. Hans didn’t know arithmetic at all. But he was very good at spotting body language that CORRELATED highly with the right answer.
- A traditional IR model, such as BM25, makes very few assumptions about the target collection. You can argue that the inverse document frequencies (and a couple of the BM25 hyper-parameters) are all that you would learn from your collection. That is why you can throw BM25 at most retrieval tasks (e.g., TREC or Web ranking in Bing) and it will give you pretty reasonable performance out-of-the-box in most cases. On the other hand, take a deep neural model, train it on the Bing Web ranking task, and then evaluate it on TREC data, and I bet it falls flat on its face.
- This is an important problem. Think about an enterprise search solution that needs to cater to a large number of tenants. You train your model on only a few tenants—either because of privacy constraints or because most tenants are too small and you don’t have enough training data for the others. But afterwards you need to deploy the same model to all the tenants. Good cross domain performance would be key in such a setting. How can we make these deep and large machine learning models—with all their lovely bells and whistles—as robust as a simple BM25 baseline?