
The Duet model


Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.



  1. The Duet Model: Learning to Match Using Local and Distributed Representations of Text for Web Search. Bhaskar Mitra (Microsoft, UCL; Cambridge, UK), Fernando Diaz (Spotify*; New York, USA; *work done while at Microsoft), Nick Craswell (Microsoft; Bellevue, USA)
  2. The document ranking task: given a query, rank documents according to relevance. The query text has few terms; the document representation can be long (e.g., body text) or short (e.g., title). [Diagram: query → search engine w/ an index of retrievable items → ranked results]
  3. This paper focuses on ranking documents based on their long body text.
  4. Many DNN models exist for short-text ranking: (Huang et al., 2013), (Severyn and Moschitti, 2015), (Shen et al., 2014), (Palangi et al., 2015), (Hu et al., 2014), (Tai et al., 2015)
  5. But few for long-document ranking: (Guo et al., 2016), (Salakhutdinov and Hinton, 2009)
  6. Challenges in short vs. long text retrieval. Short text: vocabulary mismatch is a more serious problem. Long text: documents contain a mixture of many topics; matches in different parts of the document are non-uniformly important; term proximity is important.
  7. The “black swans” of Information Retrieval. The term black swan originally referred to impossible events. In 1697, Dutch explorers encountered black swans for the first time in western Australia; since then, the term has been used for surprisingly rare events. In IR, many query terms and intents are never observed in the training data; exact matching makes the IR model robust to such rare events.
  8. Desiderata of document ranking. Exact matching: important if the query term is rare or fresh; frequency and positions of matches are good indicators of relevance; term proximity is important. Inexact matching: captures synonymy relationships (united states president ↔ Obama) and evidence of document aboutness (documents about Australia are likely to contain related terms like Sydney and koala); proximity and position are important.
  9. Different text representations for matching. Local representation: terms are considered distinct entities; the term representation is local (one-hot vectors); matching is exact (term-level). Distributed representation: text is represented as dense vectors (embeddings); matching is inexact, in the embedding space.
  10. A tale of two queries. “pekarovic land company”: it is hard to learn a good representation for the rare term pekarovic, but easy to estimate relevance from patterns of exact matches. Proposal: learn a neural model to estimate relevance from patterns of exact matches. “what channel are the seahawks on today”: the target document likely contains ESPN or sky sports instead of channel; an embedding model can associate ESPN in the document with channel in the query. Proposal: learn embeddings of text and match the query with the document in the embedding space. The Duet architecture uses a single neural network to model both functions and learns their parameters jointly.
  11. The Duet architecture: a linear combination of two models trained jointly on labelled query–document pairs. The local model operates on a lexical interaction matrix (query and document term vectors → interaction matrix → fully connected layers for matching). The distributed model projects n-graph vectors of text into an embedding space and then estimates the match (query and document embeddings → Hadamard product → fully connected layers for matching). The two scores are summed.
  12. Local model
  13. Local model: term interaction matrix. X_ij = 1 if q_i = d_j, and 0 otherwise. In relevant documents: many matches, typically clustered; matches localized early in the document; matches for all query terms; in-order (phrasal) matches.
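The binary term interaction matrix can be sketched in a few lines of Python; the helper name and example terms here are illustrative, not from the paper:

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms):
    """Binary term interaction matrix: X[i, j] = 1 iff query term i
    equals document term j, and 0 otherwise."""
    X = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, q in enumerate(query_terms):
        for j, d in enumerate(doc_terms):
            if q == d:
                X[i, j] = 1.0
    return X

# Rows are query terms, columns are document terms.
X = interaction_matrix(["united", "states", "president"],
                       ["the", "president", "of", "the", "united", "states"])
```

Clustered, early, in-order ones in a row of this matrix are exactly the match patterns the local model learns to read as evidence of relevance.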
  14. Local model: estimating relevance. Convolve over the interaction matrix using a window of size n_d × 1 (the document length), so each window instance compares one query term with the whole document. Fully connected layers then aggregate evidence across query terms and can model phrasal matches.
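This convolution step can be sketched in numpy under loose assumptions: random weights stand in for the learned filters, the sizes are made up, and the real model has additional layers and regularization. Because each filter spans a full document-length row, applying it is a matrix product of the interaction matrix with the filter bank:

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_d, n_filters = 10, 1000, 300   # query length, doc length, filters (illustrative)

# Toy sparse interaction matrix: ~1% of entries are exact matches.
X = (rng.random((n_q, n_d)) < 0.01).astype(np.float32)

# Convolving with windows of size n_d x 1 means each filter sees one query
# term's entire row of document matches at a time.
W = rng.standard_normal((n_filters, n_d)).astype(np.float32) * 0.01
H = np.tanh(X @ W.T)                  # shape: (n_q, n_filters)

# A fully connected layer (here a single random projection) aggregates
# evidence across query terms into one relevance score.
w_out = rng.standard_normal(n_filters).astype(np.float32) * 0.01
score = float(np.tanh(H @ w_out).sum())
```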
  15. Distributed model
  16. Distributed model: input representation. dogs → [d, o, g, s, #d, do, og, gs, s#, #do, dog, ogs, gs#, #dog, dogs, ogs#, #dogs, dogs#] (only the 2K most popular n-graphs are used for encoding). Each word of the text (e.g., “dogs have owners, cats have staff”) is n-graph encoded, and the word vectors are concatenated into a [words × channels] matrix with 2K channels.
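The n-graph encoding of a word can be sketched like this (a simplified version: the helper names are made up, and where the model keeps the 2K most popular n-graphs as channels, the tiny vocabulary below is only for illustration):

```python
from collections import Counter

def ngraphs(word, max_n=5):
    """All character n-grams of the word up to length max_n, with '#'
    marking word boundaries for n >= 2 (as in the 'dogs' example)."""
    grams = []
    for n in range(1, max_n + 1):
        padded = word if n == 1 else "#" + word + "#"
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

def encode(word, vocab):
    """Count vector over a fixed n-graph vocabulary (the channels)."""
    counts = Counter(g for g in ngraphs(word) if g in vocab)
    return [counts[g] for g in vocab]

grams = ngraphs("dogs")   # the 18 n-graphs listed on the slide
```

Encoding every word of a passage this way and stacking the vectors gives the [words × channels] input matrix the convolution layers consume.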
  17. Distributed model: estimating relevance. Convolve over query and document terms; match the query with moving windows over the document; learn text embeddings specifically for the task; matching happens in the embedding space (convolution and pooling produce a query embedding and document window embeddings, combined via Hadamard product and fully connected layers). *Network architecture slightly simplified for visualization; refer to the paper for exact details.
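The matching step can be sketched in numpy with random vectors standing in for the learned embeddings; the dimensions are assumptions, and the real model produces these embeddings with convolution and pooling over the n-graph inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
emb_dim, n_windows = 300, 100          # illustrative sizes

# Stand-ins for learned embeddings: one vector for the query and one per
# moving window over the document.
q_emb = rng.standard_normal(emb_dim)
d_embs = rng.standard_normal((n_windows, emb_dim))

# Matching happens in the embedding space: the element-wise (Hadamard)
# product compares the query with each document window...
interactions = q_emb * d_embs          # shape: (n_windows, emb_dim)

# ...and fully connected layers (here one random projection) reduce the
# interactions to a relevance score.
w = rng.standard_normal(emb_dim) * 0.01
score = float(np.tanh(interactions @ w).mean())
```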
  18. Putting the two models together…
  19. The Duet model. Training sample: (Q, D+, D1−, D2−, D3−, D4−), where D+ is a document rated Excellent or Good and each D− is a document rated two ratings worse than D+. Optimize the cross-entropy loss. Implemented using CNTK (GitHub link).
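The training objective can be sketched as a softmax cross-entropy over the duet scores of the positive document and its four negatives; the scores below are placeholder numbers, not model outputs:

```python
import numpy as np

def duet_loss(scores):
    """Cross-entropy over one positive and several negative documents.
    scores[0] is the duet score of D+, the rest are the D- scores."""
    exp = np.exp(scores - np.max(scores))   # numerically stable softmax
    p_pos = exp[0] / exp.sum()
    return -np.log(p_pos)

# Toy sample: the positive document scored above the four judged negatives.
loss = duet_loss(np.array([2.0, 0.1, -0.3, 0.5, 0.0]))
```

Minimizing this loss pushes the score of D+ above the scores of the four negatives, which is the ranking behaviour the model needs at test time.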
  20. Data. Large-scale training data (labels or clicks) is needed. We use Bing human-labelled data for both training and testing.
  21. Results. Key finding: Duet performs significantly better than the local and distributed models trained individually.
  22. Random negatives vs. judged negatives. Key finding: training with judged bad documents as negatives is significantly better than training with random negatives.
  23. Local vs. distributed model. Key finding: the local and distributed models perform better on different query segments, but the combination is always better.
  24. Effect of training data volume. Key finding: a large quantity of training data is necessary for learning good representations, but is less impactful for training the local model.
  25. Term importance: local model. Only query terms have an impact, and earlier occurrences have a bigger impact. Query: united states president. (Visualizing the impact of dropping terms on the model score.)
  26. Term importance: distributed model. Non-query terms (e.g., Obama and federal) have a positive impact on the score; common words like “the” and “of” are probably good indicators of well-formedness of the content. Query: united states president. (Visualizing the impact of dropping terms on the model score.)
  27. Types of models. If we classify models by query-level performance, there is a clear clustering of lexical (local) and semantic (distributed) models.
  28. Duet on other IR tasks. Promising early results on TREC 2017 Complex Answer Retrieval (TREC-CAR): Duet performs significantly better when trained on large data (~32 million samples). (Paper under review.)
  29. Summary. Both exact and inexact matching are important for IR, and deep neural networks can model both types of matching. The local model is more effective for queries containing rare terms; the distributed model benefits from training on large datasets. Combining the local and distributed models achieves state-of-the-art performance. Get the model: