Overview of the TREC 2019 Deep Learning Track

1. Nick Craswell (Microsoft), Bhaskar Mitra (Microsoft, UCL), Emine Yilmaz (UCL), Daniel Campos (Microsoft)
2. Goal: Large, human-labeled, open IR data
• Past: proprietary data (200K queries, human-labeled, proprietary). Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017.
• Past: weak supervision (1+M queries, weak supervision, open). Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017.
• Here: two new datasets (300+K queries, human-labeled, open). TREC 2019 Deep Learning Track.
• [Chart axes: more data vs. better search results]
3. Deriving our TREC 2019 datasets
• MS MARCO QnA Leaderboard: 1M real queries, 10 passages per query; human annotation says ~1 of the 10 answers the query
• MS MARCO Passage Retrieval Leaderboard: corpus is the union of the 10-passage sets; labels come from the ~1 positive passage per query (derivation sketch below)
• TREC 2019 Task, Passage Retrieval: same corpus and training queries+labels; new reusable NIST test set
• TREC 2019 Task, Document Retrieval: corpus of documents (crawled from the passage URLs); labels transferred from passages to documents; new reusable NIST test set
• http://msmarco.org
• https://microsoft.github.io/TREC-2019-Deep-Learning/
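The corpus and label derivation can be made concrete with a small sketch. The record layout below (query_id, a passages list with is_selected flags and passage_text) follows the general shape of the MS MARCO QnA release, but the field names and file layout are assumptions of this sketch, not the track's official tooling.

```python
import json
from collections import defaultdict

def derive_passage_data(qna_jsonl_path):
    """Build a passage corpus (union of the per-query 10-passage sets) and
    sparse qrels (the ~1 human-selected passage per query) from
    MS MARCO QnA-style records. Field names are illustrative."""
    corpus = {}                  # passage_id -> passage text
    qrels = defaultdict(dict)    # query_id -> {passage_id: 1}
    seen_text = {}               # dedupe identical passages shared across queries
    next_pid = 0

    with open(qna_jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            qid = record["query_id"]
            for passage in record["passages"]:
                text = passage["passage_text"]
                pid = seen_text.get(text)
                if pid is None:
                    pid = next_pid
                    next_pid += 1
                    seen_text[text] = pid
                    corpus[pid] = text
                if passage.get("is_selected"):   # the ~1 positive per query
                    qrels[qid][pid] = 1
    return corpus, dict(qrels)
```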
4. Setup of the 2019 deep learning track
• Key question: What works best in a large-data regime?
• Run categories:
  • "nnlm": runs that use a BERT-style language model
  • "nn": runs that do representation learning
  • "trad": runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
  • "fullrank": end-to-end retrieval
  • "rerank": top-k reranking (document task: k=100 Indri QL; passage task: k=1000 BM25)

  Task                  | Training data               | Test data                  | Corpus
  1) Document retrieval | 367K queries w/ doc labels  | 43* queries w/ doc labels  | 3.2M documents
  2) Passage retrieval  | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
  * Mostly-overlapping query sets (41 shared)

(A sketch of the run format used for submissions follows below.)
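Whichever subtask and run category a group chooses, a submission is simply a ranking per query in the standard six-column TREC run format (query id, the literal "Q0", document or passage id, rank, score, run tag). A minimal writer, with made-up ids purely for illustration:

```python
def write_trec_run(ranked_lists, run_tag, path, depth=1000):
    """Write rankings in the standard six-column TREC run format:
    qid Q0 docid rank score run_tag. `ranked_lists` maps each query id
    to a list of (doc_id, score) pairs, highest score first."""
    with open(path, "w", encoding="utf-8") as out:
        for qid, docs in ranked_lists.items():
            for rank, (doc_id, score) in enumerate(docs[:depth], start=1):
                out.write(f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_tag}\n")

# Illustrative single-query run with two retrieved documents (ids invented).
write_trec_run({"1037798": [("D1555982", 12.3), ("D59221", 11.7)]},
               run_tag="my_nnlm_run", path="run.trec")
```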
5. Dataset availability
• Corpus, training and dev data for both tasks are available now from the DL Track site*
• NIST test sets are available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
6. Let's talk: Baselines, bias, overfitting, replicability
• Our 2019 test sets can be reused in future papers
  • Judging is sufficiently complete: diverse pools, HiCAL
• Risk: people make decisions using the test set (i.e. overfit)
• Safer: submit to TREC 2020 to really prove your point
  • One-shot submission, before the labels even exist
  • Submit runs, or even better, submit a Docker image
• A good TREC track has:
  • Many types of model (no cherry-picked comparisons)
  • With proper optimization (no straw men)
  • And full reporting of results (no publication bias)
  • On an unseen test set (no overfitting)
7. Participation
8. Popular in 2019: Transfer learning
• Can be used in "nnlm" and "nn" runs:
  • "nnlm" if the pretrained model is BERT or similar (scoring sketch below)
  • "nn" if the pretrained model is word2vec or similar
• https://ruder.io/state-of-transfer-learning-in-nlp/
• [Diagram: pretrained model + our large IR data → information retrieval system]
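As a concrete (and hedged) illustration of the "nnlm" category, the sketch below scores a query-passage pair with a BERT cross-encoder via the Hugging Face transformers library. The checkpoint name is a placeholder, and the model would still need fine-tuning on the track's training labels before its scores are meaningful; participants' actual systems varied.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A generic BERT cross-encoder: query and passage are concatenated into one
# sequence and a single relevance logit is predicted from the [CLS] position.
MODEL_NAME = "bert-base-uncased"   # placeholder; fine-tune on the track's training data first
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def score(query: str, passage: str) -> float:
    """Relevance score for one query-passage pair (higher = more relevant)."""
    inputs = tokenizer(query, passage, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

print(score("what is bm25", "BM25 is a ranking function used by search engines."))
```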
9. Official NIST test sets: relevance label distributions
• Document task:
  rel | #doc | %
  3   |  841 |  5%
  2   | 1149 |  7%
  1   | 4607 | 28%
  0   | 9661 | 59%
• Passage task:
  rel | #pass | %
  3   |   697 |  8%
  2   |  1804 | 19%
  1   |  1601 | 17%
  0   |  5158 | 56%
• For metrics that need binary labels, the graded judgments are binarized at a task-specific cutoff on this scale (loader and binarization sketch below)
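Test-set judgments are distributed in the standard TREC qrels format (query id, iteration, document id, graded relevance). The sketch below loads such a file and binarizes it for binary metrics such as MRR and MAP; the file name and the threshold value are placeholders, and the official per-task cutoffs should be taken from the track overview.

```python
from collections import defaultdict

def load_qrels(path):
    """Parse a TREC qrels file: 'query_id iteration doc_id relevance'."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _iteration, doc_id, rel = line.split()
            qrels[qid][doc_id] = int(rel)
    return dict(qrels)

def binarize(qrels, threshold):
    """Map graded labels to 0/1 for binary metrics. The threshold is
    task-specific; check the track overview for the official cutoff
    rather than relying on this example value."""
    return {qid: {doc: int(rel >= threshold) for doc, rel in docs.items()}
            for qid, docs in qrels.items()}

# Example only: placeholder file name and threshold.
doc_qrels_binary = binarize(load_qrels("qrels_docs.txt"), threshold=1)
```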
10-11. Document retrieval task
• Main metric: NDCG@10, which focuses on the top of the ranking and avoids binarizing labels (NDCG@10 sketch below)
• "nnlm" runs tend to perform best, with significant wins over "trad"
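For reference, a minimal NDCG@10 implementation using the common exponential-gain formulation is sketched below. The official track numbers come from trec_eval, whose gain convention may differ slightly, so treat this as illustrative rather than as the evaluation tool itself.

```python
import math

def ndcg_at_k(ranked_doc_ids, graded_qrels, k=10):
    """NDCG@k with exponential gain (2^rel - 1) and log2 rank discount.
    `ranked_doc_ids` is one query's system ranking, best first;
    `graded_qrels` maps doc_id -> graded label for that query."""
    dcg = sum((2 ** graded_qrels.get(doc, 0) - 1) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_doc_ids[:k], start=1))
    ideal = sorted(graded_qrels.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the perfect (rel=3) document is ranked second.
print(ndcg_at_k(["d3", "d1", "d2"], {"d1": 3, "d2": 2, "d3": 0}, k=10))
```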
12. Queries
13-14. Passage retrieval task
• Main metric: NDCG@10, which focuses on the top of the ranking and avoids binarizing labels
• "nnlm" runs tend to perform best, with even more significant wins over "trad" than in the document task
15. Queries
16. Subtasks: "fullrank" vs "rerank"
• Many IR systems are multi-stage: for example, a first-stage candidate generator followed by one or more rerankers
• "rerank" subtask: the first stage is shared
  • Document task: rerank top-100 Indri QL
  • Passage task: rerank top-1000 BM25
  • Advantages: easier to participate; reduces variability
• "fullrank" subtask: end-to-end retrieval
  • Document task: retrieve from 3.2M documents
  • Passage task: retrieve from 8.8M passages
  • Advantages: can surface additional relevant results; lets you align the stages, or invent a single-stage end-to-end approach
(A sketch of a minimal rerank pipeline follows below.)
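A minimal version of the "rerank" subtask can be sketched as: read the shared first-stage run (Indri QL top-100 for documents, BM25 top-1000 for passages, in the same six-column run format), rescore each candidate with a neural scorer such as the cross-encoder sketched earlier, and sort by the new score. The helper names here (`load_trec_run`, `scorer`, and the earlier `write_trec_run`) are assumptions of this sketch, not official track tooling.

```python
from collections import defaultdict

def load_trec_run(path):
    """Read a six-column TREC run file into query_id -> [doc_id, ...] in rank order."""
    candidates = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _q0, doc_id, _rank, _score, _tag = line.split()
            candidates[qid].append(doc_id)
    return dict(candidates)

def rerank(first_stage_path, queries, texts, scorer, k=100):
    """Rescore the shared first-stage top-k with a neural scorer and re-sort.
    `queries` maps qid -> query text; `texts` maps doc/passage id -> text;
    `scorer(query, text)` returns a float (e.g. the cross-encoder above)."""
    candidates = load_trec_run(first_stage_path)
    reranked = {}
    for qid, doc_ids in candidates.items():
        scored = [(doc_id, scorer(queries[qid], texts[doc_id]))
                  for doc_id in doc_ids[:k]]
        reranked[qid] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return reranked   # pass the result to write_trec_run(...) to produce a submission
```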
17-18. Metrics analysis: Passage ranking
19-20. Metrics analysis: Document ranking
21. Conclusion
• Two large datasets with 300+K training queries each
• Reusable NIST test sets
• "nnlm" runs do well
• "fullrank" and "rerank" results are not that different this year
• DL Track 2020: repeat, and continue work on large data and transfer learning
22. Real-world implications
• BM25 is 25 years old (introduced in TREC-3)
  • Used in many TRECs
  • Used in many products ("BM25F" too)
  • Robust and easy to use (BM25 sketch below)
• "nnlm"
  • Used in one TREC so far. Soundly beaten next year?
  • Used in products, perhaps?
  • Robust? Easy to use? Transferable?
• What is your go-to ranker today?
• What is your go-to ranker in 10 years?
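Since BM25 keeps coming up as the "trad" baseline, here is the classic Okapi BM25 scoring function in a few lines. The k1 and b values are common defaults, not the tuned settings of any particular TREC run, and real systems compute it over an inverted index rather than per document pair.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Okapi BM25 for one query-document pair. `doc_freqs` maps a term to the
    number of documents containing it; k1 and b are typical default values."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```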
