Benchmarking search relevance in industry vs academia
Nick Craswell
Principal Group Science Manager
Microsoft WebXT
Benchmarking search relevance
• Search task: Retrieve documents in response to a query
• Benchmark data: Queries, Corpus, Judgments (a test collection)
• Application-specific benchmarks -> Lots of room for optimization + ML, e.g. incorporating temporal factors in a news search product
• Core IR benchmarks (flat Q, flat D) -> Not always making progress?*
• Core IR task is important
• Unsolved. Fundamental. Building block
• Need benchmarks to encourage progress
* Armstrong, Moffat, Webber, Zobel. Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM 2009
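A test collection of queries, corpus and judgments is typically scored with a rank-based measure such as average precision. The sketch below shows one common way to compute it; the document ids, query ids and function names are hypothetical, and this is an illustration rather than the evaluation code used for any result cited in these slides.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list, given the set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run, judgments):
    """Mean of average precision over all judged queries."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy test collection: judgments map query -> relevant doc ids,
# and the run maps query -> ranked list of retrieved doc ids.
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}
run = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d7"]}
print(mean_average_precision(run, judgments))
```

Relevant documents that are never retrieved still count in the denominator, so a system is penalized for missing them.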
What does progress look like?
Chris Buckley, Mandar Mitra, Janet A. Walz, and Claire Cardie. "SMART high precision: TREC 7." NIST Special Publication 500-242 TREC-7 (1999)
[Figure: average precision (y-axis, 0 to 0.6) on the TREC-1 through TREC-7 tasks (TREC-6 shown as Task TD and Task D) for the TREC-1 system (1992) and the TREC-7 system (1998), annotated "Progress".]
Yang, Wei, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the “Neural Hype”: Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. SIGIR 2019.
Three comments on this:
A. Test data is reused too much
B. Baseline is unclear
C. Not enough training data
A. Avoiding test data reuse
• Using multiple querysets in industry
• Make many decisions using queryset 1, few on 2, none on 3
• Refresh querysets often
• Academia: 1) Multiple test collections, 2) Leaderboards can reduce iteration, 3) Most convincing is one-time submission (e.g. TREC)
• Thought experiment:
• Queryset 1: Find an improvement
• Queryset 2: Choose a release candidate
• Queryset 3: Post-release measurement
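The thought experiment can be simulated. The sketch below assumes, purely for illustration, that every candidate ranker has the same true per-query success rate, so any difference between candidates is noise. All constants and function names are hypothetical and nothing here reflects an actual production setup.

```python
import random

TRUE_QUALITY = 0.30      # assumption: every candidate has the same true success rate
NUM_CANDIDATES = 1000    # assumption: number of variants tried on queryset 1
SHORTLIST_SIZE = 10      # assumption: variants carried forward to queryset 2
QUERIES_PER_SET = 200    # assumption: size of each queryset


def evaluate(rng, num_queries=QUERIES_PER_SET, quality=TRUE_QUALITY):
    """Simulated mean per-query success of one candidate on one queryset."""
    successes = sum(rng.random() < quality for _ in range(num_queries))
    return successes / num_queries


def run_experiment(seed=0):
    rng = random.Random(seed)
    # Queryset 1: many decisions. Score every candidate and keep a shortlist.
    q1 = [evaluate(rng) for _ in range(NUM_CANDIDATES)]
    shortlist = sorted(range(NUM_CANDIDATES), key=lambda i: q1[i], reverse=True)
    shortlist = shortlist[:SHORTLIST_SIZE]
    # Queryset 2: few decisions. Pick the release candidate from the shortlist.
    q2 = {i: evaluate(rng) for i in shortlist}
    chosen = max(shortlist, key=lambda i: q2[i])
    # Queryset 3: no decisions. Measure the release candidate once.
    q3 = evaluate(rng)
    return {"queryset1": q1[chosen], "queryset2": q2[chosen], "queryset3": q3}


if __name__ == "__main__":
    print(run_experiment())
```

Because the release candidate was selected using its queryset 1 score, that score overstates its quality even though no candidate is actually better than another. The queryset 3 score was not used for any decision, so it is the one to trust.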
B. Production baseline
• Evaluate production ranker changes, which we want to deploy
• Pro: Avoid the weak baseline problem
• Con: Repeated incremental improvements increase complexity
• Pro: Improvements can add up
• Academic options:
• Not sure!
• Winners at TREC/leaderboard may be lucky. Strongest baseline is also lucky
• I would trust a high-ish baseline with statistically significant (SS) gains, e.g. two runs from one group
Ben Carterette. 2015. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. In SIGIR ’15.
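One way to check whether a gain over a baseline is statistically significant is a paired randomization (sign-flip) test on per-query scores. The slides do not say which test is meant, so the sketch below is one reasonable choice under that assumption. The function name, trial count and example scores are hypothetical.

```python
import random


def paired_randomization_test(baseline, treatment, num_trials=10000, seed=0):
    """Two-sided sign-flip test on per-query score differences.

    Both lists must hold scores for the same queries in the same order.
    Returns an estimated p-value for the null hypothesis of no difference.
    """
    if not baseline or len(baseline) != len(treatment):
        raise ValueError("runs must be non-empty and cover the same queries")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    for _ in range(num_trials):
        # Under the null, each per-query difference is equally likely to be + or -.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p of zero.
    return (extreme + 1) / (num_trials + 1)


# Hypothetical per-query scores for two runs from one group.
baseline_scores = [0.30] * 20
treatment_scores = [0.35] * 20
print(paired_randomization_test(baseline_scores, treatment_scores))
```

The test is paired because both runs are scored on the same queries, which removes per-query difficulty from the comparison.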
C. Get more data
In industry: 200K queries, human-labeled, proprietary
Academic data release (MS MARCO and TREC DL): 300+K queries, human-labeled, open
[Figure: "Better search results" (y-axis) against "More data" (x-axis).]
Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017
DNN vs 1990s IR
Artist’s impression of total victory
[Figure: average precision (y-axis, 0 to 0.6) on the TREC-1 through TREC-7 tasks (TREC-6 shown as Task TD and Task D) plus a blind test, for the TREC-1 through TREC-7 SMART systems and a "TREC-26+ DNN" system.]
Nick Craswell. Neural Models for Full Text Search: Could the improvements add up? WSDM 2017 Practice and Experience Talk
• We decided to release data: Labels, clicks, etc
• Public leaderboard and TREC track (and code)
• Part of a larger open effort “AI at Scale”
Our external ranking benchmarks
TREC Deep Learning Track
https://msmarco.org
[Figure: chart with series labeled BM25, BERT, and Leader.]
Conclusion: Industry perspective on academia
• Reusing test collections a lot is not something we’d advise
• Are you sure you made no decisions based on Robust04?
• What if you had another Robust04? Would your conclusions stand up?
• Submit to TREC: this is the most reliable way to avoid overfitting
• With large training data we can significantly beat 1990s methods on core IR tasks, e.g. with BERT-style DNN rankers
• Not sure how to handle baselines in academia
• Would trust an experiment where baseline is not too low and there’s a gain
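A second copy of a test collection is rarely available, but resampling its queries gives a rough stand-in for the question of whether a conclusion would hold on another one. The sketch below uses a bootstrap over per-query score differences. It is an illustrative approximation, not a procedure proposed in these slides, and the function name and defaults are hypothetical.

```python
import random


def bootstrap_win_rate(baseline, treatment, num_resamples=2000, seed=0):
    """Fraction of resampled querysets on which treatment beats baseline.

    Both lists must hold scores for the same queries in the same order.
    """
    if not baseline or len(baseline) != len(treatment):
        raise ValueError("runs must be non-empty and cover the same queries")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treatment)]
    wins = 0
    for _ in range(num_resamples):
        # Draw a same-sized queryset with replacement from the original queries.
        sample = rng.choices(diffs, k=len(diffs))
        if sum(sample) > 0:
            wins += 1
    return wins / num_resamples
```

A win rate close to 1.0 suggests the conclusion does not hinge on a few queries. A rate near 0.5 suggests a different sample of queries could easily reverse it. The bootstrap only reuses the queries already in hand, so it cannot replace a genuinely fresh test set or a one-time TREC submission.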
