Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - Tom Burton-West

HathiTrust
• HathiTrust is a shared digital repository
• 140+ member libraries
• Large Scale Search is one of many services built on top of the repository
• Currently about 17 million scanned and OCRed books (5 billion pages)
• About 1 petabyte of data
  • Preservation page images: JPEG 2000 and TIFF (910 TB+)
  • OCR and metadata (about 16 TB)
Large Scale Search Challenges
• Goal: design a system for full-text search that will scale to 20 million volumes (at a reasonable cost)
• Multilingual collection (400+ languages)
• Variable quality of OCR and metadata
• No good data on where chapters begin and end in books
• Relevance challenges:
  • Index 5 billion pages (lots of short documents) or 17 million books (fewer, very long documents)?
  • Temporary solution: a 2-tiered index
    • Index all 17 million books with the whole book as the Solr document
    • Index individual books with the page as the Solr document, on the fly
  • How to combine relevance weights for full text (long field) with library metadata (short fields: title, subject, author, ...)
Full-text weights vs Title, Subject and Author (Metadata)
• Example (simplified):

q=_query_:"{!edismax
    qf='ocr^5000
        AnyMetaDataField^2
        TitleField^50
        AuthorField^80
        SubjectField^50'
    mm='100%'
    tie='0.9'"
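
For concreteness, here is a minimal sketch of how such a weighted edismax query might be submitted to Solr over HTTP. The host, core name, and query text are assumptions for illustration, not HathiTrust's actual configuration:

import requests

# Hypothetical Solr endpoint; HathiTrust's real deployment details
# are not given in the talk.
SOLR_URL = "http://localhost:8983/solr/books/select"

params = {
    "defType": "edismax",
    "q": "origin of species",
    # Full text (OCR) weighted far above the short metadata fields,
    # mirroring the simplified example above.
    "qf": "ocr^5000 AnyMetaDataField^2 TitleField^50 AuthorField^80 SubjectField^50",
    "mm": "100%",  # all query terms must match
    "tie": "0.9",  # non-best fields contribute 90% of their score
    "wt": "json",
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
for doc in docs:
    print(doc.get("id"))

A tie value close to 1.0 makes edismax behave more like a sum over fields than a pure max, which matters when full-text and metadata fields should both contribute to the score.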
Problem: Solr relevance defaults are designed for small documents. HathiTrust has long documents!

Collection           Size       Documents      Average Doc Size
HathiTrust           13 TB      17 million     760 KB
ClueWeb09 (B)        1.2 TB     50 million     25 KB
TREC GOV2            0.456 TB   25 million     18 KB
TREC ad hoc          0.002 TB   0.75 million   3 KB
HathiTrust (pages)   13 TB      5,000 million  2 KB
• The average HathiTrust document is 760 KB and contains over 100,000 words.
• The estimated size of the 17-million-document collection is 13 TB.
• The average HathiTrust document is about 30 times larger than the 25 KB average document size of large research test collections, and over 100 times larger than TREC ad hoc.

[Bar chart: Average Doc Size (KB) for HathiTrust, ClueWeb09 (B), TREC Gov2, NW1000G, and Spirit]
Default settings for ranking algorithms are not appropriate for HathiTrust's long documents
• BM25 default parameter k1 = 1.2. For small documents, typical values are k1 = 1, 2, 3; after about 50 occurrences of a term, further occurrences have no impact on the score.
• k1 adjusted for large documents: k1 = 10, 20, 30; up to about 500 occurrences still affect the score.
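
To see what k1 controls, here is a small sketch of BM25's term-frequency saturation (document-length normalization omitted, i.e. b = 0, to isolate the k1 effect); the occurrence counts quoted above are roughly where these curves flatten out:

# BM25 tf component: tf * (k1 + 1) / (tf + k1). As tf grows the score
# approaches its ceiling of k1 + 1, so larger k1 means later saturation.
def bm25_tf(tf: int, k1: float) -> float:
    return tf * (k1 + 1) / (tf + k1)

for k1 in (1.2, 10.0, 30.0):
    for tf in (10, 50, 100, 500):
        frac = bm25_tf(tf, k1) / (k1 + 1)  # fraction of ceiling reached
        print(f"k1={k1:<5} tf={tf:<4} {frac:.1%} of maximum")

With k1 = 1.2 the score is already within a few percent of its ceiling at 50 occurrences; with k1 = 10 or 30 it is still climbing at several hundred.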
Solr/Lucene 4.0 default (tf*idf) algorithm ranks short documents too high
• Query = "dog" in full text
• Average document length for the top 25 hits = 44 pages (HT average: 300 pages)
• Average total words = 8,000 (HT average = 100,000)
• Mostly children's books and poetry
• Not many words per page, due to illustrations, formatting, etc.
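
One illustration of why the classic similarity behaves this way (a sketch, not from the slides): Lucene's classic tf*idf multiplies by a length norm of 1/sqrt(doc length), which rewards short documents far more aggressively than BM25's pivoted length normalization. With IDF dropped (it is the same term in both cases):

import math

# Lucene classic similarity pieces relevant here:
#   tf weight = sqrt(tf), lengthNorm = 1 / sqrt(doc_length)
def classic(tf: int, doc_len: int) -> float:
    return math.sqrt(tf) / math.sqrt(doc_len)

# BM25 tf component with pivoted length normalization (defaults shown)
def bm25(tf: int, doc_len: int, avgdl: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return tf * (k1 + 1) / (tf + norm)

AVGDL = 100_000  # approximate average words per HathiTrust volume
# A short illustrated book vs. a typical long volume, both mentioning "dog"
print(f"classic: {classic(20, 8_000):.4f} vs {classic(100, 100_000):.4f}")
print(f"bm25:    {bm25(20, 8_000, AVGDL):.3f} vs {bm25(100, 100_000, AVGDL):.3f}")

Under the classic formula the 8,000-word book scores well above the 100,000-word one despite far fewer matches; under BM25 the two come out nearly even.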
Compare Lucene 4 default (tf*idf) with BM25 using Interleaving
• Balanced Interleaving
  • Choose whether A or B goes first based on a coin flip (A wins the toss in this example)
  • Alternate top results from result lists A and B
  • If a result is in both lists, insert it only at its highest rank
  • Scoring: count how many clicks land on A's results and how many on B's
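
A minimal sketch of balanced interleaving as described above (simplified relative to the published algorithm; tie handling and credit assignment are more careful in practice):

import random

def balanced_interleave(a: list, b: list, k: int) -> list:
    """Merge ranked lists A and B into one k-item list, tagging each
    result with the ranker that contributed it."""
    a_first = random.random() < 0.5  # coin flip: which list leads
    out, seen = [], set()
    ia = ib = 0

    def skip_seen(lst, i):
        # A result already inserted from the other list stays at its
        # higher rank and is not inserted again.
        while i < len(lst) and lst[i] in seen:
            i += 1
        return i

    while len(out) < k:
        ia, ib = skip_seen(a, ia), skip_seen(b, ib)
        if ia >= len(a) and ib >= len(b):
            break
        a_turn = (len(out) % 2 == 0) == a_first
        take_a = ia < len(a) and (a_turn or ib >= len(b))
        if take_a:
            out.append((a[ia], "A"))
            seen.add(a[ia])
            ia += 1
        else:
            out.append((b[ib], "B"))
            seen.add(b[ib])
            ib += 1
    return out

# Clicks on "A"-tagged vs "B"-tagged results decide which ranker wins.
print(balanced_interleave(["d1", "d2", "d3"], ["d2", "d4", "d5"], k=5))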
Interleaving results: Default tf*idf beats BM25
• Why?
  • BM25 not tuned properly
  • Relative weights of full-text vs metadata fields are affected by BM25
  • Thanks to Tom Burgmans' Lucene Revolution 2017 presentation (k1 tuning)
  • We suspect BM25 works better for matches in the full text but does not give enough weight to matches on Author, Title, and Subject.
• Next steps
  • Analyze click logs
  • Try to detect known-item (Author or Title) searches and correlate them with the choice of A vs B (see the sketch below)
  • Is known-item search detection possible at search time? It is a type of intent classification.
  • Tweak BM25 parameters and test
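
A hedged sketch of what detecting known-item searches in the click logs might look like. The heuristic, field names, and threshold here are illustrative assumptions, not the team's actual method:

def looks_like_known_item(query: str, title: str, author: str) -> bool:
    """Heuristic: if nearly every query term appears in the clicked
    record's Title or Author fields, the search was probably for a
    known item rather than a topical full-text search."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return False
    meta = set(title.lower().split()) | set(author.lower().split())
    return len(q_terms & meta) / len(q_terms) >= 0.8  # assumed cutoff

# Correlate the flag with interleaving outcomes: does tf*idf (A) win
# clicks mainly on known-item queries while BM25 (B) wins topical ones?
print(looks_like_known_item("origin of species darwin",
                            "On the Origin of Species",
                            "Charles Darwin"))  # True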
Bigger problem than tuning BM25 for long documents
Relevant parts of books: term frequency in chapters vs the whole book
(Montemurro and Zanette, 2009)
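
An illustration of the dilution problem (the numbers are made up): a term concentrated in a single chapter can look unremarkable at whole-book scale while dominating at chapter scale:

# Hypothetical 20-chapter book where "whaling" appears almost
# entirely in chapter 5.
chapter_tf = [0, 1, 0, 0, 85, 2, 0, 0, 1, 0,
              0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
chapter_len = [5_000] * 20  # words per chapter

book_rate = sum(chapter_tf) / sum(chapter_len)
best_chapter_rate = max(tf / n for tf, n in zip(chapter_tf, chapter_len))
print(f"whole book:   {book_rate:.5f}")          # 0.00090
print(f"best chapter: {best_chapter_rate:.5f}")  # 0.01700, ~19x higher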
Next steps
• Work on the short-term problem of tuning BM25 and the weights for full-text vs metadata fields
• Investigate Solr grouping to index parts of books and rank each book by a function of the scores of its parts (see the sketch below)
  • Test the scalability of Solr grouping
  • Investigate Solr ranking functions for grouping
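
A hedged sketch of what a grouped page-level query could look like in Solr. The collection name and the page_ocr and book_id fields are assumptions for illustration:

import requests

# Hypothetical page-level index: each Solr document is one page,
# with a book_id field linking pages back to their volume.
params = {
    "q": "page_ocr:whaling",
    "group": "true",
    "group.field": "book_id",  # one result group per book
    "group.limit": "3",        # keep the top 3 pages per book
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/pages/select", params=params)
for g in resp.json()["grouped"]["book_id"]["groups"]:
    # By default a group ranks by its single best-scoring page; ranking
    # a book by other functions of its parts' scores needs more work.
    print(g["groupValue"], g["doclist"]["numFound"])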
Tom Burton-West
tburtonw@umich.edu
www.hathitrust.org/blogs/large-scale-search
Thank you!