Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - Tom Burton-West
2. • HathiTrust is a shared digital repository
• 140+ member libraries
• Large Scale Search is one of many services built on top of the repository
• Currently about 17 million scanned and OCRed books (5 billion pages)
• About 1 Petabyte of data
• Preservation page images: JPEG 2000, TIFF (910 TB+)
• OCR and metadata (about 16 TB)
HathiTrust
3. • Goal: Design a system for full-text search that will scale to 20 million
volumes (at a reasonable cost).
• Multilingual collection (400+ languages)
• Variable quality of OCR and metadata
• No good data on where chapters begin and end in books
• Relevance challenges
• Index 5 billion pages (lots of short documents) or 17 million books (fewer, very long documents)?
• Temporary solution: a two-tiered index (sketched below)
• Index all 17 million books with the whole book as a Solr document
• Index individual books on the fly with each page as a Solr document
• How to combine relevance weights for full-text (a long field) with library metadata (short fields: title, subject, author, …)
Large Scale Search Challenges
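A minimal sketch of how the two-tiered approach could be wired up, assuming a book-level Solr core and a separate page-level core that is populated on demand; the host, core names, and field names below are hypothetical placeholders, not the actual HathiTrust setup:

import requests

SOLR = "http://localhost:8983/solr"   # hypothetical Solr host

def search_books(user_query):
    # Tier 1: every volume is a single Solr document in a book-level core.
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": "ocr^5000 TitleField^50 AuthorField^80 SubjectField^50",
        "rows": 20,
        "wt": "json",
    }
    return requests.get(f"{SOLR}/books/select", params=params).json()

def search_pages(book_id, user_query):
    # Tier 2: the selected book's pages are indexed on the fly as individual
    # Solr documents, then searched to locate the matching pages within it.
    params = {
        "q": user_query,
        "fq": f"book_id:{book_id}",
        "rows": 10,
        "wt": "json",
    }
    return requests.get(f"{SOLR}/pages/select", params=params).json()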
4. • Example (simplified)
q= _query_:"{!edismax
    qf='ocr^5000
        AnyMetaDataField^2
        TitleField^50
        AuthorField^80
        SubjectField^50'
    mm='100%'
    tie='0.9' "
Full-text weights vs Title, Subject and Author (Metadata)
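The same simplified example can be expressed as plain request parameters instead of a nested _query_ string; the host, core name, and the query term "dog" are illustrative only, and the weights simply mirror the example above:

import requests

params = {
    "defType": "edismax",
    "q": "dog",   # illustrative query term
    "qf": "ocr^5000 AnyMetaDataField^2 TitleField^50 AuthorField^80 SubjectField^50",
    "mm": "100%",
    "tie": "0.9",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/books/select", params=params)
print(response.json()["response"]["numFound"])

With tie set to 0.9, the dismax score is close to a sum over all matching fields rather than a pure max, so the heavily boosted ocr field does not completely drown out the metadata fields.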
5. Problem: Solr relevance defaults are designed for small documents. HathiTrust has long documents!
Collection          Size      Documents      Average doc size
HathiTrust          13 TB     17 million     760 KB
ClueWeb09 (B)       1.2 TB    50 million     25 KB
TREC GOV2           0.456 TB  25 million     18 KB
TREC ad hoc         0.002 TB  0.75 million   3 KB
HathiTrust (pages)  13 TB     5,000 million  2 KB
• The average HathiTrust document is 760 KB and contains over 100,000 words.
• The estimated size of the 17 million document collection is 13 TB.
• The average HathiTrust document is about 30 times larger than the 25 KB average document size used in large research test collections.
• It is over 100 times larger than the TREC ad hoc average.
[Bar chart: Average Doc Size (KB) for HathiTrust, ClueWeb09 (B), TREC Gov2, NW1000G, and Spirit]
6. Default settings for ranking algorithms are not appropriate for HathiTrust's long documents
• BM25 default parameter k1 = 1.2; for small docs, k1 = 1, 2, or 3
• After about 50 occurrences, additional occurrences have no impact on the score
• k1 adjusted for large docs: k1 = 10, 20, or 30
• Up to 500 occurrences affect the score
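To see why k1 matters, here is a small sketch of the BM25 term-frequency saturation term, assuming the document length equals the average so the length-normalization factor drops out:

# BM25 tf component: tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl)).
# With dl == avgdl the length factor is 1, leaving tf*(k1+1) / (tf + k1).
def bm25_tf(tf, k1):
    return tf * (k1 + 1) / (tf + k1)

for k1 in (1.2, 10, 20, 30):
    # Fraction of the maximum possible contribution reached at each tf value.
    saturation = {tf: round(bm25_tf(tf, k1) / (k1 + 1), 2) for tf in (10, 50, 100, 500)}
    print(f"k1={k1}: {saturation}")

# With k1=1.2 the score is ~98% saturated by 50 occurrences; with k1=20 or 30,
# occurrences out to roughly 500 still move the score noticeably.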
7. Solr/Lucene 4.0 default (tf*idf) algorithm
ranks short documents too high
• Query = “dog” in full-text
• Average document length for top 25 hits = 44 pages (HT average: 300 pages)
• Average total words = 8,000 (HT average = 100,000)
• Mostly children’s books and poetry
• Not many words per page due to illustrations, formatting, etc.
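A rough illustration of why the classic tf*idf similarity behaves this way: its tf factor is sqrt(tf) and its length norm is 1/sqrt(fieldLength), so term density dominates. The counts below are illustrative, not taken from the HathiTrust index:

import math

def classic_score(tf, doc_len):
    # Simplified Lucene ClassicSimilarity contribution for one query term,
    # ignoring idf and boosts (identical for both documents here).
    return math.sqrt(tf) / math.sqrt(doc_len)

picture_book = classic_score(tf=50, doc_len=8_000)       # short book, dense in "dog"
long_volume  = classic_score(tf=200, doc_len=100_000)    # long book, more total matches
print(picture_book, long_volume)   # ~0.079 vs ~0.045: the short book ranks higher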
8. • Balanced Interleaving (see the sketch below)
• Choose A or B first based on a coin flip
• A wins the toss in this example
• Alternate top results from result lists A and B
• If a result is in both lists, only insert it at its highest rank
• Scoring: count how many clicks are on A results and how many on B
Compare Lucene 4 default (tf*idf) with BM25 using Interleaving
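A minimal sketch of the balanced-interleaving step described above; click attribution (counting clicks on A-sourced vs B-sourced results) happens downstream and is only noted in a comment:

import random

def balanced_interleave(results_a, results_b, k=10):
    a_first = random.random() < 0.5        # coin flip: which ranker contributes first
    interleaved, seen = [], set()
    ka = kb = 0
    while len(interleaved) < k and (ka < len(results_a) or kb < len(results_b)):
        # Draw from whichever list has contributed fewer results so far,
        # breaking ties with the coin flip.
        take_a = (kb >= len(results_b)) or (
            ka < len(results_a) and (ka < kb or (ka == kb and a_first)))
        if take_a:
            doc, src, ka = results_a[ka], "A", ka + 1
        else:
            doc, src, kb = results_b[kb], "B", kb + 1
        if doc not in seen:                # a doc in both lists keeps its highest rank
            seen.add(doc)
            interleaved.append((doc, src))
    return interleaved

# Scoring: show the interleaved list to users and count how many clicks land on
# results labelled "A" versus results labelled "B".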
9. • Why?
• BM25 not tuned properly
• Relative weights of full-text vs metadata fields affected by BM25
• Thanks to Tom Burgman’s Lucene Revolution 2017 presentation (k1 tuning)
• We suspect BM25 is working better for matches in full-text but not giving enough weight to matches on Author, Title, and Subject.
• Next steps
• Analyze click logs
• Try to detect known-item (author or title) searches and correlate with the choice of A vs B
• Is known-item search detection at search time possible?
• A type of intent classification
• Tweak BM25 params and test
Interleaving results: Default tf*idf beats BM25
10. A bigger problem than tuning BM25 for long documents
Relevant parts of books: TF in chapters vs the whole book
Montemurro and Zanette (2009)
11. • Work on the short-term problem of tuning BM25 and the weights for full-text vs metadata fields
• Investigate Solr grouping to index parts of books and rank each book by a function of the scores of its parts (see the sketch below)
• Test scalability of Solr grouping
• Investigate Solr ranking functions for grouping
Next steps
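A sketch of what a grouped query over part-level documents might look like, assuming books are split into chapter- or page-sized Solr documents sharing a hypothetical book_id field; the group, group.field, group.limit, and group.ngroups parameters are standard Solr grouping, everything else (host, core, fields, query term) is illustrative:

import requests

params = {
    "q": "dog",                    # illustrative query term
    "defType": "edismax",
    "qf": "ocr^5000 TitleField^50 AuthorField^80 SubjectField^50",
    "group": "true",
    "group.field": "book_id",      # one group of part-level hits per book
    "group.limit": 3,              # keep the top-scoring parts of each book
    "group.ngroups": "true",       # report how many distinct books matched
    "rows": 20,
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/parts/select", params=params)

By default Solr orders the groups by the score of the best-scoring part in each book, so ranking a book by some other function of its parts' scores is the "ranking functions for grouping" question in the bullets above.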