
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - Tom Burton-West



  1. HathiTrust
     • HathiTrust is a shared digital repository with 140+ member libraries
     • Large Scale Search is one of many services built on top of the repository
     • Currently about 17 million scanned and OCRed books (5 billion pages)
     • About 1 petabyte of data
       • Preservation page images: JPEG 2000 and TIFF (910+ TB)
       • OCR and metadata: about 16 TB
  2. Large Scale Search Challenges
     • Goal: design a full-text search system that will scale to 20 million volumes (at a reasonable cost)
     • Multilingual collection (400+ languages)
     • Variable quality of OCR and metadata
     • No good data on where chapters begin and end in books
     • Relevance challenges:
       • Index 5 billion pages (lots of short documents) or 17 million books (fewer, very long documents)?
       • Temporary solution: a two-tiered index
         • Index all 17 million books, with the whole book as the Solr document
         • Index individual books on the fly, with the page as the Solr document
       • How to combine relevance weights for full text (a long field) with library metadata (short fields: title, subject, author, ...)?
  3. Full-text weights vs. Title, Subject and Author (metadata)
     • Example (simplified):
       q=_query_:"{!edismax qf='ocr^5000 AnyMetaDataField^2 TitleField^50 AuthorField^80 SubjectField^50' mm='100%' tie='0.9'}"
  4. Problem: Solr relevance defaults are designed for small documents. HathiTrust has long documents!

     Collection           Size       Documents      Average doc size
     HathiTrust           13 TB      17 million     760 KB
     ClueWeb09 (B)        1.2 TB     50 million     25 KB
     TREC GOV2            0.456 TB   25 million     18 KB
     TREC ad hoc          0.002 TB   0.75 million   3 KB
     HathiTrust (pages)   13 TB      5,000 million  2 KB

     • The average HathiTrust document is 760 KB and contains over 100,000 words.
     • The estimated size of the 17-million-document collection is 13 TB.
     • The average HathiTrust document is about 30 times larger than the 25 KB average document size used in large research test collections, and over 100 times larger than in TREC ad hoc.
     [Chart: average document size (KB) for HathiTrust, ClueWeb09 (B), TREC GOV2, NW1000G, SPIRIT]
  5. Default settings for ranking algorithms are not appropriate for HathiTrust's long documents
     • The BM25 default parameter k1 = 1.2 is tuned for small documents (k1 = 1, 2, 3): after about 50 occurrences of a term, additional occurrences have no impact on the score.
     • With k1 adjusted for large documents (k1 = 10, 20, 30), up to about 500 occurrences still affect the score.
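The k1 saturation effect above can be illustrated directly with the BM25 term-frequency formula. This is a sketch for intuition, not code from the talk; the helper name `bm25_tf` and the chosen document lengths are my own.

```python
# Illustrative BM25 term-frequency saturation for two k1 values.
# Length normalization (the b parameter) is held at its neutral point
# by setting the document length equal to the average length.

def bm25_tf(tf, k1, b=0.75, dl=300_000, avgdl=300_000):
    """BM25 tf component: tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

for k1 in (1.2, 10.0):
    s50 = bm25_tf(50, k1)
    s500 = bm25_tf(500, k1)
    # With k1 = 1.2 the score is nearly flat past ~50 occurrences;
    # with k1 = 10 the 500th occurrence still adds noticeably.
    print(f"k1={k1}: tf=50 -> {s50:.3f}, tf=500 -> {s500:.3f}, gain {s500 / s50:.3f}x")
```

With the default k1 = 1.2 the gain from 50 to 500 occurrences is only about 2%, while with k1 = 10 it is closer to 18%, matching the slide's point that a larger k1 lets high term counts in long books keep influencing the score.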
  6. The Solr/Lucene 4.0 default (tf*idf) algorithm ranks short documents too high
     • Query = "dog" in full text
     • Average document length for the top 25 hits = 44 pages (HathiTrust average: 300 pages)
     • Average total words = 8,000 (HathiTrust average = 100,000)
     • Mostly children's books and poetry
     • Not many words per page due to illustrations, formatting, etc.
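One way to see this short-document bias: Lucene's classic similarity scores a single-term query roughly as sqrt(term frequency) times a length norm of 1/sqrt(document length), so a thin book can outrank a long book with ten times as many matches. The numbers below are illustrative assumptions, not measurements from the talk.

```python
import math

# Sketch of the length bias in Lucene's pre-BM25 classic (tf*idf) similarity:
# score ~ sqrt(freq) * idf * 1/sqrt(doc_len); idf is constant for one term,
# so it is dropped here.

def classic_score(freq, doc_len):
    return math.sqrt(freq) / math.sqrt(doc_len)

short = classic_score(freq=40, doc_len=8_000)     # a thin, illustrated book
long_ = classic_score(freq=400, doc_len=100_000)  # a typical HathiTrust volume

# The long book has 10x as many matches, yet scores lower.
print(short, long_, short > long_)
```

Under this model the density of matches, not their count, dominates, which is why sparse children's books and poetry float to the top for a query like "dog".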
  7. Compare Lucene 4 default (tf*idf) with BM25 using interleaving
     • Balanced interleaving:
       • Choose A or B to go first based on a coin flip (A wins the toss in this example)
       • Alternate top results from result lists A and B
       • If a result is in both lists, only insert it at the higher rank
       • Scoring: count how many clicks are on A's results and how many on B's
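The interleaving procedure described on the slide can be sketched in a few lines. This is a simplified illustration of that description, with names and the sample click log of my own invention, not the production implementation.

```python
def balanced_interleave(a, b, a_first=True):
    """Alternate results from ranked lists a and b, keeping each document
    only at the highest rank where it appears (per the slide). Returns the
    merged list and which system each shown result came from."""
    merged, team, seen = [], {}, set()
    ia = ib = 0
    take_a = a_first  # the coin-flip winner contributes first
    while ia < len(a) or ib < len(b):
        if (take_a and ia < len(a)) or ib >= len(b):
            doc, ia, src = a[ia], ia + 1, "A"
        else:
            doc, ib, src = b[ib], ib + 1, "B"
        if doc not in seen:  # a duplicate keeps only its earlier (higher) rank
            seen.add(doc)
            merged.append(doc)
            team[doc] = src
            take_a = not take_a  # hand over the turn after a new contribution
    return merged, team

# Scoring: credit each click to the system that contributed the clicked result.
merged, team = balanced_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
clicks = ["d2", "d4"]  # hypothetical click log
wins = {"A": 0, "B": 0}
for doc in clicks:
    wins[team[doc]] += 1
print(merged, wins)  # here both clicked results came from list B
```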
  8. Interleaving results: default tf*idf beats BM25
     • Why?
       • BM25 was not tuned properly
       • The relative weights of full-text vs. metadata fields are affected by BM25
       • Thanks to Tom Burgman's Lucene Revolution 2017 presentation (k1 tuning)
       • We suspect BM25 is working better for matches in full text but not giving enough weight to matches on Author, Title and Subject.
     • Next steps:
       • Analyze click logs
       • Try to detect known-item (Author or Title) searches and correlate with the choice of A vs. B
       • Is known-item search detection at search time possible? A type of intent classification.
       • Tweak BM25 parameters and re-test
  9. A bigger problem than tuning BM25 for long documents
     • Relevant parts of books: term frequency in chapters vs. the whole book
     • Montemurro and Zanette (2009)
  10. Next steps
     • Work on the short-term problem of tuning BM25 and the weights for full-text vs. metadata fields
     • Investigate Solr grouping to index parts of books and rank each book by a function of the scores of its parts
     • Test the scalability of Solr grouping
     • Investigate Solr ranking functions for grouping
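The grouping idea could look roughly like the Solr request parameters below: index each part (chapter or page) as its own document carrying its book's id, then collapse parts back into books at query time. The field names (`book_id`, `ocr`) are hypothetical placeholders, not from the talk.

```python
# Hedged sketch of Solr result-grouping parameters for part-of-book indexing.
# Field names (book_id, ocr) are placeholders, not from the talk.
params = {
    "q": "ocr:dog",
    "group": "true",
    "group.field": "book_id",    # collapse parts of the same book
    "group.limit": "3",          # return the top 3 matching parts per book
    "group.sort": "score desc",  # order parts within a book by score
    # Groups themselves are ordered by the main "sort" (score by default)
    # applied to each group's top document, so a book ranks by its single
    # best-scoring part; ranking by some other function of the parts'
    # scores is what the "Solr ranking functions" investigation is about.
}
print("&".join(f"{k}={v}" for k, v in params.items()))
```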
  11. Thank You!
      Tom Burton-West