Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010

Two presentations from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19, given by the Cengage Learning search platform development team.

Scaling Performance Tuning With Lucene by John Nader discusses the primary performance hot spots involved in scaling to a multi-million-document collection, including the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. It covers both the tools used to identify issues and the techniques used to address them.

Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. It goes over an IBM TREC paper and describes the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, and Term Frequency Average Normalization. It also covers Pseudo Relevance Feedback.

Speaker notes
  • Basic Services are similar to SOLR; Partitions are similar to SOLR shards.
  • 50% -> 160 ms vs 100 ms; 95% -> 230 ms vs 200 ms
  • 50% -> 205 ms vs 100 ms; 95% -> 260 ms vs 200 ms
  • Lexical affinities (LAs) represent the correlation between words co-occurring in a document; LAs are identified by looking at pairs of words found in close proximity to each other. Stemming is the process of reducing inflected (or sometimes derived) words to their stem. Term Frequency Average Normalization: freq(t, d) is the frequency of t in d, and avgFreq(d) is the average of freq(t, d) over all terms t of document d. This distinguishes terms that highly represent a document from terms that are part of duplicated text. In addition, the logarithmic formula has much smaller variation than Lucene’s original square-root-based formula, so the effect of freq(t, d) on the final score is smaller.
  • Terms with high term frequencies (default min 2); terms with low document frequencies (default min 5)
  • For example, if you parse a query of "foo-bar", the query parser will generate a phrase query of "foo bar": there was no whitespace, but the analyzer generated two tokens, foo and bar, each with a position increment. Can you try running some tests (CJK, SmartChinese, whatever), except at query time, and only at query time, add a PositionFilter to the end of your analyzer (http://lucene.apache.org/java/2_9_0/api/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html)? This would cause "normal" queries to be generated for Chinese, even though it isn’t whitespace separated. We added a PositionFilter (query time only), which tokenizes the query properly. In addition to using the PositionFilter at query time, you need to re-enable the coordination factor (the query parser disables it in our case). (See the sketch below.)
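    A minimal sketch of the query-time-only wrapping described above; the class name is illustrative, and the delegate is whichever Chinese analyzer is under test (CJK, SmartChinese, Paoding, ...):

        import java.io.Reader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.position.PositionFilter;

        // Used at query time only, so the QueryParser sees tokens at the same
        // position and builds a "normal" (non-phrase) query for Chinese text.
        public final class QueryTimePositionAnalyzer extends Analyzer {
            private final Analyzer delegate;

            public QueryTimePositionAnalyzer(Analyzer delegate) {
                this.delegate = delegate;
            }

            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new PositionFilter(delegate.tokenStream(fieldName, reader));
            }
        }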

Transcript

  • 1. Michigan Information Retrieval Enthusiasts Group Meetup Sponsored by Cengage Learning August 19, 2010
  • 2. Meetup Agenda
    • 6:30-7:00pm – Introductions
        • Cengage Learning, Duane May
    • 7:00-8:30pm – Main Presentations:
        • Scaling Performance Tuning With Lucene, John Nader
        • Relevancy Tuning Using TREC Data, Rohit Laungani, Ivan Provalov
    • 8:30-9:00pm – Open Discussion
  • 3. Scaling Lucene John Nader Software Developer
  • 4. Agenda
    • Platform Profile
    • Architecture Overview
    • Adventures in Scalability
    • Partitioning
    • Memory Usage
    • Performance Optimization
    • Wrap-up and Discussion
  • 5. Platform Profile
    • 179 Million Documents (4.5 TB)
      • Books, Periodicals, Multi-Media, Archives
    • 60M Terms across 150 Fields per Partition
    • 6000+ Content Sets - Filters
    • Content Rights Management – More Filters
    • Accurate Hit Counts
    • One Back-End for Multiple Products
    • Ranging from chatty portal products to pure library search
  • 6.  
  • 7.  
  • 8. Architecture
  • 9. Runtime Architecture (diagram; components: Web Products, Entity Services, Basic Services, Partitions 1-18, Ice, NFS, EMC)
  • 10. Partition Architecture (diagram; components: Search & Retrieve Services, Lucene (v3.0.2), Doc Vault, JRE 1.6, CPU: AMD 64-bit 8-core, RAM: 32 GB, SuSE Linux, Ice IIOP (ZeroC))
  • 11. Not Using SOLR….Yet
    • Implementation Pre-dates SOLR
      • Coupling between search and doc retrieval
      • Many features embedded in implementation
      • Currently Looking at how we could migrate to SOLR
  • 12. Adventures in Scaling
  • 13. Impact of Partitioning
    • Overall response time much larger than average partition response time.
    • Why?
      • Clue #1: It’s not the time to combine the results…
      • Clue #2: It’s not the network overhead…
      • Clue #3: Statistics catch up with you!
    • Broker must wait for slowest Partition
      • Given a 50% chance a partition responds within 200ms
      • Chance all partitions respond within 200ms is:
      • (0.5) ^ (number of partitions) (see the sketch after this list)
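    A minimal sketch of the arithmetic, assuming the 18 partitions shown in the runtime architecture:

        public class PartitionLatency {
            public static void main(String[] args) {
                double p = 0.5;        // P(a single partition responds within 200 ms)
                int partitions = 18;   // from the runtime architecture slide
                double allFast = Math.pow(p, partitions);
                // ~0.000004: the broker almost always waits on at least one slow partition
                System.out.printf("P(all %d partitions within 200 ms) = %.6f%n",
                        partitions, allFast);
            }
        }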
  • 14. Partition vs. Overall
  • 15. Partition vs. Overall
  • 16. Partition Conclusions
    • If you are considering a partition/shard strategy…
      • Plan for the overall response
      • Balance between gain of smaller partitions and the loss of waiting for slowest response
      • Also consider distribution and aggregation overhead.
  • 17. Memory Usage
    • Java Heap Shared Between Document Vault and Search
      • ~3GB for doc vault (entries list, meta-data)
      • ~3GB for content set filters (1.3M per bit set; see the sanity check after this list)
      • ~2GB Lucene (Terms, Field Cache)
      • ~1GB Custom Facet Browse Support ( future presentation? )
    • Lucene Memory Mapped Files
      • ~4GB
    • Memory Related Issues:
      • Long running GCs
      • No room for new features (content rights, custom products)
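    A quick sanity check on the 1.3M-per-bit-set figure above: with roughly 10 million documents per partition (179M documents spread over 18 partitions), a one-bit-per-document filter needs about 10,000,000 / 8 ≈ 1.25 MB, which matches the quoted ~1.3M per bit set.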
  • 18. Lucene Mapped Files
  • 19. Lucene Mapped Files
    • ~15% in memory
    • May grow with new products/features
    • Still have problems with restarts
      • Lucene 2.4.0 lazy-loads terms
      • Lucene 3.0.2 improved loading of terms
      • Still have issues with first result retrieval
      • Plan: Implement ‘Priming Queries’ (sketch below)
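    A minimal sketch of what a priming-query pass might look like; the field name and warm-up terms are illustrative assumptions, not taken from the deck:

        import java.io.IOException;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.TermQuery;

        public class IndexPrimer {
            // Run a few representative searches at startup so the term dictionary
            // and memory-mapped pages are resident before real traffic arrives.
            public static void prime(IndexSearcher searcher) throws IOException {
                String[] warmupTerms = {"history", "science", "medicine"};   // illustrative
                for (String t : warmupTerms) {
                    searcher.search(new TermQuery(new Term("TX", t)), 10);   // "TX" field is an assumption
                }
            }
        }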
  • 20. Java GC Issues
    • Measure First
      • Tools: Eclipse Heap Analysis Tool, jmap, jstat
      • Full scale load tests
      • Monitor production
    • Observations
      • Random Heap Spikes
      • Heap jump by GBytes in a few seconds
      • Continuous GC 5 - 30 minutes!
      • Node becomes unresponsive
  • 21. GC: Cause and Solution
    • Wildcard Query Expansion
      • e.g. KE:*
      • Iterates over all matching terms to construct query
      • Creates lots of objects ( at least in Lucene 2.4 )
      • Lucene documentation warns against using wildcards
    • Mitigation Strategies
      • Block wildcards on < 3 characters for large fields (see the sketch after this list)
      • Interrupt ‘runaway’ threads
      • Consider the TimedHitCollector (would like a TimedWildCardExpander)
    • Result: Significantly Reduced Heap Spikes
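    A minimal sketch of the short-wildcard guard; the large-field check and the way the literal prefix is measured are assumptions:

        public class WildcardGuard {
            // Reject wildcard queries with fewer than three literal characters on large fields.
            public static void check(String field, String queryText) {
                boolean largeField = "KE".equals(field);   // e.g. the KE field from the deck
                boolean hasWildcard = queryText.indexOf('*') >= 0 || queryText.indexOf('?') >= 0;
                String literal = queryText.replace("*", "").replace("?", "");
                if (largeField && hasWildcard && literal.length() < 3) {
                    throw new IllegalArgumentException(
                            "Wildcard needs at least 3 literal characters on field " + field);
                }
            }
        }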
  • 22. Additional GC Tuning
    • Java Heap & Memory Mapped Files
      • 20GB Heap + 4GB MMap keeps us out of swap
      • ~10GB hard usage (post-GC)
    • GC Still an issue
      • Full GC once or twice a day
      • 1 - 2 minute pauses
    • Actions:
      • Implemented Concurrent Mark Sweep GC
      • Lowered the GC threshold to 67% (illustrative flags after this list)
    • Result:
      • Full GC more frequently, but in much less time (15s)
      • Less than 1s pause during GC
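    An illustrative JVM configuration consistent with the actions above; the 20 GB heap, the CMS collector, and the 67% threshold come from the deck, while the exact flag set is an assumption:

        -Xmx20g
        -XX:+UseConcMarkSweepGC
        -XX:CMSInitiatingOccupancyFraction=67
        -XX:+UseCMSInitiatingOccupancyOnly

    UseCMSInitiatingOccupancyOnly keeps HotSpot from re-estimating the trigger point, so concurrent collections start predictably at 67% old-generation occupancy.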
  • 23. Performance Challenges
    • Some legacy code performs well at 1M docs, but not at 10M
    • Some new features created bottlenecks
    • Content Sets force excessive filtering
    • Hit counts force extra processing
  • 24. Example: Index Browser
    • Requirement: Retrieve a page of terms starting at a given term…
      • Get TermEnum - IndexReader.terms(term)
      • Get next 50 terms – TermEnum.next()
    • … now add limiting by Content Set (see the sketch after this list)
      • Iterate over each Term’s Documents
        • indexReader.termDocs(term)
        • termDocs.next()
      • Check each doc until at least one is found in Content Set
      • Repeat with next Term until you find 50 Terms
    • … now add counts by Term
      • Must iterate over all docs in Term
      • Test if in content set and count
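    A minimal sketch of the content-set-limited term browsing described above, using the Lucene 3.x TermEnum/TermDocs API; the field name, page size, and OpenBitSet content-set filter are assumptions:

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermDocs;
        import org.apache.lucene.index.TermEnum;
        import org.apache.lucene.util.OpenBitSet;

        public class IndexBrowser {
            // Return up to pageSize terms, starting at startTerm, that have at
            // least one document inside the content-set filter.
            public static List<String> browse(IndexReader reader, String field,
                    String startTerm, int pageSize, OpenBitSet contentSet)
                    throws IOException {
                List<String> page = new ArrayList<String>();
                TermEnum terms = reader.terms(new Term(field, startTerm));
                TermDocs termDocs = reader.termDocs();
                try {
                    do {
                        Term t = terms.term();
                        if (t == null || !t.field().equals(field)) {
                            break;                        // ran past the end of the field
                        }
                        termDocs.seek(terms);             // reuse a single TermDocs and seek
                        while (termDocs.next()) {         // worst case: scan every doc of the term
                            if (contentSet.get(termDocs.doc())) {
                                page.add(t.text());
                                break;
                            }
                        }
                    } while (page.size() < pageSize && terms.next());
                } finally {
                    terms.close();
                    termDocs.close();
                }
                return page;
            }
        }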
  • 25. Index Browser Problems
    • Bigger fields (e.g. Keywords) and Smaller Content Sets
    • Worst Case Scenario
      • Less than 50 docs in Content Set
      • Traverse all docs in all terms
      • 10M docs x 6M terms
    • Actually Slower on Lucene 3.0.2
    • Put the Lucene upgrade in jeopardy
  • 26. Index Browsing Solutions
    • Investigated Root Cause
      • Profiler, Stack Dumps, Lucene Source Code, and the Lucene Forum
    • Found two different issues
      • Lucene 3.0.2 added synchronization
      • Java 1.6.0_12 had JIT issues
    • Results
      • Forum suggested AllTermDocs and seek
      • Upgraded to Java 1.6.0_21
      • Performance improved by 10x
      • Looking at FilterIndexReader, FilterTermDocs
  • 27. Other Enhancements
    • Custom implementation for small filters
    • Improved filter combination/traversal
    • Faceted Search Optimizations
    • Caching term docs for smaller content sets
  • 28. Scaling Lucene: Take-aways
    • Scaling Horizontally is good, but not Linear
    • Consider decoupling Lucene from other services (may scale differently)
    • Legacy algorithms may not scale
    • Beware of Hit Counts
    • Full scale load testing is a must!
  • 29. Discussion
  • 30. Relevancy Tuning Using TREC Data Rohit Laungani – Senior Systems Analyst Ivan Provalov – Information Architect / Developer
  • 31.
    • “ For Google, the quality of search has always been about getting you the exact, most relevant answer you were looking for in the shortest amount of time.. These notions of relevance and speed have been baked into our product development and are always a top priority for us... In fact, at any given time we're conducting between 50 and 200 search experiments , all of which are focused on getting you the exact result you're looking for -- faster ”
    • Jack Menzel, Group Product Manager, Search @ Google
    Background - What are others doing?
  • 32.
    • “ Early this year, we saw a lot of evidence that people are getting much more sophisticated in their searching, asking Google to solve harder problems (for example, by making longer and more complex queries). For this reason, in 2009 alone we have released many improvements: nearly 500 ranking changes”
    • Jack Menzel, Group Product Manager, Search @ Google
    Background - What are others doing?
  • 33. IR Background – Precision/Recall
    • Recall - the fraction of relevant documents that have been retrieved.
    • Precision - the fraction of retrieved documents that are relevant.
    (Venn diagram: the entire collection, the relevant docs R, the answer set A, and the relevant docs in the result set Ra; see the formulas below.)
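    In the notation of the diagram (with |X| denoting the size of set X):

        Precision = |Ra| / |A|
        Recall    = |Ra| / |R|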
  • 34. IR Background - Relevance
  • 35. IR Background – Recall-Precision Graph
  • 36. Mean Average Precision (MAP)
    • Average of the precision values at the points at which each relevant document is retrieved.
    • Example ranking: R N N N R N N R R R
    • Calculate precision at each point in the ranking where we find a relevant document, then average these values: 1/5 * (1/1 + 2/5 + 3/8 + 4/9 + 5/10) = 0.544
    • To get a single-number measure across N queries (say 50), average the N per-query average-precision values (see the sketch below).
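    A minimal sketch of the average-precision calculation above; MAP over a query set is then the mean of these per-query values:

        public class AveragePrecision {
            // Average the precision values at the ranks where relevant documents appear.
            static double averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
                double sum = 0.0;
                int found = 0;
                for (int rank = 1; rank <= relevantAtRank.length; rank++) {
                    if (relevantAtRank[rank - 1]) {
                        found++;
                        sum += (double) found / rank;   // precision at this rank
                    }
                }
                return sum / totalRelevant;
            }

            public static void main(String[] args) {
                // R N N N R N N R R R, as on the slide
                boolean[] run = {true, false, false, false, true,
                                 false, false, true, true, true};
                System.out.println(averagePrecision(run, 5));   // ~0.544
            }
        }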
  • 37. Background -TREC
    • TREC: Text REtrieval Conference (http://trec.nist.gov/)
    • Annual conference since 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA.
    • Aims to improve evaluation methods and measures in IR by encouraging research on relatively large test collections across a variety of datasets
    • TREC workshops consist of a set of tracks: areas of focus in which particular retrieval tasks are defined
    • Each track creates the necessary infrastructure (test collections, evaluation methodology, etc.) to support research on its tasks
    • Participants submit P/R values for the final document and query corpus and present their results at the conference
  • 38. Sample TREC query
    <top>
    <num> Number: 305
    <title> Most Dangerous Vehicles
    <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles?
    <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
    </top>
    Relevant document IDs: LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
  • 39. <DOCNO> LA031689-0177 </DOCNO> <DOCID> 31701 </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the &quot;Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities,&quot; Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... </TEXT> <GRAPHIC><P> Photo, The Ford Bronco II &quot;appears to have a higher number of single-vehicle, first event roll-overs,&quot; a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC> Sample TREC Document
  • 40. Project Goals
    • Baseline Current Search Platform’s Quality of Retrieval
    • Create Integration Tests for the Platform and Run These Regularly
    • Evaluate Chinese-based analyzers
  • 41. Methodology
    • TREC Datasets (English and Chinese)
    • IBM Paper 2007 TREC
      • Lexical Affinities
      • Stemming
      • Pivot Length Normalization
      • Sweet Spot Similarity
      • Term Frequency Average Normalization
    • BM25
    • Pseudo Relevance Feedback
  • 42. Pivoted Length Normalization (graph: relevance vs. document length, showing the pivot point)
  • 43. Pivoted Length Normalization
    • Boosts shorter documents, “punishes” longer ones
    • U is the number of unique words in the document
    • Pivot is the average of U over all documents
    • Length normalization:
    • 1 / ((1 - slope) + slope * U / pivot) (see the worked example after this list)
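    A worked instance of the formula above, assuming a slope of 0.25 and a pivot of 200 unique terms (both values are illustrative, not from the deck):

        public class PivotedLengthNorm {
            static float lengthNorm(int uniqueTerms, float pivot, float slope) {
                return 1f / ((1f - slope) + slope * uniqueTerms / pivot);
            }

            public static void main(String[] args) {
                System.out.println(lengthNorm(100, 200f, 0.25f));   // ~1.14: shorter doc boosted
                System.out.println(lengthNorm(400, 200f, 0.25f));   // 0.8: longer doc "punished"
            }
        }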
  • 44. Example of Pivoted Length Normalization
  • 45. Pseudo Relevance Feedback
    • Retrieve top 5-10 documents for a user query
    • Make the assumption these are relevant documents
    • Retrieve a few terms from these documents and expand the original query (see the sketch after this list):
      • Terms with high term frequencies
      • Terms with low document frequencies
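    A minimal sketch of the expansion-term selection, using the Lucene 3.x term-vector API; the speaker notes give defaults of min term frequency 2 and min document frequency 5, while the sort by ascending document frequency and the term-vector requirement are assumptions:

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Comparator;
        import java.util.LinkedHashSet;
        import java.util.List;
        import java.util.Set;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermFreqVector;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;

        public class PseudoRelevanceFeedback {
            // Collect candidate expansion terms from the top-ranked documents:
            // frequent within the feedback docs, but not too common overall.
            public static Set<String> expansionTerms(IndexSearcher searcher, Query query,
                    String field, int feedbackDocs, int maxTerms) throws IOException {
                IndexReader reader = searcher.getIndexReader();
                List<String[]> candidates = new ArrayList<String[]>();   // {term, df}
                TopDocs top = searcher.search(query, feedbackDocs);      // assume top docs are relevant
                for (ScoreDoc sd : top.scoreDocs) {
                    TermFreqVector tfv = reader.getTermFreqVector(sd.doc, field);
                    if (tfv == null) continue;                           // field must store term vectors
                    String[] text = tfv.getTerms();
                    int[] freqs = tfv.getTermFrequencies();
                    for (int i = 0; i < text.length; i++) {
                        int df = reader.docFreq(new Term(field, text[i]));
                        if (freqs[i] >= 2 && df >= 5) {                  // defaults from the notes
                            candidates.add(new String[] {text[i], Integer.toString(df)});
                        }
                    }
                }
                Collections.sort(candidates, new Comparator<String[]>() {
                    public int compare(String[] a, String[] b) {         // prefer rarer terms
                        return Integer.valueOf(a[1]).compareTo(Integer.valueOf(b[1]));
                    }
                });
                Set<String> expansion = new LinkedHashSet<String>();
                for (String[] c : candidates) {
                    if (expansion.size() >= maxTerms) break;
                    expansion.add(c[0]);
                }
                return expansion;   // OR these terms into the original query
            }
        }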
  • 46. Runs Examples
  • 47. Results - English
    • Default Lucene – 0.149 on TREC-3 Collection (comparable with IBM findings)
    • Stemmer – 0.202
    • LA & Stemmer & Phrase – 0.21
    • BM25 – 0.168
    • Sweet Spot Similarity – 0.173
    • Pivoted Length Normalization – 0.184
    • Pivoted Length & Term Frequency Normalization – 0.186
    • Lucene With Porter Stemmer, Pivot Point Document Length Normalization and Pseudo Relevance Feedback (PRF) – 0.30 (100% improvement)
  • 48. Results - Chinese
    • Chinese Paoding Analyzer – 0.444 (x10 improvement over default after applying Position Filter fix as well as the PRF)
    • Rosette Chinese Analyzer – 0.393 (x10 improvement after applying the above techniques)
  • 49. Conclusions
    • Lucene default relevance ranking performs well, but could be tuned further
    • Simple techniques sometimes work best for relevance improvements (stemming, PRF)
    • Open-source analyzers perform as well as commercial ones on the Chinese corpus (Paoding)
  • 50. References
    • TREC http://trec.nist.gov/tracks.html
    • Lucene http://lucene.apache.org/java/docs/
    • Introduction to IR http://nlp.stanford.edu/IR-book/information-retrieval-book.html
    • IBM TREC 2007 http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
    • Relevance http://en.wikipedia.org/wiki/Relevance_(information_retrieval)
  • 51. Discussion
  • 52. Collections
    • English
      • TREC 3 Ad Hoc Topics
      • TIPSTER Disk 1 and Disk 2
      • ClueWeb09 “B”
    • Chinese
      • TREC5, TREC6 Topics
      • People's Daily newspaper
      • Xinhua newswire