Lucene and Solr Experts Round Table

Apache Lucene EuroCon brings together the experts driving innovation in Open Source Search. Check out this special one-hour round table presentation on compelling innovations in Lucene/Solr search....

Published in: Technology
  • Links from resources slide updated:
    http://www.lucidimagination.com/devzone/technical-articles/optimizing-findability-lucene-and-solr
    http://www.lucidimagination.com/devzone/technical-articles/debugging-search-application-relevance-issues

Lucene and Solr Experts Round Table

  1. Apache Lucene EuroCon: Preview (www.lucene-eurocon.org). Apache Lucene EuroCon, 20 May 2010
  2. Overview
     • Introduction
     • Near Real Time Search: Yonik Seeley
     • Munching & Crunching: Andrzej Białecki
     • Solr in the Cloud: Mark Miller
     • Practical Relevance: Grant Ingersoll
     • Q&A
     A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.
  3. Near Real Time Search (Yonik Seeley)
  4. Near Real-Time Search
     • Shorter times until updates are searchable/visible
     • Lucene 2.9 first laid the groundwork with per-segment searching
         • Per-segment FieldCache entries for sorting and FunctionQueries
     • NRT IndexWriter.getReader()
         • Makes new segments available before merging is done in the background
         • Doesn't cause a commit/fsync first
     • Solr still needs
         • Per-segment faceting
         • Per-segment caching
         • Per-segment statistics
         • (and anything else that uses FieldCache)
  5. Existing single-valued faceting algorithm
     • Request: q=Juggernaut&facet=true&facet.field=hero
     • [Diagram] Documents matching the base query "Juggernaut" are looked up in the Lucene FieldCache entry (StringIndex) for the "hero" field:
         • order: for each doc, an index into the lookup array
         • lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
         • accumulator: one counter per lookup entry, incremented for each matching doc
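
The counting loop on slide 5 can be sketched in a few lines of Python. This is only an illustration of the order/lookup/accumulator idea, not Solr's implementation: the function name `facet_counts`, the toy `order` array, and the `base_docs` list are all made up for the example.

```python
def facet_counts(order, lookup, base_docs):
    """Count facet values for the documents in base_docs.

    order[doc] is an index into lookup for that doc's value (0 means "no value").
    lookup[i] is the string value for ordinal i; lookup[0] is None.
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1  # the "increment" step from the diagram
    # Skip ordinal 0 (no value) and values that were never hit.
    return {lookup[i]: n for i, n in enumerate(accumulator) if i != 0 and n > 0}


# Hypothetical toy data: 8 documents, 5 of them match the base query.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 1, 0, 5, 2, 4, 2, 1]
base_docs = [0, 1, 4, 6, 7]

counts = facet_counts(order, lookup, base_docs)
print(counts)
```

The cost is one array read and one increment per matching document, which is why this approach is fast on a static index but needs the whole `order`/`lookup` entry rebuilt whenever the index changes.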
  6. Per-segment single-valued faceting algorithm
     • [Diagram] Segment1 to Segment4 each have their own FieldCache entry and accumulator (accumulator1 to accumulator4); thread1 to thread4 count the base DocSet against them
     • The per-segment FieldCache entries and accumulators feed a merger (priority queue) that produces the final counts, e.g. flash, 5 and batman, 3
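
A minimal sketch of the per-segment variant on slide 6, assuming each segment has its own sorted `lookup` array and that the base DocSet has already been split into per-segment document lists. The data layout (dicts with `order` and `lookup` keys) and all function names are assumptions for illustration; `heapq.merge` stands in for the priority-queue merger shown on the slide.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count one segment; returns (term, count) pairs sorted by term."""
    order, lookup = segment["order"], segment["lookup"]
    acc = [0] * len(lookup)
    for doc in base_docs:
        acc[order[doc]] += 1
    # lookup is sorted within a segment, so this list is sorted by term.
    return [(lookup[i], n) for i, n in enumerate(acc) if i != 0 and n > 0]


def merge_counts(per_segment_counts):
    """Merge the sorted per-segment lists, summing counts of equal terms."""
    merged = []
    for term, n in heapq.merge(*per_segment_counts):
        if merged and merged[-1][0] == term:
            merged[-1] = (term, merged[-1][1] + n)
        else:
            merged.append((term, n))
    return merged


def per_segment_facets(segments, base_docsets, threads=4):
    """Count each segment in its own task, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partial = list(pool.map(count_segment, segments, base_docsets))
    return merge_counts(partial)


# Hypothetical toy index with three segments.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
    {"lookup": [None, "batman", "flash"], "order": [1, 1, 2]},
]
base_docsets = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]

result = per_segment_facets(segments, base_docsets)
print(result)
```

Ordinals are only meaningful inside one segment, so the merge has to compare the term strings themselves. That extra step is the cost of being able to reuse the entries of unchanged segments.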
  7. Per-segment faceting
     • Enable with facet.method=fcs
     • Controllable multi-threading: facet.field={!threads=4}myfield
     • Disadvantages
         • Larger memory use (FieldCaches + accumulators)
         • Slower (extra FieldCache merge step needed)
     • Advantages
         • Rebuilds FieldCache entries only for new segments (NRT friendly)
         • Multi-threaded
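
For reference, the two parameters from slide 7 could be assembled into a request URL as sketched below. The host, port, and `/solr/select` path are assumptions about a local test install, and `facet_query_url` is a hypothetical helper; only `facet.method=fcs` and the `{!threads=4}` local parameter come from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def facet_query_url(base_url, query, field, threads=None):
    """Build a select URL that asks for per-segment faceting on one field."""
    # The thread count is passed as a local parameter in front of the field name.
    facet_field = f"{{!threads={threads}}}{field}" if threads else field
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.method", "fcs"),
        ("facet.field", facet_field),
    ]
    return f"{base_url}?{urlencode(params)}"


# Assumed local Solr endpoint; adjust to the actual deployment.
url = facet_query_url("http://localhost:8983/solr/select", "Juggernaut", "hero", threads=4)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

`urlencode` percent-escapes the braces and the `!`, which keeps the local-parameter syntax intact when the URL is sent over HTTP.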
  8. Per-segment faceting performance comparison
     Test index: 10M documents, 18 segments, single-valued field

     A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

        Time for request*        facet.method=fc   facet.method=fcs
        static index             3 ms              244 ms
        quickly changing index   1388 ms           267 ms

     B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

        Time for request*        facet.method=fc   facet.method=fcs
        static index             26 ms             34 ms
        quickly changing index   741 ms            94 ms

     *complete request time, measured externally
  9. Munching & Crunching: Lucene index post-processing and applications. Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  10. Munching & Crunching: Agenda
     • Post-processing
         • Splitting, merging, sorting, pruning
     • Tiered search
     • Bitwise search
     • Map-reduce indexing models
  11. Post-processing
     • Isn't it better to build it right from the start?
     • Some parameters are difficult to get right...
         • Minimizing index size while retaining search quality
         • Correcting impact of unexpected common words
         • Creating evenly-sized shards
     • ...perhaps impossible to get at all during indexing
         • Adding collection-wide factors not computed by Lucene (e.g. avg. length)
         • Optimizing top-N results for common queries
         • Fitting too large indexes in RAM
  12. Merging, splitting, sorting, pruning
     • Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter
     • Sorting postings by impact and "early termination" search
     • Index pruning:
         • What data to remove and how?
         • Pruning strategies
         • Challenges
  13. Tiered search
     • Assuming we CAN prune effectively, while maintaining good search quality...
     • [Diagram] A single search box over three index tiers: RAM (70% pruned), SSD (30% pruned), HDD (0% pruned); the diagram also carries a "?"
  14. Tiered search
     • Assuming we CAN prune effectively, while maintaining good search quality...
     • [Diagram] One search box per tier: search box 1 on RAM (70% pruned), search box 2 on SSD (30% pruned), search box 3 on HDD (0% pruned); the diagram also carries a "?"
  15. Bit-wise search
     • Given a bit pattern query: 1010 1001 0101 0001
     • Find best matching bit patterns in documents
     • Applications:
         • Fuzzy "fingerprinting"
         • De-duplication
         • Plagiarism detection
     • BitwiseSearcher and Solr BitwiseField design
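
One way to read "best matching bit patterns" is ranking by Hamming distance, i.e. the number of differing bits. The slide does not say which measure BitwiseSearcher uses, so the sketch below is an assumption, and the document ids and fingerprints are invented for the example.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query, fingerprints, top_n=3):
    """Return the top_n (doc_id, distance) pairs, closest first."""
    scored = sorted(
        (hamming_distance(query, fp), doc_id) for doc_id, fp in fingerprints.items()
    )
    return [(doc_id, dist) for dist, doc_id in scored[:top_n]]


# The query pattern from the slide, plus hypothetical document fingerprints.
query = 0b1010_1001_0101_0001
fingerprints = {
    "doc-a": 0b1010_1001_0101_0001,  # identical to the query
    "doc-b": 0b1010_1001_0101_0011,  # one bit flipped
    "doc-c": 0b0101_0110_1010_1110,  # every bit flipped
    "doc-d": 0b1010_1001_0000_0001,  # two bits cleared
}

matches = best_matches(query, fingerprints)
print(matches)
```

This linear scan is fine for a toy set. A search-engine implementation would need an index structure to avoid comparing the query against every document.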
  16. Massive indexing
     • Map-reduce indexing models
         • Google model
         • Nutch model
         • Modified Nutch model
         • Hadoop contrib/indexing model
     • Tradeoff analysis and recommendations
  17. Solr in the Cloud (Mark Miller)
  18. [Slide with no transcribed text]
  19. Some of the Complications?
     • Dealing with config files
     • Setting up high availability
     • Status of cluster
     • Reshaping/Rebalancing cluster
  20. Improvements: High Level Goals
     • Improve...
         • Shared/Central Config
         • High Availability and Fault Tolerance
         • Cluster Resizing/Rebalancing
         • Open/Standard ZK schema
         • Cluster status
  21. Enter Solr Cloud and ZooKeeper
     • ZooKeeper is basically a highly available distributed filesystem
     • Config and cluster state 'live' in ZooKeeper
     • Solr is alerted to changes in cluster state by ZK
     • Solr gets a built-in load balancing implementation that can read cluster state from ZK
     • Clients don't need to know about shards, or can choose logical shards
  22. What's Been Done So Far
     • A lot of 'base' work: ZooKeeper Mode
     • Shared/Central config
     • Built-in search-side fault tolerance
     • Very simple cluster status
  23. The Future?
     • Index-side fault tolerance
     • Cluster resizing/rebalancing/elasticity
     • More Solr/ZK tools?
     • Lots of other little fun improvements
  24. Practical Relevance (Grant Ingersoll). Apache Lucene EuroCon 2010, Prague, Czech Republic
  25. Why Tune Relevance?
     • Better search results = Less time searching, more time acting
     • Less time searching = Happier, more effective users
     • Happier, more effective users = $, €, £, Kč (earned/saved)
     • $, €, £, Kč (earned/saved) = Big fat raise for you!
  26. Testing Relevance
     • A/B testing
     • Log Analysis
     • Empirical
         • Top 50 queries, plus random sample
     • Ask
         • Ratings/Reviews
         • Focus Groups
     • Also: Ad Hoc, TREC, etc.
  27. Understand your…
     • Domain
         • Types of documents
         • Languages present
         • Document structures, metadata and other features
         • Lexical resources: jargon, synonyms, abbreviations...
         • Relationships between documents
     • Users
         • Sophistication/Expertise
         • Search and Discovery needs
         • Known Item vs. Keyword
     • Tolerance for Pain
         • Managers
         • Business Interests
         • Release cycles
         • Obsession in finding the one true relevance model (hint, it doesn't exist)
         • "explain() blindness"
  28. Phrases
     • Almost always a win to automatically add phrase query variations to all multiword queries
     • Even better to detect key phrases
     • In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
     • Using a large slop factor can simulate an AND query while rewarding close proximity
     • See also the ComplexPhraseQuery in contrib/queryparser
     • Consider SpanQuery and derivatives
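
As a rough illustration of the pf/ps advice on slide 28, the sketch below builds a Dismax request that searches two fields and adds a phrase boost with a generous slop. The field names `title` and `body`, the boost values, the endpoint URL, and the helper name are all assumptions; `defType`, `qf`, `pf`, and `ps` are the standard Dismax parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def dismax_phrase_url(base_url, user_query, qf, pf, ps):
    """Build a Dismax request that also boosts documents matching the phrase."""
    params = [
        ("defType", "dismax"),
        ("q", user_query),
        ("qf", qf),       # fields searched term by term
        ("pf", pf),       # fields where the whole query is tried as a phrase
        ("ps", str(ps)),  # slop allowed for that phrase match
    ]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical schema with "title" and "body" fields on an assumed local endpoint.
phrase_url = dismax_phrase_url(
    "http://localhost:8983/solr/select",
    "near real time search",
    qf="title^2 body",
    pf="title^5 body^2",
    ps=100,
)
phrase_params = parse_qs(urlsplit(phrase_url).query)
print(phrase_url)
```

With a slop as large as 100, most documents containing all four terms will earn some phrase boost, and documents where the terms sit close together earn the most. This is the "simulate an AND query while rewarding close proximity" effect described on the slide.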
  29. Resources
     • ACM SIGIR: http://sigir.org/
     • http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search
     • http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
     • Open Relevance Project: http://lucene.apache.org/openrelevance
  30. Q&A. SLIDES POSTED AT: BIT.LY/EXPERTS1
  31. Thank You
