Query Latency Optimization with Lucene

Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business

Besides the quality of results, the time from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use-case.

Query Latency Optimization with Lucene: Presentation Transcript

  • Query Latency Optimization
    Stefan Pohl, Ph.D., Sr. Research Engineer, stefan.pohl@here.com
  • Who Am I
    ● Search user, developer, researcher
    ● Many years in industry & academia
    ● Ph.D. in Information Retrieval
    ● Interests: Search, Big Data, Machine Learning
    ● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
    ● Spare time: Lucene contributor
    7 Nov 2013 Query Latency Optimization with Lucene
  • Agenda
    ● Motivation
    ● Latency Optimization
    ● Query Processing / Scoring
    ● Recent Developments in Lucene
  • Motivation: Query Latency
    ● Human reaction time: 200 ms* → backend latency: << 200 ms
    ● Faster queries mean higher manageable load
    ● Costs
    * Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
  • Motivation: Query Latency Distribution
  • Latency Optimization
  • First: Do Your Homework
    ● Keep enough RAM for OS (disk buffer cache)
    ● Reduce HDD “pressure” (e.g. throttle indexing)
    ● SSDs
    ● Warming
    ● Ideally: your index fits in memory
    See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  • Mining Hypothesis
    ● Check if query latencies are reproducible
    ● If not, try to find correlations with system events:
        – Many new incoming docs to index?
        – Other daemons spike in disk or CPU activity?
        – Garbage Collections?
        – Other sar statistics (e.g. paging)
    ● If yes, profile
        – First, your code
        – Don't instrument Lucene internal low-level classes
  • Hypothesis Testing
    ● You really think you understand the problem and have a potential solution?
    ● Try it out (if it's cheap)!
    ● Otherwise, think of (cheap) experiments that
        – Give confidence
        – Tell you (and others) what the gains are (ROI)
  • Example: In-memory
    ● Buy more memory / bigger machine!?
    ● Simulate¹
        – Consecutively execute the same query multiple times
        – Much lower memory requirement (i.e. the size of the involved postings)
        – Repeat for sample of queries of interest
    ● Gives lower bound on query latency
    ¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
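The simulation idea above can be sketched as a small timing helper. This is a minimal sketch, not code from the talk: `WarmLatencySketch`, `minWarmNanos`, and the `Runnable` standing in for a query are hypothetical names. It assumes the first, untimed run pulls the involved postings into memory, so the fastest of the repeated runs approximates the in-memory lower bound.

```java
final class WarmLatencySketch {
    /**
     * Runs the same query several times and returns the fastest timed run in nanoseconds.
     * The first run is executed but not timed (assumed to warm the caches).
     */
    static long minWarmNanos(Runnable query, int repeats) {
        if (repeats < 2) {
            throw new IllegalArgumentException("need one warm-up run plus at least one timed run");
        }
        query.run(); // warm-up: loads the involved postings, result is discarded
        long best = Long.MAX_VALUE;
        for (int i = 1; i < repeats; i++) {
            long start = System.nanoTime();
            query.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }
}
```

In practice the `Runnable` would wrap a real search call, and the measurement would be repeated for a sample of the queries of interest.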
  • Query Processing
  • Conjunctions (i.e. AND / Occur.MUST)
    ● Sort Boolean clauses by increasing DocFreq $f_t$
    ● Next() on sparsest posting list (“lead”)
    ● Advance(18) on next sparsest posting list → fail
    ● Start all over again with “lead”, but advance(22)
    ● Try to advance(31) on all other posting lists
    ● Match found → R = {31}
    ● Next() on “lead” → R = {31}
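The conjunction walkthrough above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Lucene's implementation: `PostingList` and `ConjunctionSketch` are hypothetical stand-ins, posting lists are sorted `int` arrays, and `advance()` scans linearly where a real index would use skip data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a postings iterator over sorted doc IDs. */
final class PostingList {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    PostingList(int... docs) { this.docs = docs; }

    int docFreq() { return docs.length; }

    /** Moves to the next doc ID, or NO_MORE_DOCS when exhausted. */
    int next() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Moves to the first doc ID >= target (stays put if already there). */
    int advance(int target) {
        if (pos < 0) pos = 0;
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

final class ConjunctionSketch {
    /** Intersects all clauses, driven by the sparsest list ("lead"). */
    static List<Integer> intersect(PostingList... clauses) {
        PostingList[] sorted = clauses.clone();
        Arrays.sort(sorted, Comparator.comparingInt(PostingList::docFreq));
        List<Integer> result = new ArrayList<>();
        if (sorted.length == 0) return result;
        PostingList lead = sorted[0];
        int doc = lead.next();
        while (doc != PostingList.NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < sorted.length; i++) {
                int other = sorted[i].advance(doc);
                if (other != doc) {
                    // Mismatch: start over with the lead, skipping to the larger candidate.
                    doc = lead.advance(other);
                    match = false;
                    break;
                }
            }
            if (match) {
                result.add(doc);
                doc = lead.next();
            }
        }
        return result;
    }
}
```

With the made-up lists `{18, 22, 31}`, `{4, 20, 31, 37}` and `{2, 5, 18, 22, 31, 32}`, the sketch rejects 18 and 22 and reports 31 as the only match, mirroring the sequence of steps on the slides (the slides' actual posting lists are not in the transcript).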
  • Disjunctions (i.e. OR / Occur.SHOULD)
    ● Next() on all clauses
    ● Track clauses in min-heap → R = {2}
    ● Next() on all previously matched clauses → R = {2,4}
    ● Next() on all previously matched clauses → R = {2,4,5}
    ● Next() → R = {2,4,5,7}
    ● Repeated Next() calls add one doc ID at a time (9, 11, 12, 16, 18, 20, 22, 26, 27, 29, 31, 32, 37)
    ● Final result → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
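The heap-based disjunction above can be sketched as a k-way merge. This is a minimal sketch under assumptions, not Lucene's scorer: `DisjunctionSketch` and `union` are hypothetical names, and posting lists are plain sorted `int` arrays. Every posting passes through the heap once, which is where a cost of roughly n log |C| comes from (n postings in total, |C| clauses).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {
    /** Returns the sorted union of doc IDs from several sorted posting lists. */
    static List<Integer> union(int[]... postings) {
        // Heap entries are {listIndex, position}, ordered by the doc ID they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(postings[a[0]][a[1]], postings[b[0]][b[1]]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) heap.add(new int[] {i, 0});
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] first = heap.peek();
            int doc = postings[first[0]][first[1]];
            result.add(doc);
            // Call "next()" on every clause currently positioned on this doc.
            while (!heap.isEmpty()) {
                int[] top = heap.peek();
                if (postings[top[0]][top[1]] != doc) break;
                heap.poll();
                if (top[1] + 1 < postings[top[0]].length) {
                    heap.add(new int[] {top[0], top[1] + 1});
                }
            }
        }
        return result;
    }
}
```

Unlike the conjunction case, nothing here can skip ahead: every posting of every clause is visited, which is why high-DocFreq terms and long queries make disjunctions expensive.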
  • Why Can Query Processing Be Slow?
    ● Disjunctive Processing: O(n log |C|)
        – High DF terms (large n)
        – Many terms (large |C|), e.g. query expansion
        – No / too little use of advance()
    ● Filter (over-use)
  • Filter
    ● Aims:
        – (Pre-)computation of common sub-queries
        – Cache result
        – Don't influence scoring
    ● Limitation
        – Additional cost for 1st query
        – Currently, no skip information generated
          → Adding filter as a conjunct to queries can sometimes be faster
          e.g. http://java.dzone.com/news/fast-lucene-search-filters
  • Stopword Removal
    ● Removal of High-DocFreq terms from
        – Index: 10-30% space saving
        – Query: no very expensive terms
    ● Limitation:
        – “To be or not to be”
    ● In general, don't do it
  • Minor, But Easy Improvements
    ● Reduce information, increase locality:
        – Don't store TF, if it's almost always 1 (and you don't need positions), fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
        – Use BlockPostingsFormat (default in Lucene ≥ 4.1)
    ● Tune Space/Time/Quality tradeoffs:
        – DirectDocValues
        – Less complex scoring function
  • Recent Developments within Lucene
  • MinShouldMatch (Lucene-4571)
    ● Don't want matches on only one (stop-)word?
    ● Enforce at least mm>1 terms to be present!
    ● Synthetic example query used during dev:
        Terms:    ref   restored  struck  wings  dublin
        DocFreq:  3.8M  32k       32k     32k    32k
      E.g. mm=2:
        Conjunctive Processing: advance()
        Disjunctive Processing: next()
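The semantics of the mm constraint can be sketched as follows. This only illustrates which docs qualify. It is a naive heap merge that counts matching clauses per doc and does not implement the advance()-based skipping from Lucene-4571. `MinShouldMatchSketch` and `matchAtLeast` are hypothetical names, and posting lists are sorted `int` arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    /** Returns doc IDs that occur in at least mm of the given sorted posting lists. */
    static List<Integer> matchAtLeast(int mm, int[]... postings) {
        // Heap entries are {listIndex, position}, ordered by the doc ID they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(postings[a[0]][a[1]], postings[b[0]][b[1]]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) heap.add(new int[] {i, 0});
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] first = heap.peek();
            int doc = postings[first[0]][first[1]];
            int matches = 0;
            // Count and step every clause positioned on this doc.
            while (!heap.isEmpty()) {
                int[] top = heap.peek();
                if (postings[top[0]][top[1]] != doc) break;
                heap.poll();
                matches++;
                if (top[1] + 1 < postings[top[0]].length) {
                    heap.add(new int[] {top[0], top[1] + 1});
                }
            }
            if (matches >= mm) result.add(doc);
        }
        return result;
    }
}
```

With mm = 1 this degenerates to a plain disjunction; with mm equal to the number of clauses it behaves like a conjunction, which is why values in between leave room to replace some next() calls by advance().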
  • MinShouldMatch (Lucene-4571)
        DocFreq:     3.8M  32k       32k     32k    32k
        HighDF 1/5:  ref   restored  struck  wings  dublin
        HighDF 2/5:  ref   http      struck  wings  dublin
        HighDF 3/5:  ref   http      from    wings  dublin
        HighDF 4/5:  ref   http      from    name   dublin
        HighDF 5/5:  ref   http      from    name   title
        DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M
  • MinShouldMatch – Results (Lucene-4571)
  • MinShouldMatch – Open Questions (Lucene-4571)
    ● How bad is it to exclude docs that only match one, but an important term?
    ● Why is it enough to match any mm terms?
    ● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
  • ReqOptSumScorer
    ● Benefit:
        – Conjunctive processing on required clauses
        – Calls advance() on optional clauses
    ● How do you determine which clauses are required?
        – Lookup term statistics (i.e. DocFreq)
        – 2nd lookup unnecessary, if you hand over stats to query
  • CommonTermsQuery (≥ 4.1) (Lucene-4628)
    ● Looks up term infos (docfreq, posting list offset)
    ● Categorizes query terms as
        – Low-freq: At least one low-freq term MUST occur in result doc
        – High-freq: SHOULD occur in doc → their presence adds to score
    ● Executes query, but hands over term statistics → no 2nd round of term lookups necessary!
    ● Also supports MinShouldMatch
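The categorization step can be illustrated without Lucene. This is a minimal sketch, not the CommonTermsQuery code: `TermCategorizerSketch`, `splitByFrequency`, the index size, and the cutoff value are all assumptions made for illustration. A term counts as high-freq here when its DocFreq exceeds a given fraction of the index size.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TermCategorizerSketch {
    /**
     * Splits query terms by document frequency.
     * Key true holds high-freq terms (docFreq > cutoff * numDocs), key false the low-freq ones.
     */
    static Map<Boolean, List<String>> splitByFrequency(
            Map<String, Long> docFreqs, long numDocs, double cutoff) {
        return docFreqs.entrySet().stream().collect(
            Collectors.partitioningBy(
                e -> e.getValue() > cutoff * numDocs,
                Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
    }
}
```

Applied to the synthetic query from the MinShouldMatch slide (ref at 3.8M docs, the other four terms at 32k each) with an assumed index of 10M docs and a cutoff of 0.1, only "ref" lands in the high-freq group, so it would be treated as optional while the rare terms drive the matching.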
  • Cost-Model (≥ 4.3) (Lucene-4607)
    ● What about structured queries? E.g. +(a b) +c
    ● Currently: worst-case estimate of returned #docs (docfreq)
        – Disjunctions: $\sum_{c \in C} df_c$
        – Conjunctions: $\min_{c \in C} df_c$
    ● Limitations:
        – Effort to generate returned docs?
        – Only one cost (next() vs. advance())
    ● Open Question:
        – Can we do better with more detailed cost models?
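The worst-case estimate above can be sketched as a recursive function over the query tree. This is an illustration of the two rules only (sum over disjunction clauses, minimum over conjunction clauses), not Lucene's code: `CostSketch`, `term`, `or`, and `and` are hypothetical names, and the DocFreq values in the usage note are made up.

```java
import java.util.Arrays;

final class CostSketch {
    /** A query node that can report a worst-case number of returned docs. */
    interface Node { long cost(); }

    /** Leaf: the cost of a term is its document frequency. */
    static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: at worst, no clause overlaps with another, so costs add up. */
    static Node or(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).sum();
    }

    /** Conjunction: cannot return more docs than its sparsest clause. */
    static Node and(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).min().orElse(0L);
    }
}
```

For +(a b) +c with assumed DocFreqs a = 1000, b = 200, c = 50, `and(or(term(1000), term(200)), term(50)).cost()` evaluates to min(1200, 50) = 50, so c would be the natural lead clause.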
  • Maxscore Top-k Scoring Algorithm¹ (Lucene-4100)
    ● Experimental prototype code attached to Lucene-4100
    ● Limitation:
        – Requires final run over whole index (i.e. only for static indexes)
    ¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
  • Index Sorting (≥ 4.3) (Lucene-4752)
    ● Advantages (if appropriate sort order chosen)
        – Better compression → more locality → faster processing
        – Early termination
    ● Use together with EarlyTerminatingSortingCollector
        – Can terminate scoring within sorted segments
        – Fully scores as-yet unsorted segments
    → see 2nd half of Shai & Adrian's talk yesterday for details
  • Parallelization
    ● In general, sharding is better:
        – Shared-nothing
        – Better use cores for handling load
    ● Multi-threaded query execution:
        – Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
        – Dynamic indexes: Lucene-2840, Lucene-5299
  • Summary
    ● Understand your problem
    ● Scoring can become an issue with many million docs
    ● Many recent efficiency improvements
    ● More to come... patches welcome
  • We're Hiring @HERE
    Frankfurt, Berlin, Boston, Chicago. Come work with us. Get in touch!
    developer.here.com/geocoder
  • Thank You!
    Contact
    Email: stefan.pohl@here.com
    Web: http://linkedin.com/in/stefanpohl
    Twitter: @pohlstefan
    developer.here.com/geocoder