Query Latency Optimization with Lucene

Query Latency Optimization
Stefan Pohl
stefan.pohl@here.com

Sr. Research Engineer, Ph.D.

Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
● Spare time: Lucene contributor

7 Nov 2013


Agenda
● Motivation
● Latency Optimization
● Query Processing / Scoring
● Recent Developments in Lucene

Motivation: Query Latency
● Human Reaction Time: 200 ms *
→ Backend latency: << 200 ms
● Faster queries mean higher manageable load
● Costs

* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

Motivation: Query Latency Distribution


Latency Optimization


First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


Mining Hypothesis
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
– Many new incoming docs to index?
– Other daemons spike in disk or CPU activity?
– Garbage Collections?
– Other sar statistics (e.g. paging)
● If yes, profile
– First, your code
– Don't instrument Lucene internal low-level classes

Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
– Give confidence
– Tell you (and others) what the gains are (ROI)

Example: In-memory
● Buy more memory / bigger machine !?
● Simulate ¹
– Consecutively execute the same query multiple times
– Much lower memory requirement (i.e. the size of the involved postings)
– Repeat for sample of queries of interest
● Gives lower bound on query latency

¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
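
The simulation idea can be sketched as a small timing helper. This is a minimal sketch under assumptions, not the measurement code from the cited paper: `WarmLatency`, `minLatencyNanos` and the `Runnable` that stands in for "execute one query" are hypothetical names. Running the same query back to back should leave its postings in the OS buffer cache, so the fastest run is one reasonable estimate of the in-memory latency.

```java
final class WarmLatency {
    /**
     * Runs the given query `repeats` times in a row and returns the fastest
     * wall-clock time in nanoseconds (a rough lower bound on its latency).
     */
    static long minLatencyNanos(Runnable query, int repeats) {
        if (repeats < 1) {
            throw new IllegalArgumentException("repeats must be >= 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < repeats; i++) {
            long start = System.nanoTime();
            query.run(); // hypothetical: wrap your own search call here
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }
}
```

Wrap your own search call in the `Runnable` and repeat the measurement for a sample of the queries you care about.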


Query Processing


Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq f_t
● Next() on sparsest posting list (“lead”)
● Advance(18) on next sparsest posting list → fail
● Start all over again with “lead”, but advance(22)
● Try to advance(31) on all other posting lists
● Match found → R = {31, …
● Next() on “lead” → R = {31}
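
The steps above can be sketched over plain sorted arrays. This is an illustrative sketch, not Lucene's scorer code: `ConjunctionSketch` and its toy `Postings` class are hypothetical, `advance()` is a linear scan where a real index would use skip data, and the posting lists in the usage note are made up (the slide's figure is not available).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy posting list: a sorted array of doc ids plus a cursor. */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        int docFreq() { return docs.length; }

        /** Current doc id; -1 before the first move, NO_MORE_DOCS when exhausted. */
        int doc() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int next() {
            if (pos < docs.length) pos++;
            return doc();
        }

        /** Moves forward to the first doc id >= target. */
        int advance(int target) {
            do {
                if (pos < docs.length) pos++;
            } while (pos < docs.length && docs[pos] < target);
            return doc();
        }
    }

    /** Intersects the clauses; assumes at least one clause is given. */
    static List<Integer> intersect(Postings... clauses) {
        Postings[] sorted = clauses.clone();
        // Sparsest list first: it becomes the "lead".
        Arrays.sort(sorted, Comparator.comparingInt(Postings::docFreq));
        Postings lead = sorted[0];
        List<Integer> result = new ArrayList<>();
        int candidate = lead.next();
        while (candidate != NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < sorted.length; i++) {
                Postings other = sorted[i];
                int doc = other.doc();
                if (doc < candidate) doc = other.advance(candidate);
                if (doc > candidate) {
                    // Fail: start over with the lead, skipping ahead to doc.
                    candidate = lead.advance(doc);
                    match = false;
                    break;
                }
            }
            if (match) {
                result.add(candidate);
                candidate = lead.next();
            }
        }
        return result;
    }
}
```

With the made-up lists `{18, 31}`, `{3, 22, 31, 40}` and `{1, 2, 18, 25, 31, 33, 50}`, the calls follow the same pattern as on the slides (advance(18) fails, the lead advances to 22 and lands on 31, all others advance to 31) and the result is `[31]`.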


Disjunctions (i.e. OR / Occur.SHOULD)
● Next() on all clauses
● Track clauses in min-heap → R = {2, …
● Next() on all previously matched clauses → R = {2,4, …
● Next() on all previously matched clauses → R = {2,4,5, …
● Repeat Next() until all posting lists are exhausted
→ R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
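
A minimal sketch of the heap-based merge described above, again over toy sorted arrays rather than Lucene's actual scorer classes. `DisjunctionSketch` and `Postings` are hypothetical names, and the lists in the usage note are made up. Every posting of every clause passes through the heap once, which is where the O(n log |C|) cost of disjunctive processing comes from.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy posting list: a sorted array of doc ids plus a cursor. */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        /** Current doc id; only valid after the first next(). */
        int doc() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }

        int next() {
            if (pos < docs.length) pos++;
            return doc();
        }
    }

    static List<Integer> union(Postings... clauses) {
        // Min-heap ordered by each clause's current doc id.
        PriorityQueue<Postings> heap =
            new PriorityQueue<>(Comparator.comparingInt(Postings::doc));
        for (Postings p : clauses) {
            if (p.next() != NO_MORE_DOCS) heap.add(p); // Next() on all clauses
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            result.add(doc);
            // Next() on all clauses that matched this doc, then re-insert them.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Postings p = heap.poll();
                if (p.next() != NO_MORE_DOCS) heap.add(p);
            }
        }
        return result;
    }
}
```

For example, the made-up lists `{2, 5, 9}`, `{4, 5, 12}` and `{2, 7}` merge to `[2, 4, 5, 7, 9, 12]`.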


Why Can Query Processing Be Slow?
● Disjunctive Processing: O(n log |C|)
– High DF terms (large n)
– Many terms (large |C|), e.g. query expansion
– No / too little use of advance()
● Filter (over-use)

Filter
● Aims:
– (Pre-)computation of common sub-queries
– Cache result
– Don't influence scoring
● Limitation
– Additional cost for 1st query
– Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster
e.g. http://java.dzone.com/news/fast-lucene-search-filters

Stopword Removal
● Removal of High-DocFreq terms from
– Index: 10-30% space saving
– Query: no very expensive terms
● Limitation:
– “To be or not to be”
● In general, don't do it

Minor, But Easy Improvements
● Reduce information, increase locality:
– Don't store TF, if it's almost always 1 (and you don't need positions),
fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
– Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune Space/Time/Quality tradeoffs:
– DirectDocValues
– Less complex scoring function

Recent Developments within Lucene

MinShouldMatch (Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce at least mm>1 terms to be present !
● Synthetic example query used during dev:

Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k

E.g. mm=2:
Conjunctive Processing: advance()
Disjunctive Processing: next()
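
To make the meaning of mm concrete, here is a naive counting sketch: a doc is kept only if at least mm clauses contain it. It illustrates the semantics only. It still calls next() on every posting, so it does not show the advance()-based skipping that Lucene-4571 is about. `MinShouldMatchSketch` and `atLeast` are hypothetical names, and each input array is assumed to be strictly increasing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    /** Returns the doc ids that occur in at least mm of the sorted posting lists. */
    static List<Integer> atLeast(int mm, int[]... lists) {
        // Heap entries are {docId, listIndex, positionInList}, ordered by docId.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) heap.add(new int[] {lists[i][0], i, 0});
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek()[0];
            int matches = 0;
            // Pop every clause positioned on this doc and move it forward.
            while (!heap.isEmpty() && heap.peek()[0] == doc) {
                int[] top = heap.poll();
                matches++;
                int[] list = lists[top[1]];
                int nextPos = top[2] + 1;
                if (nextPos < list.length) {
                    heap.add(new int[] {list[nextPos], top[1], nextPos});
                }
            }
            if (matches >= mm) result.add(doc);
        }
        return result;
    }
}
```

With one long made-up list `{1, 2, 3, 4, 5, 6}` and three short ones `{2, 9}`, `{4, 9}`, `{6, 7}`, mm=2 keeps `[2, 4, 6, 9]`, while mm=1 degenerates to the plain disjunction.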


MinShouldMatch (Lucene-4571)

DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M

MinShouldMatch – Results (Lucene-4571)

MinShouldMatch – Open Questions (Lucene-4571)
● How bad is it to exclude docs that match only one, but important, term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'?
(But be careful: “To Be Or Not To Be”)

ReqOptSumScorer
● Benefit:
– Conjunctive processing on required clauses
– Calls advance() on optional clauses
● How do you determine which clauses are required?
– Lookup term statistics (i.e. DocFreq)
– 2nd lookup unnecessary, if you hand over stats to query

CommonTermsQuery (≥ 4.1) (Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
– Low-freq: At least one low-freq term MUST occur in result doc
– High-freq: SHOULD occur in doc → their presence adds to score
● Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary !
● Also supports MinShouldMatch
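
The categorization step can be sketched as a simple split on document frequency. This is only an illustration of the idea, not the CommonTermsQuery implementation: `TermCategorizer`, `categorize` and the `cutoffFraction` parameter are hypothetical, and the rule "high-freq if DocFreq exceeds a fraction of the index size" is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermCategorizer {
    /**
     * Splits query terms into "low" and "high" frequency groups.
     * A term counts as high-freq if its docFreq exceeds cutoffFraction * numDocs
     * (an assumed rule for this sketch).
     */
    static Map<String, List<String>> categorize(
            Map<String, Integer> docFreqs, int numDocs, double cutoffFraction) {
        List<String> low = new ArrayList<>();
        List<String> high = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() > cutoffFraction * numDocs) {
                high.add(e.getKey());
            } else {
                low.add(e.getKey());
            }
        }
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("low", low);   // at least one of these MUST occur
        groups.put("high", high); // these SHOULD occur and only add to the score
        return groups;
    }
}
```

Because the document frequencies are already in hand after this step, they can be passed along to query execution instead of being looked up a second time.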


Cost-Model (≥ 4.3) (Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of returned #docs (docfreq)
– Disjunctions: Σ_{c∈C} df_c
– Conjunctions: min_{c∈C} df_c
● Limitations:
– Effort to generate returned docs?
– Only one cost (next() vs. advance())
● Open Question:
– Can we do better with more detailed cost models?
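
The two estimates compose recursively over a query tree. The sketch below applies them to the example +(a b) +c. `CostSketch`, `Node`, `term`, `or` and `and` are hypothetical names, and the document frequencies in the usage note are made up.

```java
import java.util.Arrays;

final class CostSketch {
    /** A query node that can report a worst-case estimate of matching docs. */
    interface Node {
        long cost();
    }

    /** A term costs its document frequency. */
    static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: sum of the clause costs. */
    static Node or(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).sum();
    }

    /** Conjunction: minimum of the clause costs (0 if there are no clauses). */
    static Node and(Node... clauses) {
        return () -> Arrays.stream(clauses).mapToLong(Node::cost).min().orElse(0L);
    }
}
```

For +(a b) +c with assumed frequencies df_a = 1000, df_b = 500 and df_c = 200, `and(or(term(1000), term(500)), term(200)).cost()` gives min(1500, 200) = 200, so c would be the cheaper clause to lead with.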


Maxscore Top-k Scoring Algorithm ¹ (Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
– Requires final run over whole index (i.e. only for static indexes)

¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

Index Sorting (≥ 4.3) (Lucene-4752)
● Advantages (if appropriate sort order chosen)
– Better compression → more locality → faster processing
– Early termination
● Use together with EarlyTerminatingSortingCollector
– Can terminate scoring within sorted segments
– Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details
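
The early-termination idea can be illustrated with a toy top-k collector. This is a sketch under assumptions, not the EarlyTerminatingSortingCollector code: `EarlyTerminationSketch` and `collectSegment` are hypothetical, each segment is modelled as an array of the ranking values of its matching docs in index order, and a sorted segment is assumed to be in descending order of that value.

```java
import java.util.PriorityQueue;

final class EarlyTerminationSketch {
    /**
     * Feeds one segment's hits into a shared top-k min-heap and returns how
     * many hits were visited. In a sorted segment the first k hits are already
     * that segment's best k, so collection can stop there.
     */
    static int collectSegment(int[] hitValues, boolean segmentSorted, int k,
                              PriorityQueue<Integer> topK) {
        int visited = 0;
        for (int value : hitValues) {
            if (segmentSorted && visited >= k) {
                break; // early termination within a sorted segment
            }
            visited++;
            topK.add(value);
            if (topK.size() > k) {
                topK.poll(); // drop the current minimum
            }
        }
        return visited;
    }
}
```

With k = 2, a sorted segment `{90, 80, 70, 60, 50}` is abandoned after 2 hits, while an as-yet unsorted segment `{10, 95, 20}` has to be visited in full.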

Parallelization
● In general, sharding is better:
– Shared-nothing
– Better use cores for handling load
● Multi-threaded query execution:
– Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
– Dynamic indexes: Lucene-2840, Lucene-5299

Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome

We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.

Come work with us.
Get in touch!

developer.here.com/geocoder

Thank You!
Contact
Email : stefan.pohl@here.com
Web : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan

