Query Latency Optimization
Stefan Pohl
stefan.pohl@here.com

Sr. Research Engineer, Ph.D.
Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
● Spare time: Lucene contributor

7 Nov 2013

Query Latency Optimization with Lucene

2
Agenda
● Motivation
● Latency Optimization
● Query Processing / Scoring
● Recent Developments in Lucene

Motivation: Query Latency
● Human reaction time: 200 ms *
→ Backend latency: << 200 ms
● Faster queries mean a higher manageable load
● Costs

* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

Motivation: Query Latency Distribution

Latency Optimization

First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Mining Hypothesis
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
– Many new incoming docs to index?
– Other daemons spike in disk or CPU activity?
– Garbage Collections?
– Other sar statistics (e.g. paging)
● If yes, profile
– First, your code
– Don't instrument Lucene internal low-level classes

Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
– Give confidence
– Tell you (and others) what the gains are (ROI)

Example: In-memory
● Buy more memory / bigger machine!?
● Simulate ¹
– Consecutively execute the same query multiple times
– Much lower memory requirement (i.e. the size of the involved postings)
– Repeat for sample of queries of interest
● Gives lower bound on query latency

¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.

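The simulation above could be scripted roughly as follows. This is a minimal sketch, not a benchmark harness: `run_query` is a hypothetical callable standing in for whatever executes a query against your index, and the number of repeats is an arbitrary assumption.

```python
import time

def warm_latency(run_query, query, repeats=5):
    """Run the same query back to back; return (first_run, best_warm_run) in seconds.

    `run_query` is a hypothetical stand-in for your own search call.
    `repeats` must be at least 2 so that there is a warm run to report.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    # The first run may touch disk; later runs should find the involved
    # postings in the caches, so their minimum approximates the in-memory case.
    return timings[0], min(timings[1:])

def lower_bounds(run_query, queries, repeats=5):
    """Repeat the measurement for a sample of queries of interest."""
    return {q: warm_latency(run_query, q, repeats)[1] for q in queries}
```

Comparing the first and the warm timings per query gives a rough idea of how much an in-memory index could gain before any hardware is bought.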
Query Processing

Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq f_t
● Next() on sparsest posting list (“lead”)
● Advance(18) on next sparsest posting list → fail
● Start all over again with “lead”, but advance(22)
● Try to advance(31) on all other posting lists
● Match found → 31 goes into the result set R
● Next() on “lead” → R = {31}

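The conjunction steps can be sketched in a few lines. This is an illustration of the leapfrog idea, not Lucene's ConjunctionScorer: posting lists are plain sorted Python lists, advance() is emulated with binary search, and the three lists are hypothetical, chosen only so that the trace follows the steps listed (the figure with the original lists is not part of this transcript).

```python
from bisect import bisect_left

def intersect(posting_lists):
    """AND of sorted doc-id lists, driven by the sparsest list ("lead")."""
    lists = sorted(posting_lists, key=len)       # increasing DocFreq
    lead, others = lists[0], lists[1:]
    cursors = [0] * len(others)                  # one cursor per non-lead list
    result = []
    i = 0                                        # cursor into the lead list
    while i < len(lead):
        candidate = lead[i]
        for k, plist in enumerate(others):
            # advance(candidate): first entry >= candidate
            cursors[k] = bisect_left(plist, candidate, cursors[k])
            if cursors[k] == len(plist):
                return result                    # one clause is exhausted
            if plist[cursors[k]] != candidate:
                # fail: start over with the lead, advanced to the doc we landed on
                i = bisect_left(lead, plist[cursors[k]], i)
                break
        else:
            result.append(candidate)             # all clauses agree
            i += 1                               # next() on the lead
    return result

# Hypothetical posting lists (not taken from the slides' figure)
lead = [18, 31]
second = [4, 11, 22, 26, 31, 32]
third = [2, 5, 7, 9, 12, 16, 18, 20, 27, 29, 31, 37]
matches = intersect([third, lead, second])       # [31]
```

With these lists, advance(18) on the second list lands on 22, the lead is advanced to 22 and lands on 31, and 31 is then confirmed on all other lists.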
Disjunctions (i.e. OR / Occur.SHOULD)
● Next() on all clauses
● Track clauses in min-heap → first result: R = {2, …
● Next() on all previously matched clauses → R = {2,4, … and then R = {2,4,5, …
● Keep calling Next() on the matched clauses, adding one doc per step, until all clauses are exhausted
→ R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}

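A heap-based disjunction along the lines of these steps might look as follows. It is a sketch rather than Lucene's DisjunctionScorer: it uses Python's `heapq` as the min-heap, and the three posting lists are hypothetical, picked so that their union equals the result set R shown on the slides.

```python
import heapq

def union(posting_lists):
    """OR of sorted doc-id lists using a min-heap over the clauses."""
    # Heap entries: (current doc id, clause index, cursor into that clause)
    heap = [(plist[0], idx, 0) for idx, plist in enumerate(posting_lists) if plist]
    heapq.heapify(heap)
    result = []
    while heap:
        doc = heap[0][0]                         # smallest current doc id
        result.append(doc)
        # next() on every clause that is positioned on this doc
        while heap and heap[0][0] == doc:
            _, idx, cursor = heapq.heappop(heap)
            cursor += 1
            if cursor < len(posting_lists[idx]):
                heapq.heappush(heap, (posting_lists[idx][cursor], idx, cursor))
    return result

# Hypothetical posting lists (not taken from the slides' figure)
clauses = [
    [18, 31],
    [4, 11, 22, 26, 31, 32],
    [2, 5, 7, 9, 12, 16, 18, 20, 27, 29, 31, 37],
]
r = union(clauses)
```

Every posting of every clause passes through the heap once, which is where the O(n log |C|) cost of disjunctive processing comes from.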
Why Can Query Processing Be Slow?
● Disjunctive Processing: O(n log |C|), for n postings and |C| clauses
– High DF terms (large n)
– Many terms (large |C|), e.g. query expansion
– No / too little use of advance()
● Filter (over-use)

Filter
● Aims:
– (Pre-)computation of common sub-queries
– Cache result
– Don't influence scoring
● Limitation
– Additional cost for 1st query
– Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster
e.g. http://java.dzone.com/news/fast-lucene-search-filters

Stopword Removal
● Removal of High-DocFreq terms from
– Index: 10-30% space saving
– Query: no very expensive terms
● Limitation:
– “To be or not to be”
● In general, don't do it

Minor, But Easy Improvements
● Reduce information, increase locality:
– Don't store TF, if it's almost always 1 (and you don't need positions):
fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
– Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune Space/Time/Quality tradeoffs:
– DirectDocValues
– Less complex scoring function

Recent Developments within Lucene

MinShouldMatch (Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce at least mm>1 terms to be present!
● Synthetic example query used during dev:

Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k

E.g. mm=2:
Conjunctive Processing: advance()
Disjunctive Processing: next()

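One way to see why mm > 1 allows advance() on the expensive clauses: a document that matches at least mm of the |C| clauses must match at least one of the |C| − mm + 1 sparsest ones. The sketch below uses that observation. It is not the scorer from Lucene-4571, only an illustration on plain sorted lists, and the posting lists are hypothetical.

```python
import heapq
from bisect import bisect_left

def min_should_match(posting_lists, mm):
    """Doc ids that occur in at least `mm` of the sorted posting lists."""
    lists = sorted(posting_lists, key=len)       # increasing DocFreq
    mm = max(mm, 1)
    if mm > len(lists):
        return []
    split = len(lists) - mm + 1
    sparse, dense = lists[:split], lists[split:]
    # Disjunctive processing (next()) only over the sparse clauses
    heap = [(p[0], i, 0) for i, p in enumerate(sparse) if p]
    heapq.heapify(heap)
    dense_cursors = [0] * len(dense)
    result = []
    while heap:
        doc = heap[0][0]
        matched = 0
        while heap and heap[0][0] == doc:
            _, i, cursor = heapq.heappop(heap)
            matched += 1
            cursor += 1
            if cursor < len(sparse[i]):
                heapq.heappush(heap, (sparse[i][cursor], i, cursor))
        # Conjunctive-style processing (advance()) on the dense clauses
        for k, p in enumerate(dense):
            if matched >= mm:
                break
            dense_cursors[k] = bisect_left(p, doc, dense_cursors[k])
            if dense_cursors[k] < len(p) and p[dense_cursors[k]] == doc:
                matched += 1
        if matched >= mm:
            result.append(doc)
    return result

# Hypothetical posting lists (not taken from the slides)
clauses = [
    [18, 31],
    [4, 11, 22, 26, 31, 32],
    [2, 5, 7, 9, 12, 16, 18, 20, 27, 29, 31, 37],
]
at_least_two = min_should_match(clauses, 2)
all_three = min_should_match(clauses, 3)
```

With mm=2 the longest list is never iterated with next(); it is only probed at candidates produced by the two shorter lists.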
MinShouldMatch (Lucene-4571)

DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M

MinShouldMatch – Results (Lucene-4571)

MinShouldMatch – Open Questions (Lucene-4571)
● How bad is it to exclude docs that only match one, but an important term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'?
(But be careful: “To Be Or Not To Be”)

ReqOptSumScorer
● Benefit:
– Conjunctive processing on required clauses
– Calls advance() on optional clauses
● How do you determine which clauses are required?
– Lookup term statistics (i.e. DocFreq)
– 2nd lookup unnecessary, if you hand over stats to query

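The benefit can be illustrated with a small sketch of required/optional processing. This is not Lucene's ReqOptSumScorer: the required part is assumed to be already reduced to one sorted list of matching doc ids, and the scoring (one point per matching clause) is a made-up placeholder.

```python
from bisect import bisect_left

def req_opt_score(required, optional):
    """Visit only docs of `required`; optional clauses are advance()d to them."""
    cursors = [0] * len(optional)
    scored = []
    for doc in required:                         # required matches drive the loop
        score = 1.0                              # placeholder score of the required part
        for k, plist in enumerate(optional):
            # advance(doc) on the optional clause: skips everything before doc
            cursors[k] = bisect_left(plist, doc, cursors[k])
            if cursors[k] < len(plist) and plist[cursors[k]] == doc:
                score += 1.0                     # optional match only adds to the score
        scored.append((doc, score))
    return scored

# Hypothetical posting lists (not taken from the slides)
required = [18, 31]
optional = [
    [4, 11, 22, 26, 31, 32],
    [2, 5, 7, 9, 12, 16, 18, 20, 27, 29, 31, 37],
]
scores = req_opt_score(required, optional)
```

The long optional lists are never enumerated; they are touched only at the two docs the required clause produces.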
CommonTermsQuery (≥ 4.1) (Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
– Low-freq: At least one low-freq term MUST occur in result doc
– High-freq: SHOULD occur in doc → their presence adds to score
● Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary!
● Also supports MinShouldMatch

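The categorization step might be pictured as below. This is only a sketch of the idea, not the CommonTermsQuery implementation: the function name and the absolute cutoff of one million docs are assumptions for illustration, and the DocFreq values are those of the synthetic example query from the MinShouldMatch slide.

```python
def categorize_terms(doc_freqs, max_term_frequency):
    """Split query terms into (low_freq, high_freq) by document frequency.

    `max_term_frequency` is a hypothetical absolute cutoff: terms occurring
    in more docs than this are treated as high-frequency.
    """
    low_freq, high_freq = [], []
    for term, df in doc_freqs.items():
        if df > max_term_frequency:
            high_freq.append(term)               # SHOULD: only adds to the score
        else:
            low_freq.append(term)                # at least one of these MUST match
    return low_freq, high_freq

doc_freqs = {
    "ref": 3_800_000,
    "restored": 32_000,
    "struck": 32_000,
    "wings": 32_000,
    "dublin": 32_000,
}
low, high = categorize_terms(doc_freqs, max_term_frequency=1_000_000)
```

The statistics looked up for this decision are the same ones the query needs at execution time, which is why handing them over avoids a second term lookup.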
Cost-Model (≥ 4.3) (Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of returned #docs (docfreq)
– Disjunctions: $\sum_{c \in C} df_c$
– Conjunctions: $\min_{c \in C} df_c$
● Limitations:
– Effort to generate returned docs?
– Only one cost (next() vs. advance())
● Open Question:
– Can we do better with more detailed cost models?

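For the structured query +(a b) +c, the worst-case estimate can be worked through recursively. The tuple-based query tree and the DocFreq numbers below are assumptions made up for this sketch; only the sum/min rules come from the slide.

```python
def cost(node, doc_freqs):
    """Worst-case number of returned docs for a small query tree.

    A node is ("term", name), ("or", [children]) or ("and", [children]).
    """
    kind, payload = node
    if kind == "term":
        return doc_freqs[payload]
    child_costs = [cost(child, doc_freqs) for child in payload]
    if kind == "or":
        return sum(child_costs)                  # disjunction: sum of DocFreqs
    if kind == "and":
        return min(child_costs)                  # conjunction: sparsest clause bounds it
    raise ValueError(f"unknown node type: {kind}")

# +(a b) +c with hypothetical document frequencies
query = ("and", [("or", [("term", "a"), ("term", "b")]), ("term", "c")])
doc_freqs = {"a": 1000, "b": 500, "c": 200}
estimate = cost(query, doc_freqs)                # min(1000 + 500, 200)
```

Here the disjunction (a b) is estimated at 1500 docs, so the conjunction would be led by c with its cost of 200.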
Maxscore Top-k Scoring Algorithm ¹ (Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
– Requires final run over whole index (i.e. only for static indexes)

¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

Index Sorting (≥ 4.3) (Lucene-4752)
● Advantages (if appropriate sort order chosen)
– Better compression → more locality → faster processing
– Early termination
● Use together with EarlyTerminatingSortingCollector
– Can terminate scoring within sorted segments
– Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details

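The early-termination idea can be sketched as follows. This is not the EarlyTerminatingSortingCollector API: segments are modelled as hypothetical (hits, is_sorted) pairs, where hits are the matching docs in index order as (sort_key, doc_id) tuples and a smaller sort key ranks higher.

```python
import heapq

def top_k(segments, k):
    """Collect the k best hits; stop after k hits in segments sorted by the key."""
    candidates = []
    visited = 0
    for hits, is_sorted in segments:
        # In a sorted segment the first k hits are already that segment's best k,
        # so collection can terminate early; unsorted segments are fully visited.
        limit = k if is_sorted else len(hits)
        for hit in hits[:limit]:
            candidates.append(hit)
            visited += 1
    return heapq.nsmallest(k, candidates), visited

# Hypothetical segments: one already sorted, one not yet sorted
segments = [
    ([(1, "a"), (3, "b"), (5, "c"), (7, "d")], True),
    ([(6, "e"), (2, "f"), (4, "g")], False),
]
best, visited = top_k(segments, k=2)
```

In this toy case five of the seven hits are visited; the saving grows with the share of the index that lives in sorted segments.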
Parallelization
● In general, sharding is better:
– Shared-nothing
– Better use cores for handling load
● Multi-threaded query execution:
– Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
– Dynamic indexes: Lucene-2840, Lucene-5299

Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome

We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us.
Get in touch!
developer.here.com/geocoder

Thank You!
Contact
Email : stefan.pohl@here.com
Web : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan
developer.here.com/geocoder

Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 

Recently uploaded (20)

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Query Latency Optimization with Lucene

  • 1. Query Latency Optimization Stefan Pohl stefan.pohl@here.com Sr. Research Engineer, Ph.D.
  • 2. Who Am I ● Search user, developer, researcher ● Many years in industry & academia ● Ph.D. in Information Retrieval ● Interests: Search, Big Data, Machine Learning ● Currently working on the Geocoding offer of HERE, Nokia's Location Platform ● Spare time: Lucene contributor 7 Nov 2013 Query Latency Optimization with Lucene 2
  • 3. Agenda ● Motivation ● Latency Optimization ● Query Processing / Scoring ● Recent Developments in Lucene
  • 4. Motivation: Query Latency ● Human reaction time: 200 ms * → Backend latency: << 200 ms ● Faster queries mean a higher manageable load ● Costs * Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
  • 5. Motivation: Query Latency Distribution
  • 6. Latency Optimization
  • 7. First: Do Your Homework ● Keep enough RAM for the OS (disk buffer cache) ● Reduce HDD “pressure” (e.g. throttle indexing) ● SSDs ● Warming ● Ideally: your index fits in memory See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  • 8. Mining Hypothesis ● Check if query latencies are reproducible ● If not, try to find correlations with system events: – Many new incoming docs to index? – Other daemons spike in disk or CPU activity? – Garbage collections? – Other sar statistics (e.g. paging) ● If yes, profile: – First, your code – Don't instrument Lucene internal low-level classes
  • 9. Hypothesis Testing ● You really think you understand the problem and have a potential solution? ● Try it out (if it's cheap)! ● Otherwise, think of (cheap) experiments that – Give confidence – Tell you (and others) what the gains are (ROI)
  • 10. Example: In-memory ● Buy more memory / bigger machine!? ● Simulate¹ – Consecutively execute the same query multiple times – Much lower memory requirement (i.e. the size of the involved postings) – Repeat for sample of queries of interest ● Gives lower bound on query latency ¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
  • 11. Query Processing
  • 12. Conjunctions (i.e. AND / Occur.MUST) ● Sort Boolean clauses by increasing DocFreq f_t
  • 13. Conjunctions (i.e. AND / Occur.MUST) ● Next() on sparsest posting list (“lead”)
  • 14. Conjunctions (i.e. AND / Occur.MUST) ● Advance(18) on next sparsest posting list → fail
  • 15. Conjunctions (i.e. AND / Occur.MUST) ● Start all over again with “lead”, but advance(22)
  • 16. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 17. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 18. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 19. Conjunctions (i.e. AND / Occur.MUST) ● Match found → R = {31
  • 20. Conjunctions (i.e. AND / Occur.MUST) ● Next() on “lead” → R = {31}
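
The conjunction walk-through on slides 12 to 20 can be sketched in plain Java. This is a minimal illustration, not Lucene's actual scorer: the `Postings` class, its linear-scan `advance()`, and the `intersect` helper are hypothetical stand-ins, and real posting lists use skip data rather than a scan. The sketch sorts clauses by list length (a proxy for DocFreq), iterates the sparsest list as the lead, and restarts from the lead whenever another list overshoots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cursor over one sorted posting list (hypothetical, for illustration only). */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        int size() { return docs.length; }

        int docId() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves to the following document. */
        int next() { pos++; return docId(); }

        /** Moves to the first document >= target (linear scan in this sketch). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docId();
        }
    }

    /** Returns the doc ids present in every posting list. */
    static List<Integer> intersect(Postings... clauses) {
        Postings[] sorted = clauses.clone();
        // Sparsest list first, so it can act as the lead.
        Arrays.sort(sorted, Comparator.comparingInt(Postings::size));
        List<Integer> result = new ArrayList<>();
        if (sorted.length == 0) return result;

        Postings lead = sorted[0];
        int doc = lead.next();
        while (doc != NO_MORE_DOCS) {
            boolean matched = true;
            for (int i = 1; i < sorted.length; i++) {
                int other = sorted[i].advance(doc);
                if (other != doc) {
                    // Fail: start over with the lead, skipping to where the other list landed.
                    doc = lead.advance(other);
                    matched = false;
                    break;
                }
            }
            if (matched) {
                result.add(doc);
                doc = lead.next();
            }
        }
        return result;
    }
}
```

With made-up lists `{18, 31}`, `{2, 22, 31, 37}` and `{4, 5, 7, 9, 22, 31, 32}`, the lead starts at 18, the second list answers `advance(18)` with 22, the lead then skips to 31, and all lists agree on 31.
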
  • 21. Disjunctions (i.e. OR / Occur.SHOULD)
  • 22. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all clauses
  • 23. Disjunctions (i.e. OR / Occur.SHOULD) ● Track clauses in min-heap → R = {2
  • 24. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4
  • 25. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4,5
  • 26. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7
  • 27. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9
  • 28. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11
  • 29. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12
  • 30. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16
  • 31. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18
  • 32. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20
  • 33. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22
  • 34. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26
  • 35. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27
  • 36. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29
  • 37. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31
  • 38. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32
  • 39. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37
  • 40. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
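
The disjunction steps on slides 21 to 40 amount to a heap-based merge of the posting lists. The following is a rough sketch under assumptions rather than Lucene's implementation: `Cursor` and `union` are hypothetical names, and each input array is assumed to be sorted and free of duplicates. Every posting passes through the heap once, so n postings over |C| clauses cost on the order of n log |C| heap work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {
    /** Position inside one sorted posting list (hypothetical helper). */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        boolean exhausted() { return pos >= docs.length; }

        int docId() { return docs[pos]; }
    }

    /** Returns every doc id that occurs in at least one posting list, in increasing order. */
    static List<Integer> union(int[]... postings) {
        // Min-heap ordered by each clause's current doc id.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.docId(), b.docId()));
        for (int[] p : postings) {
            if (p.length > 0) heap.add(new Cursor(p));
        }

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().docId();
            result.add(doc);
            // Call "next" on all clauses that matched this doc, then re-insert them.
            while (!heap.isEmpty() && heap.peek().docId() == doc) {
                Cursor c = heap.poll();
                c.pos++;
                if (!c.exhausted()) heap.add(c);
            }
        }
        return result;
    }
}
```

A cursor is only mutated after it has been polled, so the heap ordering is never invalidated while an element is inside it.
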
  • 41. Why Can Query Processing Be Slow? ● Disjunctive Processing: O(n log |C|) – High DF terms (large n) – Many terms (large |C|), e.g. query expansion – No / too little use of advance() ● Filter (over-use)
  • 42. Filter ● Aims: – (Pre-)computation of common sub-queries – Cache result – Don't influence scoring ● Limitation: – Additional cost for 1st query – Currently, no skip information generated → Adding filter as a conjunct to queries can sometimes be faster, e.g. http://java.dzone.com/news/fast-lucene-search-filters
  • 43. Stopword Removal ● Removal of High-DocFreq terms from – Index: 10-30% space saving – Query: no very expensive terms ● Limitation: – “To be or not to be” ● In general, don't do it
  • 44. Minor, But Easy Improvements ● Reduce information, increase locality: – Don't store TF, if it's almost always 1 (and you don't need positions), fieldType.setIndexOptions(IndexOptions.DOCS_ONLY); – Use BlockPostingsFormat (default in Lucene ≥ 4.1) ● Tune Space/Time/Quality tradeoffs: – DirectDocValues – Less complex scoring function
  • 45. Recent Developments within Lucene
  • 46. MinShouldMatch (Lucene-4571) ● Don't want matches on only one (stop-)word? ● Enforce at least mm>1 terms to be present! ● Synthetic example query used during dev: Terms: ref restored struck wings dublin DocFreq: 3.8M 32k 32k 32k 32k E.g. mm=2: Conjunctive Processing: advance() Disjunctive Processing: next()
  • 47. MinShouldMatch (Lucene-4571)
  • 48. MinShouldMatch (Lucene-4571)
  • 49. MinShouldMatch (Lucene-4571) DocFreq: 3.8M 32k 32k 32k 32k HighDF 1/5: ref restored struck wings dublin HighDF 2/5: ref http struck wings dublin HighDF 3/5: ref http from wings dublin HighDF 4/5: ref http from name dublin HighDF 5/5: ref http from name title DocFreq: 3.8M 3.5M 3.2M 2.8M 2.4M
  • 50. MinShouldMatch – Results (Lucene-4571)
  • 51. MinShouldMatch – Open Questions (Lucene-4571) ● How bad is it to exclude docs that only match one, but an important term? ● Why is it enough to match any mm terms? ● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
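
To make the mm semantics concrete, here is a naive sketch in plain Java. It only shows which documents qualify when at least mm clauses must match; it does not attempt the advance()-based speed-up that Lucene-4571 is about. The class and method names are hypothetical, and each posting list is assumed to be sorted without duplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    /** Returns doc ids that occur in at least mm of the given posting lists. */
    static List<Integer> atLeast(int mm, int[]... postings) {
        // Heap entries: {current doc id, clause index, position within that clause's list}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) heap.add(new int[] {postings[i][0], i, 0});
        }

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek()[0];
            int matches = 0;
            // Count the clauses positioned on this doc and move each one forward.
            while (!heap.isEmpty() && heap.peek()[0] == doc) {
                int[] entry = heap.poll();
                matches++;
                int clause = entry[1];
                int nextPos = entry[2] + 1;
                if (nextPos < postings[clause].length) {
                    heap.add(new int[] {postings[clause][nextPos], clause, nextPos});
                }
            }
            if (matches >= mm) result.add(doc);
        }
        return result;
    }
}
```

With mm=1 this degenerates to a plain disjunction; raising mm drops documents that match only a single term, which is the behaviour the open questions above ask about.
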
  • 52. ReqOptSumScorer ● Benefit: – Conjunctive processing on required clauses – Calls advance() on optional clauses ● How do you determine which clauses are required? – Lookup term statistics (i.e. DocFreq) – 2nd lookup unnecessary, if you hand over stats to query
  • 53. CommonTermsQuery (≥ 4.1) (Lucene-4628) ● Looks up term infos (docfreq, posting list offset) ● Categorizes query terms as – Low-freq: At least one low-freq term MUST occur in result doc – High-freq: SHOULD occur in doc → their presence adds to score ● Executes query, but hands over term statistics → no 2nd round of term lookups necessary! ● Also supports MinShouldMatch
  • 54. Cost-Model (≥ 4.3) (Lucene-4607) ● What about structured queries? E.g. +(a b) +c ● Currently: worst-case estimate of returned #docs (docfreq) – Disjunctions: Σ_{c∈C} df_c – Conjunctions: min_{c∈C} df_c ● Limitations: – Effort to generate returned docs? – Only one cost (next() vs. advance()) ● Open Question: – Can we do better with more detailed cost models?
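
The worst-case estimate on slide 54 (sum of DocFreqs for disjunctions, minimum for conjunctions) can be written as a small recursive calculation. This is an illustrative sketch only: `Clause`, `term`, `or` and `and` are hypothetical helpers, not Lucene classes.

```java
final class CostModelSketch {
    /** A query clause that can report a worst-case estimate of matching docs. */
    interface Clause {
        long cost();
    }

    /** A single term costs its document frequency. */
    static Clause term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: at worst every clause contributes distinct docs, so costs add up. */
    static Clause or(Clause... clauses) {
        return () -> {
            long sum = 0;
            for (Clause c : clauses) sum += c.cost();
            return sum;
        };
    }

    /** Conjunction: cannot return more docs than its cheapest clause. */
    static Clause and(Clause... clauses) {
        return () -> {
            long min = Long.MAX_VALUE; // an empty conjunction is not meaningful here
            for (Clause c : clauses) min = Math.min(min, c.cost());
            return min;
        };
    }
}
```

For the structured query `+(a b) +c`, assuming DocFreqs of 3.8M for a and 32k for b (borrowed from the synthetic example on slide 46) and a made-up 50k for c, the disjunction costs 3,832,000 and the whole conjunction costs 50,000, so c would be the natural lead.
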
  • 55. Maxscore Top-k Scoring Algorithm¹ (Lucene-4100) ● Experimental prototype code attached to Lucene-4100 ● Limitation: – Requires final run over whole index (i.e. only for static indexes) ¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
  • 56. Index Sorting (≥ 4.3) (Lucene-4752) ● Advantages (if appropriate sort order chosen) – Better compression → more locality → faster processing – Early termination ● Use together with EarlyTerminatingSortingCollector – Can terminate scoring within sorted segments – Fully scores as-yet unsorted segments → see 2nd half of Shai & Adrian's talk yesterday for details
  • 57. Parallelization ● In general, sharding is better: – Shared-nothing – Better use cores for handling load ● Multi-threaded query execution: – Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards) – Dynamic indexes: Lucene-2840, Lucene-5299
  • 58. Summary ● Understand your problem ● Scoring can become an issue with many million docs ● Many recent efficiency improvements ● More to come... patches welcome
  • 59. We're Hiring @HERE Frankfurt, Berlin, Boston, Chicago. Come work with us. Get in touch! developer.here.com/geocoder
  • 60. Thank You! Contact Email : stefan.pohl@here.com Web : http://linkedin.com/in/stefanpohl Twitter : @pohlstefan developer.here.com/geocoder