OCTOBER 13-16, 2015 • AUSTIN, TX
Faceting optimizations for Solr
Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk
3/55
Overview

Web scale at the State and University Library,
Denmark

Field faceting 101

Optimizations
− Reuse
− Tracking
− Caching
− Alternative counters
4/55
Web scale for a small web

Denmark
− Consolidation circa 10th century
− 5.6 million people

Danish Net Archive (http://netarkivet.dk)
− Established 2005
− 20 billion items / 590TB+ raw data
5/55
Indexing 20 billion web items / 590TB into Solr

Solr index size is 1/9th of real data = 70TB

Each shard holds 200M documents / 900GB
− Shards built chronologically by a dedicated machine
− Projected 80 shards
− Current build time per shard: 4 days
− Total build time is 20 CPU-core years
− So far only 7.4 billion documents / 27TB in index
6/55
Searching a 7.4-billion-document / 27TB Solr index

SolrCloud with 2 machines, each having
− 16 HT-cores, 256GB RAM, 25 * 930GB SSD
− 25 shards @ 900GB
− 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
− Disk cache 100GB or < 1% of index size
7/55
8/55
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
for entry: priorityQueue
  result.add(resolveTerm(entry.ordinal), entry.count)
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
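The counting loop above can be sketched as runnable code. This is a minimal illustration of the idea, not Solr's implementation: `get_ordinals`, the toy term list, and the per-document ordinals are hypothetical stand-ins for Lucene's DocValues ordinals.

```python
import heapq

def facet_counts(doc_ids, get_ordinals, num_ordinals, terms, top_x):
    counter = [0] * num_ordinals            # one slot per unique term
    for doc_id in doc_ids:                  # iterate the search result
        for ordinal in get_ordinals(doc_id):
            counter[ordinal] += 1
    # keep the top-X ordinals by count, then resolve ordinals to terms
    top = heapq.nlargest(top_x, range(num_ordinals), key=lambda o: counter[o])
    return [(terms[o], counter[o]) for o in top if counter[o] > 0]

# Toy index: four terms, per-document ordinals
terms = ["A", "B", "C", "D"]
doc_ords = {0: [3], 1: [1, 3], 2: [3], 3: [1]}
print(facet_counts(doc_ords.keys(), doc_ords.get, len(terms), terms, 2))
# → [('D', 3), ('B', 2)]
```

Note that the counter array is sized by the number of unique terms in the field, not by the result set, which is exactly why it gets expensive at 200M terms.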
9/55
Test setup 1 (easy start)

Solr setup
− 16 HT-cores, 256GB RAM, SSD
− Single shard 250M documents / 900GB

URL field
− Single String value
− 200M unique terms

3 concurrent “users”

Random search terms
10/55
Vanilla Solr, single shard, 250M documents, 200M values, 3 users
11/55
Allocating and dereferencing 800MB arrays
12/55
Reuse the counter
counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>
13/55
Reuse the counter
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters
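A counter pool along these lines can be sketched as follows. The class and method names are illustrative, not the Solr API; the point is that clearing and recycling a large array is cheaper than allocating a fresh one and leaving the old one to the garbage collector.

```python
import threading

class CounterPool:
    """Recycles counter arrays between facet calls instead of reallocating."""
    def __init__(self, num_ordinals, max_pooled=2):
        self.num_ordinals = num_ordinals
        self.max_pooled = max_pooled
        self.pool = []
        self.lock = threading.Lock()

    def get_counter(self):
        with self.lock:
            if self.pool:
                return self.pool.pop()      # reuse a previously cleared counter
        return [0] * self.num_ordinals      # allocate only when the pool is empty

    def release(self, counter):
        for i in range(len(counter)):       # clear before pooling so the next
            counter[i] = 0                  # caller gets an all-zero array
        with self.lock:
            if len(self.pool) < self.max_pooled:
                self.pool.append(counter)

pool = CounterPool(num_ordinals=8)
c1 = pool.get_counter()
c1[3] += 1
pool.release(c1)
c2 = pool.get_counter()
assert c2 is c1 and c2[3] == 0              # same array, cleared
```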
14/55
Using and clearing 800MB arrays
15/55
Reusing counters vs. not doing so
16/55
Reusing counters, now with readable visualization
17/55
Reusing counters, now with readable visualization
Why does it always take more than 500ms?
18/55
Iteration is not free
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
200M unique terms = 800MB
19/55
Tracking updated counters
ord:     0 1 2 3 4 5 6 7 8
counter: 0 0 0 0 0 0 0 0 0
tracker: (empty)
20/55
Tracking updated counters
after counter[3]++:
ord:     0 1 2 3 4 5 6 7 8
counter: 0 0 0 1 0 0 0 0 0
tracker: 3
21/55
Tracking updated counters
after counter[3]++, counter[1]++:
ord:     0 1 2 3 4 5 6 7 8
counter: 0 1 0 1 0 0 0 0 0
tracker: 3, 1
22/55
Tracking updated counters
after counter[3]++, counter[1]++ ×3:
ord:     0 1 2 3 4 5 6 7 8
counter: 0 3 0 1 0 0 0 0 0
tracker: 3, 1
23/55
Tracking updated counters
after counter[3]++, repeated counter[1]++, counter[8]++, counter[4]++, counter[5]++, …:
ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3
tracker: 3, 1, 8, 4, 5
24/55
Tracking updated counters
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    if counter[ordinal]++ == 0 && tracked < maxTracked
      tracker[tracked++] = ordinal
if tracked < maxTracked
  for i = 0 ; i < tracked ; i++
    priorityQueue.add(tracker[i], counter[tracker[i]])
else
  for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3
tracker: 3, 1, 8, 4, 5
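The tracking idea can be sketched as runnable code. This is an illustrative model, not Solr internals: the tracker remembers which ordinals were touched so that the final sweep can skip the millions of untouched zero entries, falling back to a full scan only when the tracker overflows.

```python
import heapq

def tracked_facet(doc_ids, get_ordinals, num_ordinals, max_tracked, top_x):
    counter = [0] * num_ordinals
    tracker = []
    for doc_id in doc_ids:
        for ordinal in get_ordinals(doc_id):
            if counter[ordinal] == 0 and len(tracker) < max_tracked:
                tracker.append(ordinal)        # first hit on this ordinal: remember it
            counter[ordinal] += 1
    if len(tracker) < max_tracked:
        candidates = tracker                   # sparse result: visit only touched ordinals
    else:
        candidates = range(num_ordinals)       # tracker overflowed: full sweep
    return heapq.nlargest(top_x, ((counter[o], o) for o in candidates))

doc_ords = {0: [3], 1: [1, 3], 2: [3], 3: [1]}
print(tracked_facet(doc_ords.keys(), doc_ords.get, 1000, 100, 2))
# → [(3, 3), (2, 1)]  (count, ordinal) pairs
```

With 200M ordinals but only a handful of hits, the candidate sweep shrinks from the full counter length to the tracked handful, which is what removes the constant ~500ms floor.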
25/55
Tracking updated counters
26/55
Distributed faceting
Phase 1) All shards perform faceting.
The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards
that did not return them in phase 1.
The Merger calculates the final counts for the top-X terms.
for term: fineCountRequest.getTerms()
  result.add(term,
    searcher.numDocs(query(field:term), base.getDocIDs()))
27/55
Test setup 2 (more shards, smaller field)

Solr setup
− 16 HT-cores, 256GB RAM, SSD
− 9 shards @ 250M documents / 900GB

domain field
− Single String value
− 1.1M unique terms per shard

1 concurrent “user”

Random search terms
28/55
Pit of Pain™ (or maybe “Horrible Hill”?)
29/55
Fine counting can be slow
Phase 1: Standard faceting
Phase 2:
for term: fineCountRequest.getTerms()
  result.add(term,
    searcher.numDocs(query(field:term), base.getDocIDs()))
30/55
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter.increment(ordinal)
for term: fineCountRequest.getTerms()
  result.add(term, counter.get(getOrdinal(term)))
Same counting loop as phase 1, which yields:
ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3
31/55
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
  result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
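A sketch of the cached-counter idea: keep the phase-1 counter alive, keyed by the query, so the phase-2 fine-count request can read counts directly instead of re-running `numDocs()` intersections. The class and key format are hypothetical, not the Solr API.

```python
class KeyedCounterCache:
    """Holds phase-1 counter arrays, keyed by query, for phase-2 fine counting."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.cache = {}                        # key -> counter array

    def put(self, key, counter):
        if len(self.cache) >= self.capacity:   # crude eviction: drop the oldest entry
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = counter

    def get(self, key):
        return self.cache.pop(key, None)       # claim the counter, if still cached

cache = KeyedCounterCache()
counter = [0, 3, 0, 1006, 1, 1, 0, 0, 3]      # state left over from phase 1
cache.put("q=foo&facet.field=domain", counter)

# Phase 2: resolve the requested terms to ordinals and read the cached counts
ordinal_of = {"D": 3, "B": 1}                  # toy term -> ordinal lookup
cached = cache.get("q=foo&facet.field=domain")
fine_counts = {t: cached[o] for t, o in ordinal_of.items()}
print(fine_counts)
# → {'D': 1006, 'B': 3}
```

Reading a few array cells is effectively free compared to per-term doc-set intersections, which is what flattens the Pit of Pain™.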
32/55
Pit of Pain™ practically eliminated
33/55
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com
34/55
Test setup 3 (more shards, more fields)

Solr setup
− 16 HT-cores, 256GB RAM, SSD
− 23 shards @ 250M documents / 900GB

Faceting on 6 fields
− url: ~200M unique terms / shard
− domain & host: ~1M unique terms each / shard
− type, suffix, year: < 1000 unique terms / shard
35/55
1 machine, 7 billion documents / 23TB total index, 6 facet fields
36/55
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB
Field    References      Max docs/term   Unique terms
domain   250,000,000     3,000,000       1,100,000
url      250,000,000     56,000          200,000,000
links    5,800,000,000   5,000,000       610,000,000
2440 MB / counter
37/55
Remember: 1 machine = 25 shards
25 shards / 7 billion / 23TB
Field    References        Max docs/term   Unique terms
domain   7,000,000,000     3,000,000       ~25,000,000
url      7,000,000,000     56,000          ~5,000,000,000
links    125,000,000,000   5,000,000       ~15,000,000,000
60 GB / facet call
38/55
Different distributions
domain 1.1M url 200M links 600M
High max
Low max
Very long tail
Short tail
39/55
Theoretical lower limit per counter: log2(max_count) bits
(examples: max=1, max=3, max=7, max=63, max=2047)
40/55
int vs. PackedInts
int[ordinals]:
  domain: 4 MB, url: 780 MB, links: 2350 MB
PackedInts(ordinals, maxBPV):
  domain: 3 MB (72%), url: 420 MB (53%), links: 1760 MB (75%)
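The int[] vs. PackedInts sizes can be sanity-checked with back-of-envelope arithmetic: a packed counter needs only ceil(log2(max_docs_per_term + 1)) bits per value instead of 32. The field statistics below are taken from the slides; the computed sizes come out close to the quoted figures.

```python
from math import ceil, log2

def packed_mb(unique_terms, max_docs_per_term):
    """Return (size in MB, bits per value) for a packed counter array."""
    bpv = ceil(log2(max_docs_per_term + 1))    # smallest bit width that fits max count
    return unique_terms * bpv / 8 / 1024**2, bpv

for field, terms_count, max_docs in [
    ("domain", 1_100_000, 3_000_000),
    ("url", 200_000_000, 56_000),
    ("links", 610_000_000, 5_000_000),
]:
    int_mb = terms_count * 4 / 1024**2         # plain int[] = 32 bits per value
    packed, bpv = packed_mb(terms_count, max_docs)
    print(f"{field}: int[] {int_mb:.0f} MB, packed {packed:.0f} MB ({bpv} bits/value)")
```

For url, for example, a maximum of 56,000 docs/term fits in 16 bits, halving the counter relative to int[].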
41/55
n-plane-z counters
Platonic ideal vs. harsh reality
Plane d
Plane c
Plane b
Plane a
46/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111
47/55
Comparison of counter structures
int[ordinals]:
  domain: 4 MB, url: 780 MB, links: 2350 MB
PackedInts(ordinals, maxBPV):
  domain: 3 MB (72%), url: 420 MB (53%), links: 1760 MB (75%)
n-plane-z:
  domain: 1 MB (30%), url: 66 MB (8%), links: 311 MB (13%)
48/55
Speed comparison
49/55
I could go on about

Threaded counting

Heuristic faceting

Fine count skipping

Counter capping

Monotonically increasing tracker for n-plane-z

Regexp filtering
50/55
What about huge result sets?

Rare for explorative term-based searches

Common for batch extractions

Threading works poorly as #shards > #CPUs

But how bad is it really?
51/55
Really bad! 8 minutes
52/55
Heuristic faceting

Use sampling to guess top-X terms
− Re-use the existing tracked counters
− 1:1000 sampling seems usable for the field links,
which has 5 billion references per shard

Fine-count the guessed terms
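The sampling idea can be sketched as runnable code. This is an illustrative model of heuristic faceting, not the plug-in's implementation: count only every Nth document to guess the top-X terms, over-provision the guess to absorb sampling noise, then fine-count only the guessed terms against the full result.

```python
from collections import Counter

def heuristic_top_terms(doc_ids, get_ordinals, sample_rate, top_x, overprovision=2):
    sampled = Counter()
    for i, doc_id in enumerate(doc_ids):
        if i % sample_rate == 0:               # e.g. 1:1000 sampling on the slides
            for ordinal in get_ordinals(doc_id):
                sampled[ordinal] += 1
    # guess more than top_x terms to compensate for sampling noise
    return [o for o, _ in sampled.most_common(top_x * overprovision)]

doc_ids = list(range(10_000))
get_ordinals = lambda d: [d % 7]               # toy field: 7 terms, uniform spread
guessed = heuristic_top_terms(doc_ids, get_ordinals, sample_rate=100, top_x=2)
# fine-count the guessed ordinals exactly, over the full result set
exact = Counter(o for d in doc_ids for o in get_ordinals(d))
print({o: exact[o] for o in guessed})
```

The guess phase touches 1/sample_rate of the references; only the handful of guessed terms pays the exact-count price.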
53/55
Over-provisioning helps validity
54/55
10 seconds < 8 minutes
55/55
Never enough time, but talk to me about

Threaded counting

Monotonically increasing tracker for n-plane-z

Regexp filtering

Fine count skipping

Counter capping
56/55
Extra info
The techniques presented can be tested with sparse faceting, available as a drop-in replacement
WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be
implemented, but the timeframe is unknown.
There are currently no plans to incorporate the full feature set into the official Solr distribution.
The suggested approach for incorporation is to split it into multiple independent or
semi-independent features, starting with those applicable to most people, such as the distributed
faceting fine-count optimization.
In-depth descriptions and performance tests of the different features can be found at
https://sbdevel.wordpress.com.
57/55
18M documents / 50GB, facet on 5 fields (2*10M values, 3*smaller)
58/55
6 billion docs / 20TB, 25 shards, single machine
facet on 6 fields (1*4000M, 2*20M, 3*smaller)
59/55
7 billion docs / 23TB, 25 shards, single machine
facet on 5 fields (2*20M, 3*smaller)

More Related Content

What's hot

05 Analysis of Algorithms: Heap and Quick Sort - Corrected
05 Analysis of Algorithms: Heap and Quick Sort - Corrected05 Analysis of Algorithms: Heap and Quick Sort - Corrected
05 Analysis of Algorithms: Heap and Quick Sort - Corrected
Andres Mendez-Vazquez
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
John Conley
 
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Databricks
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
Databricks
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Altinity Ltd
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
Mariusz Gil
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
Samir Bessalah
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практики
SQALab
 
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
andyseaborne
 
SSN-TC workshop talk at ISWC 2015 on Emrooz
SSN-TC workshop talk at ISWC 2015 on EmroozSSN-TC workshop talk at ISWC 2015 on Emrooz
SSN-TC workshop talk at ISWC 2015 on Emrooz
Markus Stocker
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
DataWorks Summit/Hadoop Summit
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
Julien Le Dem
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Databricks
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Databricks
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxData
 

What's hot (20)

05 Analysis of Algorithms: Heap and Quick Sort - Corrected
05 Analysis of Algorithms: Heap and Quick Sort - Corrected05 Analysis of Algorithms: Heap and Quick Sort - Corrected
05 Analysis of Algorithms: Heap and Quick Sort - Corrected
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
 
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практики
 
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore
 
SSN-TC workshop talk at ISWC 2015 on Emrooz
SSN-TC workshop talk at ISWC 2015 on EmroozSSN-TC workshop talk at ISWC 2015 on Emrooz
SSN-TC workshop talk at ISWC 2015 on Emrooz
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 

Viewers also liked

Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Lucidworks
 
Top apache solr features
Top apache solr featuresTop apache solr features
Top apache solr featuresIntellipaat
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
David Smiley
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lucidworks
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
thelabdude
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
thelabdude
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
Cominvent AS
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Svetlin Nakov
 
Using Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategyUsing Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategy
Craig Martin
 

Viewers also liked (11)

Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
 
Top apache solr features
Top apache solr featuresTop apache solr features
Top apache solr features
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
 
Using Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategyUsing Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategy
 

Similar to Faceting optimizations for Solr

Adventures in RDS Load Testing
Adventures in RDS Load TestingAdventures in RDS Load Testing
Adventures in RDS Load Testing
Mike Harnish
 
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach ShoolmanRedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
Redis Labs
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Lucidworks
 
Leveraging the Power of Solr with Spark
Leveraging the Power of Solr with SparkLeveraging the Power of Solr with Spark
Leveraging the Power of Solr with Spark
QAware GmbH
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
thelabdude
 
Analyzing and Interpreting AWR
Analyzing and Interpreting AWRAnalyzing and Interpreting AWR
Analyzing and Interpreting AWR
pasalapudi
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
EDB
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Oracle Result Cache deep dive
Oracle Result Cache deep diveOracle Result Cache deep dive
Oracle Result Cache deep dive
Alexander Tokarev
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Timescale
 
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextTuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Lucidworks
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
HBaseCon
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
Jayant Shekhar
 
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Seunghyun Lee
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
Michael Stack
 
Oracle result cache highload 2017
Oracle result cache highload 2017Oracle result cache highload 2017
Oracle result cache highload 2017
Alexander Tokarev
 
Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015Beyond lists - Copenhagen 2015
Beyond lists - Copenhagen 2015
Phillip Trelford
 
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
Ontico
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 

Similar to Faceting optimizations for Solr (20)

Adventures in RDS Load Testing
Adventures in RDS Load TestingAdventures in RDS Load Testing
Adventures in RDS Load Testing
 
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach ShoolmanRedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
 
Leveraging the Power of Solr with Spark
Leveraging the Power of Solr with SparkLeveraging the Power of Solr with Spark
Leveraging the Power of Solr with Spark
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Analyzing and Interpreting AWR
Analyzing and Interpreting AWRAnalyzing and Interpreting AWR
Analyzing and Interpreting AWR
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Oracle Result Cache deep dive
Oracle Result Cache deep diveOracle Result Cache deep dive
Oracle Result Cache deep dive
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Apache con 2020 use cases and optimizations of iotdb
Update on OpenTSDB and AsyncHBase
Unifying your data management with Hadoop
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
Oracle result cache highload 2017
Beyond lists - Copenhagen 2015
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
Building a Large Scale SEO/SEM Application with Apache Solr

Faceting optimizations for Solr

  • 1. OCTOBER 13-16, 2015 • AUSTIN, TX
  • 2. Faceting optimizations for Solr Toke Eskildsen Search Engineer / Solr Hacker State and University Library, Denmark @TokeEskildsen / te@statsbiblioteket.dk
  • 3. 3 3/55 Overview  Web scale at the State and University Library, Denmark  Field faceting 101  Optimizations − Reuse − Tracking − Caching − Alternative counters
  • 4. 4/55 Web scale for a small web  Denmark − Consolidation circa 10th century − 5.6 million people  Danish Net Archive (http://netarkivet.dk) − Constitution 2005 − 20 billion items / 590TB+ raw data
  • 5. 5/55 Indexing 20 billion web items / 590TB into Solr  Solr index size is 1/9th of real data = 70TB  Each shard holds 200M documents / 900GB − Shards build chronologically by dedicated machine − Projected 80 shards − Current build time per shard: 4 days − Total build time is 20 CPU-core years − So far only 7.4 billion documents / 27TB in index
  • 6. 6/55 Searching a 7.4 billion documents / 27TB Solr index  SolrCloud with 2 machines, each having − 16 HT-cores, 256GB RAM, 25 * 930GB SSD − 25 shards @ 900GB − 1 Solr/shard/SSD, Xmx=8g, Solr 4.10 − Disk cache 100GB or < 1% of index size
  • 8. 8/55 String faceting 101 (single shard)
      counter = new int[ordinals]
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          counter[ordinal]++
      for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
      for entry: priorityQueue
        result.add(resolveTerm(ordinal), count)
      ord term counter
        0   A      0
        1   B      3
        2   C      0
        3   D   1006
        4   E      1
        5   F      1
        6   G      0
        7   H      0
        8   I      3
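The counting loop on slide 8 can be sketched in Python. This is a minimal illustration of the slide's pseudocode, not Solr code; `facet_counts`, `ordinals_of` and `resolve_term` are made-up names.

```python
import heapq

def facet_counts(doc_ids, ordinals_of, num_ordinals, resolve_term, top_x):
    """Slide 8 in Python: one int counter per unique term ordinal."""
    counter = [0] * num_ordinals           # the big int[ordinals] array
    for doc_id in doc_ids:                 # iterate the result set
        for ordinal in ordinals_of(doc_id):
            counter[ordinal] += 1
    # heapq.nlargest plays the role of the priority queue over all ordinals
    top = heapq.nlargest(top_x, range(num_ordinals), key=lambda o: counter[o])
    return [(resolve_term(o), counter[o]) for o in top]
```

With 200M unique terms, `counter` alone is the 800MB allocation the following slides optimize.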
  • 9. 9/55 Test setup 1 (easy start)  Solr setup − 16 HT-cores, 256GB RAM, SSD − Single shard 250M documents / 900GB  URL field − Single String value − 200M unique terms  3 concurrent “users”  Random search terms
  • 10. 10/55 Vanilla Solr, single shard, 250M documents, 200M values, 3 users
  • 12. 12/55 Reuse the counter
      counter = new int[ordinals]
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          counter[ordinal]++
      for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
      <counter no longer referenced and will be garbage collected at some point>
  • 13. 13/55 Reuse the counter
      counter = pool.getCounter()
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          counter[ordinal]++
      for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
      pool.release(counter)
      Note: The JSON Facet API in Solr 5 already supports reuse of counters
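A minimal Python sketch of the counter pool from slide 13. Per the editor's notes the real plug-in clears released counters in a background thread; this sketch clears inline for brevity, and all names are illustrative.

```python
import threading

class CounterPool:
    """Hand out cleared counter arrays for reuse instead of re-allocating."""
    def __init__(self, num_ordinals):
        self.num_ordinals = num_ordinals
        self.free = []                       # cleared counters ready for reuse
        self.lock = threading.Lock()

    def get_counter(self):
        with self.lock:
            if self.free:
                return self.free.pop()
        return [0] * self.num_ordinals       # pool empty: allocate a fresh one

    def release(self, counter):
        for i in range(len(counter)):        # clear before handing back
            counter[i] = 0
        with self.lock:
            self.free.append(counter)
```

Reuse trades allocation plus garbage collection for an explicit clear, which slide 14 shows is itself not free.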
  • 14. 14/55 Using and clearing 800MB arrays
  • 16. 16/55 Reusing counters, now with readable visualization
  • 17. 17/55 Reusing counters, now with readable visualization Why does it always take more than 500ms?
  • 18. 18/55 Iteration is not free
      counter = pool.getCounter()
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          counter[ordinal]++
      for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
      pool.release(counter)
      200M unique terms = 800MB
  • 19. 19/55 Tracking updated counters (start: all counters 0, tracker empty)
      counter: 0 0 0 0 0 0 0 0 0   tracker: (empty)
  • 20. 20/55 After counter[3]++
      counter: 0 0 0 1 0 0 0 0 0   tracker: 3
  • 21. 21/55 After counter[1]++
      counter: 0 1 0 1 0 0 0 0 0   tracker: 3 1
  • 22. 22/55 After two more counter[1]++
      counter: 0 3 0 1 0 0 0 0 0   tracker: 3 1
  • 23. 23/55 After the full update sequence
      counter: 0 3 0 1006 1 1 0 0 3   tracker: 3 1 8 4 5
  • 24. 24/55 Tracking updated counters
      counter = pool.getCounter()
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          if counter[ordinal]++ == 0 && tracked < maxTracked
            tracker[tracked++] = ordinal
      if tracked < maxTracked
        for i = 0 ; i < tracked ; i++
          priorityQueue.add(tracker[i], counter[tracker[i]])
      else
        for ordinal = 0 ; ordinal < counter.length ; ordinal++
          priorityQueue.add(ordinal, counter[ordinal])
      counter: 0 3 0 1006 1 1 0 0 3   tracker: 3 1 8 4 5
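Slide 24's tracker can be sketched in Python: when only a few counters were touched, the expensive scan over all ordinals is skipped. Function and parameter names are illustrative, not from the plug-in.

```python
import heapq

def facet_with_tracker(doc_ids, ordinals_of, counter, max_tracked, top_x):
    """Record which counters were touched so the final pass can skip
    the (typically huge) run of ordinals that stayed at zero."""
    tracker = []
    for doc_id in doc_ids:
        for ordinal in ordinals_of(doc_id):
            if counter[ordinal] == 0 and len(tracker) < max_tracked:
                tracker.append(ordinal)      # first touch: remember the ordinal
            counter[ordinal] += 1
    if len(tracker) < max_tracked:           # tracker is complete: visit only touched slots
        candidates = tracker
    else:                                    # tracker overflowed: fall back to full scan
        candidates = range(len(counter))
    return heapq.nlargest(top_x, ((counter[o], o) for o in candidates))
```

The fallback keeps the worst case identical to plain faceting while small result sets avoid iterating all 200M counters.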
  • 26. 26/55 Distributed faceting. Phase 1) All shards perform faceting; the merger calculates the top-X terms. Phase 2) The term counts are requested from the shards that did not return them in phase 1; the merger calculates the final counts for the top-X terms.
      for term: fineCountRequest.getTerms()
        result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
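The two phases can be sketched as a merger in Python. Shard communication is reduced to plain callbacks; `shard_fine_count` and the dict-based shard results are assumptions for illustration only.

```python
from collections import Counter

def merge_facets(shard_top_lists, shard_fine_count, top_x):
    """Phase 1: merge per-shard top terms into candidates.
    Phase 2: fetch the counts a shard did not report, then finalize."""
    merged = Counter()
    for top in shard_top_lists:                        # phase 1
        merged.update(top)
    candidates = [term for term, _ in merged.most_common(top_x)]
    final = Counter()
    for shard_id, top in enumerate(shard_top_lists):   # phase 2
        for term in candidates:
            if term in top:
                final[term] += top[term]
            else:                                      # fine count request
                final[term] += shard_fine_count(shard_id, term)
    return final.most_common(top_x)
```

The fine count requests in phase 2 are exactly the part slide 29 identifies as slow.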
  • 27. 27/55 Test setup 2 (more shards, smaller field)  Solr setup − 16 HT-cores, 256GB RAM, SSD − 9 shards @ 250M documents / 900GB  domain field − Single String value − 1.1M unique terms per shard  1 concurrent “user”  Random search terms
  • 28. 28/55 Pit of Pain™ (or maybe “Horrible Hill”?)
  • 29. 29/55 Fine counting can be slow
      Phase 1: Standard faceting
      Phase 2:
        for term: fineCountRequest.getTerms()
          result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
  • 30. 30/55 Alternative fine counting
      counter = pool.getCounter()
      for docID: result.getDocIDs()
        for ordinal: getOrdinals(docID)
          counter.increment(ordinal)
      for term: fineCountRequest.getTerms()
        result.add(term, counter.get(getOrdinal(term)))
      (The counting part is the same as phase 1, which yields: counter = 0 3 0 1006 1 1 0 0 3)
  • 31. 31/55 Using cached counters from phase 1 in phase 2
      counter = pool.getCounter(key)
      for term: query.getTerms()
        result.add(term, counter.get(getOrdinal(term)))
      pool.release(counter)
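Extending the earlier pool with a key turns phase 2 into direct lookups, as slide 31 shows. This is a hedged sketch of that idea; the class and its `max_cached` bound are invented for illustration.

```python
class KeyedCounterPool:
    """Keep the phase-1 counter around under a key (e.g. a query hash) so
    the phase-2 fine count is an array lookup instead of a per-term
    intersection with the result set."""
    def __init__(self, num_ordinals, max_cached=10):
        self.num_ordinals = num_ordinals
        self.max_cached = max_cached
        self.cached = {}                     # key -> counter kept from phase 1

    def get_counter(self, key=None):
        if key in self.cached:
            return self.cached.pop(key)      # phase 2: reuse phase-1 counts
        return [0] * self.num_ordinals       # phase 1: fresh counter

    def release(self, counter, key=None):
        if key is not None and len(self.cached) < self.max_cached:
            self.cached[key] = counter       # keep the counts for fine counting
```

Caching by key is what flattens the "Pit of Pain" on slide 32: the shard never recounts what phase 1 already counted.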
  • 32. 32/55 Pit of Pain™ practically eliminated
  • 33. 33/55 Pit of Pain™ practically eliminated Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com
  • 34. 34/55 Test setup 3 (more shards, more fields)  Solr setup − 16 HT-cores, 256GB RAM, SSD − 23 shards @ 250M documents / 900GB  Faceting on 6 fields − url: ~200M unique terms / shard − domain & host: ~1M unique terms each / shard − type, suffix, year: < 1000 unique terms / shard
  • 35. 35/55 1 machine, 7 billion documents / 23TB total index, 6 facet fields
  • 36. 36/55 High-cardinality can mean different things. Single shard / 250,000,000 docs / 900GB
      Field   References      Max docs/term  Unique terms
      domain  250,000,000     3,000,000      1,100,000
      url     250,000,000     56,000         200,000,000
      links   5,800,000,000   5,000,000      610,000,000
      2440 MB / counter
  • 37. 37/55 Remember: 1 machine = 25 shards. 25 shards / 7 billion / 23TB
      Field   References        Max docs/term  Unique terms
      domain  7,000,000,000     3,000,000      ~25,000,000
      url     7,000,000,000     56,000         ~5,000,000,000
      links   125,000,000,000   5,000,000      ~15,000,000,000
      60 GB / facet call
  • 38. 38/55 Different distributions (chart): domain 1.1M, url 200M and links 600M unique terms, contrasting high vs. low maximum counts and very long vs. short tails
  • 39. 39/55 Theoretical lower limit per counter: log2(max_count) bits (shown for max = 1, 3, 7, 63 and 2047)
  • 40. 40/55 int vs. PackedInts
      int[ordinals]:                 domain: 4 MB, url: 780 MB, links: 2350 MB
      PackedInts(ordinals, maxBPV):  domain: 3 MB (72%), url: 420 MB (53%), links: 1760 MB (75%)
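The PackedInts idea, giving each counter just enough bits for its field's maximum count instead of 32, can be sketched with Python's arbitrary-precision int standing in for Lucene's packed long[] backing array. The class is an illustration, not Lucene's implementation.

```python
class PackedCounters:
    """Fixed bits-per-value counters: enough bits for max_count, no more."""
    def __init__(self, num_ordinals, max_count):
        self.bpv = max(1, max_count.bit_length())   # bits per value (maxBPV)
        self.mask = (1 << self.bpv) - 1
        self.bits = 0         # one big int standing in for a packed long[]
        self.num_ordinals = num_ordinals

    def increment(self, ordinal):
        # assumes the count never exceeds max_count (no overflow handling)
        self.bits += 1 << (ordinal * self.bpv)

    def get(self, ordinal):
        return (self.bits >> (ordinal * self.bpv)) & self.mask
```

For the links field (max 5,000,000 docs/term) this gives 23 bits per counter instead of 32, roughly the ratio behind the slide's 75% figure.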
  • 41. 41/55 n-plane-z counters (diagram: “Platonic ideal” vs. “harsh reality”, planes a–d)
  • 42.–46. 42/55–46/55 n-plane-z counter values (diagram: planes a–d fill up as a counter grows)
      L: 0 ≣ 000000
      L: 1 ≣ 000001
      L: 2 ≣ 000011
      L: 3 ≣ 000101
      L: 4 ≣ 000111
      L: 5 ≣ 001001
      L: 6 ≣ 001011
      L: 7 ≣ 001101
      ...
      L: 12 ≣ 010111
  • 47. 47/55 Comparison of counter structures
      int[ordinals]:                 domain: 4 MB, url: 780 MB, links: 2350 MB
      PackedInts(ordinals, maxBPV):  domain: 3 MB (72%), url: 420 MB (53%), links: 1760 MB (75%)
      n-plane-z:                     domain: 1 MB (30%), url: 66 MB (8%), links: 311 MB (13%)
  • 49. 49/55 I could go on about  Threaded counting  Heuristic faceting  Fine count skipping  Counter capping  Monotonically increasing tracker for n-plane-z  Regexp filtering
  • 50. 50/55 What about huge result sets?  Rare for explorative term-based searches  Common for batch extractions  Threading works poorly as #shards > #CPUs  But how bad is it really?
  • 52. 52/55 Heuristic faceting  Use sampling to guess top-X terms − Re-use the existing tracked counters − 1:1000 sampling seems usable for the field links, which has 5 billion references per shard  Fine-count the guessed terms
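The sampling step of heuristic faceting can be sketched as follows. This is a plain illustration; the names and the simple every-Nth-document sampling are assumptions, not the plug-in's sampling strategy.

```python
import heapq

def heuristic_top_terms(doc_ids, ordinals_of, num_ordinals, sample_factor, guess_x):
    """Guess the top terms from a 1:sample_factor document sample; the
    guessed ordinals are then fine-counted exactly afterwards."""
    counter = [0] * num_ordinals
    for doc_id in doc_ids[::sample_factor]:  # only every sample_factor'th doc
        for ordinal in ordinals_of(doc_id):
            counter[ordinal] += 1
    # over-ask (guess_x larger than the wanted top-X) to raise the chance
    # that the true top terms are among the guesses
    return heapq.nlargest(guess_x, range(num_ordinals), key=lambda o: counter[o])
```

Per the editor's notes, the deck asks for top-100 when top-25 is wanted, and because the guesses are fine-counted afterwards, the returned counts are guaranteed correct even when the guess misses a term.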
  • 54. 54/55 10 seconds < 8 minutes
  • 55. 55/55 Never enough time, but talk to me about  Threaded counting  Monotonically increasing tracker for n-plane-z  Regexp filtering  Fine count skipping  Counter capping
  • 56. 56/55 Extra info. The techniques presented can be tested with sparse faceting, available as a plug-in replacement WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be implemented, but the timeframe is unknown. There are no current plans for incorporating the full feature set in the official Solr distribution. The suggested approach for incorporation is to split it into multiple independent or semi-independent features, starting with those applicable to most people, such as the distributed faceting fine-count optimization. In-depth descriptions and performance tests of the different features can be found at https://sbdevel.wordpress.com.
  • 57. 57/55 18M documents / 50GB, facet on 5 fields (2*10M values, 3*smaller)
  • 58. 58/55 6 billion docs / 20TB, 25 shards, single machine facet on 6 fields (1*4000M, 2*20M, 3*smaller)
  • 59. 59/55 7 billion docs / 23TB, 25 shards, single machine facet on 5 fields (2*20M, 3*smaller)

Editor's Notes

  1. “Solr at Scale for Time-Oriented data, Rocana” covers just about all of this, only nicer. Tika is the heavy part: 90% of indexing CPU power goes into Tika analysis.
  2. Static & optimized shards. No replicas (but we do have backup). Rarely more than 1 concurrent user.
  3. Standard JRE 1.7 garbage collector – no tuning. Full GC means delay for the client. Standard GC means higher CPU load.
  4. Some info on JSON Faceting API and reusing at http://yonik.com/facet-performance/ The pool is responsible for cleaning the counter Counter cleaning is a background thread NOTE: Was I wrong about JSON faceting reuse?
  5. Note: It always takes at least 500ms in this test
  6. Note: It always takes at least 500ms in this test
  7. This scenario represents the heaviest faceting feature set we are currently willing to run on our net search. Fortunately, more than 1 concurrent search is rare in the standard scenario. Our established upper acceptable response time is 2 seconds (median), with no defined worst-case limit.
  8. Faceting on the links field requires 60GB of heap per concurrent call. While this might be technically feasible for our setup, it would leave very little memory available for disk cache.
  9. Not the true minimum, as we round up to nearest power of 2 minus 1
  10. Blue squares are overflow bits. Finding the index for the term in a higher plane is done by counting the number of overflow bits. Fortunately this can be done with a rank function (~3% memory overhead) in constant time. The standard tracker is not used, as it would require more heap than the counter structure itself. Instead a bitmap counter structure is used (1/64 overhead). Details about this counter structure is not part of this presentation.
  11. n-plane-z uses a little less than 2x theoretical min Multiple n-plane-z shares overflow-bits, so extra concurrent counters takes up only slightly more than the theoretical minimum amount of heap.
  12. Fine counting could be replaced with multiplication by 1/sampling_factor
  13. We want top-25, but ask for top-100 to raise the chances of getting the right terms Counts are guaranteed to be correct
  14. Bonus slide 1 Graphs from production core library search (books, articles etc) logs. Logs are taken from same week day, for 4 weeks. Blue, pink and green are response times with vanilla Solr. Orange is with sparse faceting.
  15. Bonus slide: The effect of artificially reducing the amount of memory available for disk caching. Reducing this below 50GB has severe performance implications. Morale: SSD allows for very low relative disk cache, but do not count on the performance relative to disk cache to be linear.
  16. Bonus slide. Performance of search with multiple concurrent users. Note that the large URL field is not part of faceting. This slide demonstrates performance for a more “normal” search situation on a machine with a relative small amount of disk cache.