Multiple time and space enhancements for high cardinality Solr String faceting. Presented at Berlin Buzzwords 2015.
Code available at http://tokee.github.io/lucene-solr/
Solr sparse faceting
Everything counts in large amounts
@TokeEskildsen
State and University Library, Denmark
https://tokee.github.io/lucene-solr/
Nothing To Fear
● 500TB+ web resources from Danish Net Archive
● Estimated 50TB Solr index data when finished
● 3 machines, each with 16 CPU cores, 256GB RAM, 25 * 900GB SSD
– Each machine holds: 25 Solrs
● Each Solr holds: 1 optimized shard with 900GB / 250M docs
● Shards are built externally, one at a time
● (Optimizations also relevant for smaller setups)
Pipeline
counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
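For reference, a minimal, self-contained Java sketch of the three steps above (hypothetical names, not Solr's actual classes); the ordinals per document are assumed to be precomputed:

import java.util.Comparator;
import java.util.PriorityQueue;

/** Minimal sketch of the vanilla pipeline: allocate, update, extract top-X. */
public class VanillaFacetSketch {
  static PriorityQueue<int[]> facet(int[] resultDocIDs, int[][] ordinalsPerDoc,
                                    int ordinalCount, int limit) {
    int[] counter = new int[ordinalCount];                      // 1) allocate

    for (int docID : resultDocIDs) {                            // 2) update
      for (int ordinal : ordinalsPerDoc[docID]) {
        counter[ordinal]++;
      }
    }

    // 3) extract top-X: every counter is visited, so this step is O(#unique terms)
    PriorityQueue<int[]> topX = new PriorityQueue<int[]>(limit + 1, new Comparator<int[]>() {
      public int compare(int[] a, int[] b) {
        return Integer.compare(a[1], b[1]);                     // min-heap on count
      }
    });
    for (int ordinal = 0; ordinal < counter.length; ordinal++) {
      topX.add(new int[]{ordinal, counter[ordinal]});
      if (topX.size() > limit) {
        topX.poll();                                            // evict the smallest entry
      }
    }
    return topX;                                                // [ordinal, count] pairs
  }
}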
Counting
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
Get the Balance Right
Phase 1) All shards perform faceting
The Merger calculates top-X terms
Phase 2) The term counts are requested from the shards that did not return them in phase 1
for term: query.getTerms()
  result.add(term, searcher.numDocs(
      query(field:term), base.getDocIDs()).hitCount)
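A hedged Java sketch of that phase-2 fine count, assuming a Solr 4.x SolrIndexSearcher and the base DocSet of the query are available (numDocs intersects the term query with the base set):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

/** Sketch of naive phase-2 fine counting: one query per requested term. */
public class FineCountSketch {
  static Map<String, Integer> fineCount(SolrIndexSearcher searcher, DocSet baseDocs,
                                        String field, List<String> terms) throws IOException {
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    for (String term : terms) {
      // Count the documents matching field:term within the base result set
      counts.put(term, searcher.numDocs(new TermQuery(new Term(field, term)), baseDocs));
    }
    return counts;
  }
}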
Alternative fine counting
counter = pool.getCounter()               |
for docID: result.getDocIDs()             | Same as phase 1
  for ordinal: getOrdinals(docID)         |
    counter.increment(ordinal)            |
for term: query.getTerms()
  result.add(term, counter.get(getOrdinal(term)))
Highly scientific.
Very happy with Solr faceting for general work.
Principles primarily, applicable to non-Solr settings.
State & University + Royal Library.
Harvest 4 times/year.
Hardware: Save money.
Static shards.
Generation 1: 25 shards.
Generation 2: Currently 9 shards. We'll use this.
Apologies for small sample.
100GB free cache mem (0.5%-1.23% of index size)
Danish Net Archive: http://netarkivet.dk/in-english/
Next: We need aggregations, so faceting. String faceting is the focus here.
Bird's eye view of vanilla Solr fc-faceting.
1) Allocate counter
2) Update counter
3) Extract top-X
DocValues all the way. Works really well for a few million values.
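For reference, the "for ordinal: getOrdinals(docID)" loop corresponds roughly to reading SortedSetDocValues; a sketch against the Lucene 4.x API (per-segment details glossed over):

import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedSetDocValues;

/** Sketch of the ordinal loop using DocValues (Lucene 4.x API). */
public class OrdinalLoopSketch {
  static void countOrdinals(AtomicReader reader, String field, int[] docIDs, int[] counter)
      throws IOException {
    SortedSetDocValues ordinals = reader.getSortedSetDocValues(field);
    if (ordinals == null) {
      return;                                      // no DocValues for the field
    }
    for (int docID : docIDs) {
      ordinals.setDocument(docID);                 // position on the document
      long ordinal;
      while ((ordinal = ordinals.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
        counter[(int) ordinal]++;                  // one increment per referenced term
      }
    }
  }
}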
Next: Performance: 200M values.
Not time series, but bucketed by result size.
Most real-world results in the lower end.
Non-warmed, random queries, limit=25.
3 concurrent searches.
200M values.
Note the high variance for same-size result sets.
Standard JRE 1.7 garbage collector – no tuning.
Full GC means delay for the client.
Standard GC means higher CPU load.
Upon release, the counter is cleared by the pool. Faster than re-allocating.
Optionally the clearing is done in the background by a dedicated thread.
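A hypothetical sketch of such a pool (not the actual implementation): released counters go to a dirty queue and a daemon thread clears them before they are handed out again:

import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of a counter pool with background clearing. */
public class CounterPoolSketch {
  private final BlockingQueue<int[]> ready;   // cleared counters, ready for use
  private final BlockingQueue<int[]> dirty;   // released counters, awaiting clearing

  CounterPoolSketch(int poolSize, int ordinalCount) {
    ready = new ArrayBlockingQueue<int[]>(poolSize);
    dirty = new ArrayBlockingQueue<int[]>(poolSize);
    for (int i = 0; i < poolSize; i++) {
      ready.add(new int[ordinalCount]);
    }
    Thread cleaner = new Thread(new Runnable() {  // background clearing thread
      public void run() {
        try {
          while (true) {
            int[] counter = dirty.take();
            Arrays.fill(counter, 0);              // clearing beats re-allocating + GC
            ready.put(counter);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    cleaner.setDaemon(true);
    cleaner.start();
  }

  int[] getCounter() throws InterruptedException {
    return ready.take();     // blocks if all counters are in use or being cleared
  }

  void release(int[] counter) {
    dirty.add(counter);
  }
}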
Without and with reuse of counters.
Probably possible to reduce full GC on reuse.
With and without reuse of counters.
The last we see of vanilla Solr (the tricks are stacked)
Quartile plot: the box spans the 25% to 75% quartiles, with the median marked.
Lower median with reuse.
Smaller variance with reuse.
Downside: Static overhead.
Next: Why the 500ms lower bound?
Vanilla Solr extraction of top-X terms scales linearly with the number of unique terms.
Previous chart: 200M terms = 200M counters.
Next: Track the changes.
No change as ord 3 has already been incremented.
Problem: Random memory access.
Slower beyond ~8% of the counters touched (so sparse tracking is turned off automatically).
Non-sparse & sparse.
Downside: 8% overhead.
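A hypothetical sketch of the sparse idea: remember which ordinals were touched so that extraction and clearing only visit those, and fall back to full sweeps when more than the threshold (~8%) have been touched:

/** Hypothetical sketch of a sparse counter that tracks touched ordinals. */
public class SparseCounterSketch {
  private final int[] counts;
  private final int[] touched;      // ordinals with a non-zero count
  private final int maxTouched;     // beyond this (~8%), tracking is abandoned
  private int touchedCount = 0;
  private boolean sparse = true;

  SparseCounterSketch(int ordinalCount, double sparseLimitFraction) {
    counts = new int[ordinalCount];
    maxTouched = (int) (ordinalCount * sparseLimitFraction);
    touched = new int[maxTouched];
  }

  void increment(int ordinal) {
    if (counts[ordinal]++ == 0 && sparse) {       // first time this ordinal is seen
      if (touchedCount < maxTouched) {
        touched[touchedCount++] = ordinal;
      } else {
        sparse = false;                           // too many touched: use full sweeps
      }
    }
  }

  /** Extraction and clearing visit only the touched ordinals when sparse. */
  void clear() {
    if (sparse) {
      for (int i = 0; i < touchedCount; i++) {
        counts[touched[i]] = 0;
      }
    } else {
      java.util.Arrays.fill(counts, 0);
    }
    touchedCount = 0;
    sparse = true;
  }
}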
Next: Few Solrs stand alone - SolrCloud.
With distributed faceting, fine counting is necessary.
Pit of Pain coined by Andy Jackson, UKWA.
1K-10K very high, due to the highest chance of fine-counting.
< 100: Every shard probably returned all known terms.
> 100K: Most of the terms in top-X were probably returned by most shards.
10M+: True load starts to show (rare).
Note: 1M cardinality field.
We could just skip fine-counting, but we don't want to cheat that much.
Next: What to do about it?
Simple strategy: As phase 1 was fast, just repeat it and lookup the specific terms.
Since we already did the counting in phase 1, we&apos;ll just re-use the filled counter structure.
Cleanup problem handled by background thread.
Yes, the re-use and sparse multipliers do stack.
Downside: Not much. Maybe large result sets?
Next: From performance to memory: Counter representation.
A typical shard from Net Archive Search v2.
References are from documents to terms.
High-cardinality in different ways.
We want to facet on url and links to visualize connectivity hotspots.
Next: All counters are not equal.
Common theme: Many with max count 1.
We flip the visualization.
Squares are counter-bits needed.
int[ordinals] uses 4 bytes/counter.
PackedInts uses as many bits per value as are needed to represent the highest value of any counter. This works well for fields with a low max count (such as url), but not so well for fields with a high max count (domain and links).
Note: The int[ordinals] is not to scale. It really goes up to 32.
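A minimal sketch of the packed variant, assuming Lucene's PackedInts API (4.x) and that the highest document frequency in the field is known up-front:

import org.apache.lucene.util.packed.PackedInts;

/** Sketch: packed counters sized to the highest possible count in the field. */
public class PackedCounterSketch {
  static PackedInts.Mutable createCounters(int ordinalCount, long highestMaxCount) {
    int bitsPerValue = PackedInts.bitsRequired(highestMaxCount);  // e.g. 1 bit if max count is 1
    return PackedInts.getMutable(ordinalCount, bitsPerValue, PackedInts.COMPACT);
  }

  static void increment(PackedInts.Mutable counters, int ordinal) {
    counters.set(ordinal, counters.get(ordinal) + 1);             // read-modify-write
  }
}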
Black squares are counter value bits.
Blue squares signify that the counter they belong to is so large that it continues on the next plane. To find the position on the next plane, all previous blue squares must be counted, which requires a fast rank function (3% overhead).
Creation is slow! Full term-iteration.
Amortized plane accesses per increment in n-plane: 2 (1 + 50% chance of p2 + 25% chance of p3 + ...). Each plane shift requires a rank lookup and an overflow-bit lookup: 2 memory reads.
Details at https://sbdevel.wordpress.com/2015/03/13/n-plane-packed-counters-for-faceting/
Start with 0.
Set first bit. No overflow -> no problem.
Plane 1 overflows for L.
Count blue overflow bits: There are 5.
At position 5 in Plane 2, update the bits.
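A much-simplified sketch of that increment path, assuming every plane holds a single value bit per entry plus an overflow bit, and using a naive rank (the real structure caches block counts to make rank fast, the ~3% overhead mentioned above):

import java.util.BitSet;

/** Simplified n-plane sketch: overflow bits are set up-front during creation. */
public class NPlaneSketch {
  static class Plane {
    final BitSet valueBits;     // one low bit of each counter
    final BitSet overflowBits;  // set if the counter continues on the next plane
    Plane next;                 // null for the last plane, which never carries

    Plane(int entries) {
      valueBits = new BitSet(entries);
      overflowBits = new BitSet(entries);
    }

    /** Overflowing entries before index = the counter's position on the next plane. */
    int rank(int index) {
      int rank = 0;
      for (int i = 0; i < index; i++) {
        if (overflowBits.get(i)) {
          rank++;
        }
      }
      return rank;
    }

    void increment(int index) {
      if (!valueBits.get(index)) {   // 0 -> 1: no carry needed
        valueBits.set(index);
        return;
      }
      // 1 -> 0 with carry: the overflow bit was set at creation time,
      // so the counter is guaranteed an entry on the next plane
      valueBits.clear(index);
      next.increment(rank(index));
    }
  }
}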
This works very well as most counters have a max of 1.
N-plane-z instantiations beyond the first re-use the overflow bits.
Lowest possible: 0.7MB (15%), 27.9MB (4%), 132.3MB (6%)
Single shard.
640M values.
Obvious speed degradation that follows space savings.
Insignificant with small result sets due to sparse.
TODO: Worst case (*:*)
Chart legend: array, packed, n-plane-z.
Next: Scale it up.
I can't see how this could possibly go wrong.
Worst case: About 50 billion references to 9 billion values must be processed. (10 minutes)
Two strategies present themselves: threading for exact counts, and sampling. Threading is currently implemented.
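A hypothetical sketch of the threading strategy: split the result documents between threads, let each thread update its own counter, and merge at the end (the real counters would be pooled or packed rather than plain int[]):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: threaded counting with per-thread counters, merged at the end. */
public class ThreadedCountSketch {
  static int[] count(final int[] docIDs, final int[][] ordinalsPerDoc,
                     final int ordinalCount, int threads) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    List<Future<int[]>> parts = new ArrayList<Future<int[]>>();
    final int chunk = (docIDs.length + threads - 1) / threads;

    for (int t = 0; t < threads; t++) {
      final int from = t * chunk;
      final int to = Math.min(docIDs.length, from + chunk);
      parts.add(executor.submit(new Callable<int[]>() {
        public int[] call() {
          int[] counter = new int[ordinalCount];       // private counter per thread
          for (int i = from; i < to; i++) {
            for (int ordinal : ordinalsPerDoc[docIDs[i]]) {
              counter[ordinal]++;
            }
          }
          return counter;
        }
      }));
    }

    int[] merged = new int[ordinalCount];              // merge the partial counts
    for (Future<int[]> part : parts) {
      int[] counter = part.get();
      for (int ordinal = 0; ordinal < ordinalCount; ordinal++) {
        merged[ordinal] += counter[ordinal];
      }
    }
    executor.shutdown();
    return merged;
  }
}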
Screenshot taken immediately after kitchen sink test.
Available on GitHub.
Currently Solr 4.8, promise to go for 4.10.4 Real Soon Now.
5.x later.