SlideShare a Scribd company logo
1 of 47
1/47
Solr sparse faceting
Everything counts in large amounts
@TokeEskildsen
State and University Library, Denmark
https://tokee.github.io/lucene-solr/
2/47
Presentation tech level
3/47
Nothing To Fear
● 500TB+ web resources from Danish Net Archive
● Estimated 50TB Solr index data when finished
● 3 machines of 16 CPU cores, 256GB RAM, 25 * 900GB SSD
– Each machine holds: 25 Solrs
● Each Solr holds: 1 optimized shard with 900GB / 250M docs
● Shards build externally, one at a time
● (Optimizations also relevant for smaller setups)
4/47
Pipeline
counter = new int[ordinals]
for docID: result.getDocIDs()
for ordinal: getOrdinals(docID)
counter[ordinal]++
for ordinal = 0 ; ordinal < counters.length ; ordinal++
priorityQueue.add(ordinal, counter[ordinal])
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
5/47
6/47
7/47
Recycle
counter = pool.getCounter()
for docID: result.getDocIDs()
for ordinal: getOrdinals(docID)
counter[ordinal]++
for ordinal = 0 ; ordinal < counters.length ; ordinal++
priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
8/47
9/47
10/47
11/47
Counting
counter = pool.getCounter()
for docID: result.getDocIDs()
for ordinal: getOrdinals(docID)
counter[ordinal]++
for ordinal = 0 ; ordinal < counters.length ; ordinal++
priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
12/47
ord counter
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
tracker
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
13/47
ord counter
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 0
tracker
3
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
14/47
ord counter
0 0
1 1
2 0
3 1
4 0
5 0
6 0
7 0
8 0
tracker
3
1
N/A
N/A
N/A
N/A
N/A
N/A
N/A
15/47
ord counter
0 0
1 1
2 0
3 2
4 0
5 0
6 0
7 0
8 0
tracker
3
1
N/A
N/A
N/A
N/A
N/A
N/A
N/A
16/47
ord counter
0 0
1 3
2 0
3 1006
4 1
5 1
6 0
7 0
8 3
tracker
3
1
8
4
5
N/A
N/A
N/A
N/A
17/47
counter = pool.getCounter()
for docID: result.getDocIDs()
for ordinal: getOrdinals(docID)
if counter[ordinal]++ == 0 && tracked < maxTracked
tracker[tracked++] = ordinal
if tracked < maxTracked
for i = 0 ; i < tracked ; i++
priorityQueue.add(tracker[i], counter[tracker[i]])
else
for ordinal = 0 ; ordinal < counter.length ; ordinal++
priorityQueue.add(ordinal, counter[ordinal])
Sparse counting
ord counter
0 0
1 3
2 0
3 1006
4 1
5 1
6 0
7 0
8 3
tracker
3
1
8
4
5
N/A
N/A
N/A
N/A
18/47
19/47
Get the Balance Right
Phase 1) All shards perform faceting
The Merger calculates top-X terms
Phase 2) The term counts are requested from the shards
that did not return them in phase 1
for term: query.getTerms()
result.add(term, searcher.numDocs(
query(field:term), base.getDocIDs()
).hitCount)
20/47
Pit of Pain™
21/47
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
for ordinal: getOrdinals(docID)
counter.increment(ordinal)
for term: query.getTerms()
result.add(term, counter.get(getOrdinal(term)))
}Same as phase 1
22/47
Stripped
counter = pool.getCounter(key)
for term: query.getTerms()
result.add(term, counter.get(getOrdinal(term)))
23/47
Plain of Platitude™
24/47
250,000,000 docs / 900GB, optimized
Field References Max docs/term Terms
domain 250,000,000 3,000,000 1,100,000
url 250,000,000 56,000 200,000,000
links 5,800,000,000 5,000,000 610,000,000
25/47
Term distributions
domain 1.1M url 200M links 600M
26/47
Clean
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:26:18 2015
27/47
World Full Of Nothing
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 14:55:25 2015
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 14:56:08 2015
domain: 4 MB
url: 780 MB
links: 2350 MB
int[ordinals] PackedInts(ordinals, maxBPV)
domain: 3 MB (72%)
url: 420 MB (53%)
links: 1760 MB (75%)
28/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:26:18 2015
Platonic ideal Harsh reality
Plane 4
Plane 3
Plane 2
Plane 1
Construction Time Again
29/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
30/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
31/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
32/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011
33/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011
L: 4 ≣ 000100
L: 5 ≣ 000101
L: 6 ≣ 000110
L: 7 ≣ 000111
...
L: 12 ≣ 001100
34/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011
L: 4 ≣ 000100
L: 5 ≣ 000101
L: 6 ≣ 000110
L: 7 ≣ 000111
...
L: 12 ≣ 001100
if counter[ordinal]++ == 0 && tracked < maxTracked
tracker[tracked++] = ordinal
?
35/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Now This is Fun
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:38:49 2015
Plane 1
Plane 2
Plane 3
Plane 4
36/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
37/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
38/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
39/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
40/47
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111
41/47
The Bottom Line
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 22:39:03 2015
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 14:55:25 2015
Title:/home/te/Dropbox/sb/net/pack_tra
Creator:Dia v0.97.2
CreationDate:Wed May 20 14:56:08 2015
domain: 4 MB
url: 780 MB
links: 2350 MB
domain: 3 MB (72%)
url: 420 MB (53%)
links: 1760 MB (75%)
domain: 1 MB (30%)
url: 66 MB ( 8%)
links: 311 MB (13%)
int[ordinals] PackedInts(ordinals, maxBPV) N-plane-z
42/47
43/47
Kitchen sink
250,000,000 docs / 900GB, optimized
Field References Max docs/term Terms
domain 250,000,000 3,000,000 1,100,000
url 250,000,000 56,000 200,000,000
links 5,800,000,000 5,000,000 610,000,000
x 9 shards
x 3 concurrent requests
44/47
Shouldn't Have Done That
45/47
Some Great Reward
8GB heap per
900GB shard
46/47
Dream On
● Threaded counting
● Monotonically increasing tracker for nplane-z
● Regexp filtering
● Fine count skipping
● Counter capping
47/47
Nothing's Impossible

More Related Content

What's hot

Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Yandex
 
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Databricks
 
Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...DefCamp
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
Kernel-Level Programming: Entering Ring Naught
Kernel-Level Programming: Entering Ring NaughtKernel-Level Programming: Entering Ring Naught
Kernel-Level Programming: Entering Ring NaughtDavid Evans
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)David Evans
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is comingNati Cohen
 
RxJava In Baby Steps
RxJava In Baby StepsRxJava In Baby Steps
RxJava In Baby StepsAnnyce Davis
 
The Ring programming language version 1.5.4 book - Part 26 of 185
The Ring programming language version 1.5.4 book - Part 26 of 185The Ring programming language version 1.5.4 book - Part 26 of 185
The Ring programming language version 1.5.4 book - Part 26 of 185Mahmoud Samir Fayed
 
DAW: Duplicate-AWare Federated Query Processing over the Web of Data
DAW: Duplicate-AWare Federated Query Processing over the Web of DataDAW: Duplicate-AWare Federated Query Processing over the Web of Data
DAW: Duplicate-AWare Federated Query Processing over the Web of DataMuhammad Saleem
 
The Ring programming language version 1.8 book - Part 30 of 202
The Ring programming language version 1.8 book - Part 30 of 202The Ring programming language version 1.8 book - Part 30 of 202
The Ring programming language version 1.8 book - Part 30 of 202Mahmoud Samir Fayed
 
Verification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsVerification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsMykola Novik
 
Getting started cpp full
Getting started cpp   fullGetting started cpp   full
Getting started cpp fullVõ Hòa
 

What's hot (20)

Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace
 
The Internet
The InternetThe Internet
The Internet
 
Managing Memory
Managing MemoryManaging Memory
Managing Memory
 
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...
 
Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
Kernel-Level Programming: Entering Ring Naught
Kernel-Level Programming: Entering Ring NaughtKernel-Level Programming: Entering Ring Naught
Kernel-Level Programming: Entering Ring Naught
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)
 
7 tcp-congestion
7 tcp-congestion7 tcp-congestion
7 tcp-congestion
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is coming
 
RxJava In Baby Steps
RxJava In Baby StepsRxJava In Baby Steps
RxJava In Baby Steps
 
Wepwhacker !
Wepwhacker !Wepwhacker !
Wepwhacker !
 
The Ring programming language version 1.5.4 book - Part 26 of 185
The Ring programming language version 1.5.4 book - Part 26 of 185The Ring programming language version 1.5.4 book - Part 26 of 185
The Ring programming language version 1.5.4 book - Part 26 of 185
 
DAW: Duplicate-AWare Federated Query Processing over the Web of Data
DAW: Duplicate-AWare Federated Query Processing over the Web of DataDAW: Duplicate-AWare Federated Query Processing over the Web of Data
DAW: Duplicate-AWare Federated Query Processing over the Web of Data
 
Data recovery using pg_filedump
Data recovery using pg_filedumpData recovery using pg_filedump
Data recovery using pg_filedump
 
LCS35
LCS35LCS35
LCS35
 
4 transport-sharing
4 transport-sharing4 transport-sharing
4 transport-sharing
 
The Ring programming language version 1.8 book - Part 30 of 202
The Ring programming language version 1.8 book - Part 30 of 202The Ring programming language version 1.8 book - Part 30 of 202
The Ring programming language version 1.8 book - Part 30 of 202
 
Verification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsVerification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed Systems
 
Getting started cpp full
Getting started cpp   fullGetting started cpp   full
Getting started cpp full
 

Similar to Solr sparse faceting

Yevhen Tatarynov "From POC to High-Performance .NET applications"
Yevhen Tatarynov "From POC to High-Performance .NET applications"Yevhen Tatarynov "From POC to High-Performance .NET applications"
Yevhen Tatarynov "From POC to High-Performance .NET applications"LogeekNightUkraine
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESSubhajit Sahu
 
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...Lucidworks
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
An introduction to knitr and R Markdown
An introduction to knitr and R MarkdownAn introduction to knitr and R Markdown
An introduction to knitr and R Markdownsahirbhatnagar
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
 
r2con 2017 r2cLEMENCy
r2con 2017 r2cLEMENCyr2con 2017 r2cLEMENCy
r2con 2017 r2cLEMENCyRay Song
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in productionAndrew Nikishaev
 
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...Javed Barkatullah
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAmazon Web Services
 
Rubinius @ RubyAndRails2010
Rubinius @ RubyAndRails2010Rubinius @ RubyAndRails2010
Rubinius @ RubyAndRails2010Dirkjan Bussink
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018Codemotion
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for SolrToke Eskildsen
 
More Data, More Problems: Evolving big data machine learning pipelines with S...
More Data, More Problems: Evolving big data machine learning pipelines with S...More Data, More Problems: Evolving big data machine learning pipelines with S...
More Data, More Problems: Evolving big data machine learning pipelines with S...Alex Sadovsky
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Anne Nicolas
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFBrendan Gregg
 
Modern Linux Tracing Landscape
Modern Linux Tracing LandscapeModern Linux Tracing Landscape
Modern Linux Tracing LandscapeSasha Goldshtein
 

Similar to Solr sparse faceting (20)

Yevhen Tatarynov "From POC to High-Performance .NET applications"
Yevhen Tatarynov "From POC to High-Performance .NET applications"Yevhen Tatarynov "From POC to High-Performance .NET applications"
Yevhen Tatarynov "From POC to High-Performance .NET applications"
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTES
 
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers...
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
An introduction to knitr and R Markdown
An introduction to knitr and R MarkdownAn introduction to knitr and R Markdown
An introduction to knitr and R Markdown
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
 
r2con 2017 r2cLEMENCy
r2con 2017 r2cLEMENCyr2con 2017 r2cLEMENCy
r2con 2017 r2cLEMENCy
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in production
 
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
Rubinius @ RubyAndRails2010
Rubinius @ RubyAndRails2010Rubinius @ RubyAndRails2010
Rubinius @ RubyAndRails2010
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
lec16-memory.ppt
lec16-memory.pptlec16-memory.ppt
lec16-memory.ppt
 
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018
Kenneth Truyers - Using Git as a NoSql database - Codemotion Milan 2018
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for Solr
 
More Data, More Problems: Evolving big data machine learning pipelines with S...
More Data, More Problems: Evolving big data machine learning pipelines with S...More Data, More Problems: Evolving big data machine learning pipelines with S...
More Data, More Problems: Evolving big data machine learning pipelines with S...
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
 
Modern Linux Tracing Landscape
Modern Linux Tracing LandscapeModern Linux Tracing Landscape
Modern Linux Tracing Landscape
 

Recently uploaded

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Solr sparse faceting

Editor's Notes

  1. Highly scientific. Very happy with Solr faceting for general work. Principles primarily, applicable to non-Solr settings.
  2. State &amp; University + Royal Library. Harvest 4 times/year. Hardware: Save money. Static shards. Generation 1: 25 shards. Generation 2: Currently 9 shards. We&amp;apos;ll use this. Apologies for small sample. 100GB free cache mem (0.5%-1.23% of index size) Danish Net Archive: http://netarkivet.dk/in-english/ Next: We need aggregations, so faceting. String faceting is focus here.
  3. Bird&amp;apos;s eye view of vanilla Solr fc-faceting. 1) Allocate counter 2) Update counter 3) Extract top-X DocValues all the way. Works really well for a few million values. Next: Performance: 200M values.
  4. Not time series, but bucketed by result size. Most real-world results in the lower end. Non-warmed, random queries, limit=25. 3 concurrent searches. 200M values. Note the high variance for same-size result sets.
  5. Standard JRE 1.7 garbage collector – no tuning. Full GC means delay for the client. Standard GC means higher CPU load.
  6. Upon release, the counter is cleared by the pool. Faster than re-allocating. Optionally the clearing is done in the background by a dedicated thread.
  7. Without and with reuse of counters. Probably possible to reduce full GC on reuse.
  8. With and without reuse of counters. The last we see of vanilla Solr (the tricks are stacked)
  9. Quartiles plot. Box is 75%, median and 25%. Lower median with reuse. Smaller variance with reuse. Downside: Static overhead. Next: Why the 500ms lower bound?
  10. Vanilla Solr extraction of top-X terms scales linear to the number of unique terms. Previous chart: 200M terms = 200M counters. Next: Track the changes.
  11. No change as ord 3 has already been incremented.
  12. Problem: Random memory access. Slower beyond ~8% (so turn off automatically).
  13. Non-sparse &amp; sparse. Downside: 8% overhead. Next: Few Solr&amp;apos;s stand alone - SolrCloud.
  14. With distributed faceting, fine counting is necessary.
  15. Pit of Pain coined by Andy Jackson, UKWA. 1K-10K very high, due to the highest chance of fine-counting. &amp;lt; 100: Every shard probably returned all know terms. &amp;gt; 100K: Most of the terms in top-X probably returned by most shards. 10M+: True load starts to show (rare). Note: 1M cardinality field. We count just skip finecounting, but we don&amp;apos;t want to cheat that much. Next: What to do about it?
  16. Simple strategy: As phase 1 was fast, just repeat it and lookup the specific terms.
  17. Since we already did the counting in phase 1, we&amp;apos;ll just re-use the filled counter structure. Cleanup problem handled by background thread.
  18. Yes, re-use and sparse multipliers does stack. Downside: Not much. Maybe large result sets? Next: From performance to memory: Counter representation.
  19. A typical shard from Net Archive Search v2. References are from documents to terms. High-cardinality in different ways. We want to facet on url and links to visualize connectivity hotspots. Next: All counters are not equal.
  20. Common theme: Many with max count 1.
  21. We flip the visualization. Squares are counter-bits needed.
  22. int[ordinals] uses 4 bytes/counter. PackedInts uses as many bits per value as is needed to represent the highest value of any counter.This works well for fields with a low max count (such as url), but not so well for fields with a high max count (domain and links). Note: The int[ordinals] is not to scale. It really goes up to 32.
  23. Black squares are counter value bits. Blue squares signify if the counter it belongs to is so large that it continues on the next plane. All previous blue squares must be counted. This requires a fast rank function to work (3% overhead). Creation is slow! Full term-iteration. Amortized plane-accesses for increments in n-plane is 2 (1 + 50% chance of p2 + 25% chance of p3 + ...). Each plane-shift requires rank lookup and overflow-bit lookup: 2 memory reads Details at https://sbdevel.wordpress.com/2015/03/13/n-plane-packed-counters-for-faceting/
  24. Start with 0.
  25. Set first bit. No overflow -&amp;gt; no problem.
  26. Plane 1 overflows for L. Count blue overflow bits: There are 5. At position 5 in Plane 2, update the bits.
  27. This works very well as most counters have a max of 1.
  28. This works very well as most counters have a max of 1.
  29. This works very well as most counters have a max of 1.
  30. This works very well as most counters have a max of 1.
  31. This works very well as most counters have a max of 1.
  32. This works very well as most counters have a max of 1.
  33. N-plane-z instantiations beyond the first re-uses the overflow bits. Lowest possible: 0.7MB (15%), 27.9MB (4%), 132.3MB (6%)
  34. Single shard. 640M values. Obvious speed degradation that follows space savings. Insignificant with small result sets due to sparse. TODO: Worst case (*:*) array: packed: nplanez Next: Scale it up.
  35. I can&amp;apos;t see how this could possibly go wrong.
  36. Worst case: About 50 billion references to 9 billion values must be processed. (10 minutes) Two strategies presents themselves: Threading for exact counts and sampling. Threading is currently implemented.
  37. Screenshot taken immediately after kitchen sink test.
  38. Available on GitHub. Currently Solr 4.8, promise to go for 4.10.4 Real Soon Now. 5.x later.