SlideShare a Scribd company logo
ENGINEERING FAST INDEXES (DEEP DIVE)
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people
Roaring : Hybrid Model
A collection of containers...
array: sorted arrays ({1,20,144}) of packed 16‑bit integers
bitset: bitsets spanning 65536 bits or 1024 64‑bit words
run: sequences of runs ([0,10],[15,20])
2
Keeping track
E.g., a bitset with few 1s need to be converted back to array.
→ we need to keep track of the cardinality!
In Roaring, we do it automagically
3
Setting/Flipping/Clearing bits while keeping track
Important : avoid mispredicted branches
Pure C/Java:
q = p / 64
ow = w[ q ];
nw = ow | (1 << (p % 64) );
cardinality += (ow ^ nw) >> (p % 64) ; // EXTRA
w[ q ] = nw;
4
In x64 assembly with BMI instructions:
shrx %[6], %[p], %[q] // q = p / 64
mov (%[w],%[q],8), %[ow] // ow = w [q]
bts %[p], %[ow] // ow |= ( 1<< (p % 64)) + flag
sbb $-1, %[cardinality] // update card based on flag
mov %[load], (%[w],%[q],8) // w[q] = ow
 sbb is the extra work
5
For each operation
union
intersection
difference
...
Must specialize by container type:
array bitset run
array ? ? ?
bitset ? ? ?
run ? ? ?
6
High‑level API or Sipping Straw?
7
Bitset vs. Bitset...
Intersection:
First compute the cardinality of the result.
If low, use an array for the result (slow), otherwise generate
a bitset (fast).
Union: Always generate a bitset (fast).
(Unless cardinality is high then maybe create a run!)
We generally keep track of the cardinality of the result.
8
Cardinality of the result
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
We have 1024 calls to  Long.bitCount .
This counts the number of 1s in a 64‑bit word.
9
Population count in Java
// Hacker`s Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
Sounds expensive?
10
Population count in C
How do you think that the C compiler  clang compiles this code?
#include <stdint.h>
int count(uint64_t x) {
int v = 0;
while(x != 0) {
x &= x - 1;
v++;
}
return v;
}
11
Compile with  -O1 -march=native on a recent x64 machine:
popcnt rax, rdi
12
Why care for  popcnt ?
 popcnt : throughput of 1 instruction per cycle (recent Intel CPUs)
Really fast.
13
Population count in Java?
// Hacker`s Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
14
Population count in Java!
Also compiles to  popcnt if hardware supports it
$ java -XX:+PrintFlagsFinal
| grep UsePopCountInstruction
bool UsePopCountInstruction = true
But only if you call it from  Long.bitCount 
15
Java intrinsics
 Long.bitCount ,  Integer.bitCount 
 Integer.reverseBytes ,  Long.reverseBytes 
 Integer.numberOfLeadingZeros ,
 Long.numberOfLeadingZeros 
 Integer.numberOfTrailingZeros ,
 Long.numberOfTrailingZeros 
 System.arraycopy 
...
16
Cardinality of the intersection
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
A bit over ≈ 2 cycles per pair of 64‑bit words.
load A, load B
bitwise AND
 popcnt 
17
Take away
Bitset vs. Bitset operations are fast
even if you need to track the cardinality.
even in Java
e.g.,  popcnt overhead might be negligible compared to other costs
like cache misses.
18
Array vs. Array intersection
Always output an array. Use galloping O(m log n) if the sizes
differs a lot.
int intersect(A, B) {
if (A.length * 25 < B.length) {
return galloping(A,B);
} else if (B.length * 25 < A.length) {
return galloping(B,A);
} else {
return boring_intersection(A,B);
}
}
19
Galloping intersection
You have two arrays a small and a large one...
while (true) {
if (largeSet[k1] < smallSet[k2]) {
find k1 by binary search such that
largeSet[k1] >= smallSet[k2]
}
if (smallSet[k2] < largeSet[k1]) {
++k2;
} else {
// got a match! (smallSet[k2] == largeSet[k1])
}
}
If the small set is tiny, runs in O(log(size of big set))
20
Array vs. Array union
Union: If sum of cardinalities is large, go for a bitset. Revert to an
array if we got it wrong.
union (A,B) {
total = A.length + B.length;
if (total > DEFAULT_MAX_SIZE) {// bitmap?
create empty bitmap C and add both A and B to it
if (C.cardinality <= DEFAULT_MAX_SIZE) {
convert C to array
} else if (C is full) {
convert C to run
} else {
C is fine as a bitmap
}
}
otherwise merge two arrays and output array
}
21
Array vs. Bitmap (Intersection)...
Intersection: Always an array.
Branchy (3 to 16 cycles per array value):
answer = new array
for value in array {
if value in bitset {
append value to answer
}
}
22
Branchless (3 cycles per array value):
answer = new array
pos = 0
for value in array {
answer[pos] = value
pos += bit_value(bitset, value)
}
23
Array vs. Bitmap (Union)...
Always a bitset. Very fast. Few cycles per value in array.
answer = clone the bitset
for value in array { // branchless
set bit in answer at index value
}
Without tracking the cardinality ≈ 1.65 cycles per value
Tracking the cardinality ≈ 2.2 cycles per value
24
Parallelization is not just multicore + distributed
In practice, all commodity processors support Single instruction,
multiple data (SIMD) instructions.
Raspberry Pi
Your phone
Your PC
Working with words x × larger has the potential of multiplying the
performance by x.
No lock needed.
Purely deterministic/testable.
25
SIMD is not too hard conceptually
Instead of working with x + y you do
(x , x , x , x ) + (y , y , y , y ).
Alas: it is messy in actual code.
1 2 3 4 1 2 3 4
26
With SIMD small words help!
With scalar code, working on 16‑bit integers is not 2 × faster than
32‑bit integers.
But with SIMD instructions, going from 64‑bit integers to 16‑bit
integers can mean 4 × gain.
Roaring uses arrays of 16‑bit integers.
27
Bitsets are vectorizable
Logical ORs, ANDs, ANDNOTs, XORs can be computed fast with
Single instruction, multiple data (SIMD) instructions.
Intel Cannonlake (late 2017), AVX‑512
Operate on 64 bytes with ONE instruction
→ Several 512‑bit ops/cycle
Java 9's Hotspot can use AVX 512
ARM v8‑A to get Scalable Vector Extension...
up to 2048 bits!!!
28
Java supports advanced SIMD instructions
$ java -XX:+PrintFlagsFinal -version |grep "AVX"
intx UseAVX = 2
29
Vectorization matters!
for(size_t i = 0; i < len; i++) {
a[i] |= b[i];
}
using scalar : 1.5 cycles per byte
with AVX2 : 0.43 cycles per byte (3.5 × better)
With AVX‑512, the performance gap exceeds 5 ×
Can also vectorize OR, AND, ANDNOT, XOR + population count
(AVX2‑Harley‑Seal)
30
Vectorization beats  popcnt 
int count = 0;
for(size_t i = 0; i < len; i++) {
count += popcount(a[i]);
}
using fast scalar (popcnt): 1 cycle per input byte
using AVX2 Harley‑Seal: 0.5 cycles per input byte
even greater gain with AVX‑512
31
Sorted arrays
sorted arrays are vectorizable:
array union
array difference
array symmetric difference
array intersection
sorted arrays can be compressed with SIMD
32
Bitsets are vectorizable... sadly...
Java's hotspot is limited in what it can autovectorize:
1. Copying arrays
2. String.indexOf
3. ...
And it seems that  Unsafe effectively disables autovectorization!
33
There is hope yet for Java
One big reason, today, for binding closely to hardware is to
process wider data flows in SIMD modes. (And IMO this is a
long‑term trend towards right‑sizing data channel widths, as
hardware grows wider in various ways.) AVX bindings are where
we are experimenting, today
(John Rose, Oracle)
34
Fun things you can do with SIMD: Masked VByte
Consider the ubiquitous VByte format:
Use 1 byte to store all integers in [0, 2 )
Use 2 bytes to store all integers in [2 , 2 )
...
Decoding can become a bottleneck. Google developed Varint‑GB.
What if you are stuck with the conventional format? (E.g., Lucene,
LEB128, Protocol Buffers...)
7
7 14
35
Masked VByte
Joint work with J. Plaisance (Indeed.com) and N. Kurz.
http://maskedvbyte.org/
36
Go try it out!
Fully vectorized Roaring implementation (C/C++):
https://github.com/RoaringBitmap/CRoaring
Wrappers in Python, Go, Rust...
37

More Related Content

What's hot

Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++
Microsoft Developer Network (MSDN) - Belgium and Luxembourg
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in Java
Doug Hawkins
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic Languages
Tobias Lindaaker
 
[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent ProgrammingTobias Lindaaker
 
Virtual machine and javascript engine
Virtual machine and javascript engineVirtual machine and javascript engine
Virtual machine and javascript engine
Duoyi Wu
 
Grand Central Dispatch - iOS Conf SG 2015
Grand Central Dispatch - iOS Conf SG 2015Grand Central Dispatch - iOS Conf SG 2015
Grand Central Dispatch - iOS Conf SG 2015
Ben Asher
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012
aleks-f
 
Rust Workshop - NITC FOSSMEET 2017
Rust Workshop - NITC FOSSMEET 2017 Rust Workshop - NITC FOSSMEET 2017
Rust Workshop - NITC FOSSMEET 2017
pramode_ce
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
Yung-Yu Chen
 
bluespec talk
bluespec talkbluespec talk
bluespec talk
Suman Karumuri
 
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ScyllaDB
 
Introduction to Data Oriented Design
Introduction to Data Oriented DesignIntroduction to Data Oriented Design
Introduction to Data Oriented Design
Electronic Arts / DICE
 
Nicety of Java 8 Multithreading
Nicety of Java 8 MultithreadingNicety of Java 8 Multithreading
Nicety of Java 8 Multithreading
GlobalLogic Ukraine
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
Subhajit Sahu
 
A Step Towards Data Orientation
A Step Towards Data OrientationA Step Towards Data Orientation
A Step Towards Data Orientation
Electronic Arts / DICE
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013
aleks-f
 
Blocks & GCD
Blocks & GCDBlocks & GCD
Blocks & GCD
rsebbe
 
Facebook C++网络库Wangle调研
Facebook C++网络库Wangle调研Facebook C++网络库Wangle调研
Facebook C++网络库Wangle调研
vorfeed chen
 
Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...
Robert Schadek
 
RealmDB for Android
RealmDB for AndroidRealmDB for Android
RealmDB for Android
GlobalLogic Ukraine
 

What's hot (20)

Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in Java
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic Languages
 
[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming
 
Virtual machine and javascript engine
Virtual machine and javascript engineVirtual machine and javascript engine
Virtual machine and javascript engine
 
Grand Central Dispatch - iOS Conf SG 2015
Grand Central Dispatch - iOS Conf SG 2015Grand Central Dispatch - iOS Conf SG 2015
Grand Central Dispatch - iOS Conf SG 2015
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012
 
Rust Workshop - NITC FOSSMEET 2017
Rust Workshop - NITC FOSSMEET 2017 Rust Workshop - NITC FOSSMEET 2017
Rust Workshop - NITC FOSSMEET 2017
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
 
bluespec talk
bluespec talkbluespec talk
bluespec talk
 
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
ceph::errorator<> throw/catch-free, compile time-checked exceptions for seast...
 
Introduction to Data Oriented Design
Introduction to Data Oriented DesignIntroduction to Data Oriented Design
Introduction to Data Oriented Design
 
Nicety of Java 8 Multithreading
Nicety of Java 8 MultithreadingNicety of Java 8 Multithreading
Nicety of Java 8 Multithreading
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
A Step Towards Data Orientation
A Step Towards Data OrientationA Step Towards Data Orientation
A Step Towards Data Orientation
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013
 
Blocks & GCD
Blocks & GCDBlocks & GCD
Blocks & GCD
 
Facebook C++网络库Wangle调研
Facebook C++网络库Wangle调研Facebook C++网络库Wangle调研
Facebook C++网络库Wangle调研
 
Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...
 
RealmDB for Android
RealmDB for AndroidRealmDB for Android
RealmDB for Android
 

Similar to Engineering fast indexes (Deepdive)

Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
Daniel Lemire
 
Xgboost
XgboostXgboost
Ijmsr 2016-05
Ijmsr 2016-05Ijmsr 2016-05
Ijmsr 2016-05
ijmsr
 
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
Fwdays
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
A nice 64-bit error in C
A  nice 64-bit error in CA  nice 64-bit error in C
A nice 64-bit error in C
PVS-Studio
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
Satalia
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
Graham Wihlidal
 
PVS-Studio for Linux Went on a Tour Around Disney
PVS-Studio for Linux Went on a Tour Around DisneyPVS-Studio for Linux Went on a Tour Around Disney
PVS-Studio for Linux Went on a Tour Around Disney
PVS-Studio
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Daniel Lemire
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errors
PVS-Studio
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition Technique
Kumar Goud
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition Technique
Kumar Goud
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
Stefan Marr
 
Java Keeps Throttling Up!
Java Keeps Throttling Up!Java Keeps Throttling Up!
Java Keeps Throttling Up!
José Paumard
 
Vectorized VByte Decoding
Vectorized VByte DecodingVectorized VByte Decoding
Vectorized VByte Decoding
indeedeng
 
Quantum Computing Notes Ver 1.2
Quantum Computing Notes Ver 1.2Quantum Computing Notes Ver 1.2
Quantum Computing Notes Ver 1.2
Vijayananda Mohire
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 

Similar to Engineering fast indexes (Deepdive) (20)

Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Xgboost
XgboostXgboost
Xgboost
 
Ijmsr 2016-05
Ijmsr 2016-05Ijmsr 2016-05
Ijmsr 2016-05
 
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov"Optimization of a .NET application- is it simple ! / ?",  Yevhen Tatarynov
"Optimization of a .NET application- is it simple ! / ?", Yevhen Tatarynov
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A nice 64-bit error in C
A  nice 64-bit error in CA  nice 64-bit error in C
A nice 64-bit error in C
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
PVS-Studio for Linux Went on a Tour Around Disney
PVS-Studio for Linux Went on a Tour Around DisneyPVS-Studio for Linux Went on a Tour Around Disney
PVS-Studio for Linux Went on a Tour Around Disney
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
 
Lesson 24. Phantom errors
Lesson 24. Phantom errorsLesson 24. Phantom errors
Lesson 24. Phantom errors
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition Technique
 
Design of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition TechniqueDesign of QSD Number System Addition using Delayed Addition Technique
Design of QSD Number System Addition using Delayed Addition Technique
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
 
Java Keeps Throttling Up!
Java Keeps Throttling Up!Java Keeps Throttling Up!
Java Keeps Throttling Up!
 
Vectorized VByte Decoding
Vectorized VByte DecodingVectorized VByte Decoding
Vectorized VByte Decoding
 
Quantum Computing Notes Ver 1.2
Quantum Computing Notes Ver 1.2Quantum Computing Notes Ver 1.2
Quantum Computing Notes Ver 1.2
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 

More from Daniel Lemire

Accurate and efficient software microbenchmarks
Accurate and efficient software microbenchmarksAccurate and efficient software microbenchmarks
Accurate and efficient software microbenchmarks
Daniel Lemire
 
Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10 Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10
Daniel Lemire
 
Parsing JSON Really Quickly: Lessons Learned
Parsing JSON Really Quickly: Lessons LearnedParsing JSON Really Quickly: Lessons Learned
Parsing JSON Really Quickly: Lessons Learned
Daniel Lemire
 
Ingénierie de la performance au sein des mégadonnées
Ingénierie de la performance au sein des mégadonnéesIngénierie de la performance au sein des mégadonnées
Ingénierie de la performance au sein des mégadonnées
Daniel Lemire
 
SIMD Compression and the Intersection of Sorted Integers
SIMD Compression and the Intersection of Sorted IntegersSIMD Compression and the Intersection of Sorted Integers
SIMD Compression and the Intersection of Sorted Integers
Daniel Lemire
 
Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorizationDecoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization
Daniel Lemire
 
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
Daniel Lemire
 
MaskedVByte: SIMD-accelerated VByte
MaskedVByte: SIMD-accelerated VByteMaskedVByte: SIMD-accelerated VByte
MaskedVByte: SIMD-accelerated VByte
Daniel Lemire
 
Roaring Bitmaps (January 2016)
Roaring Bitmaps (January 2016)Roaring Bitmaps (January 2016)
Roaring Bitmaps (January 2016)
Daniel Lemire
 
Roaring Bitmap : June 2015 report
Roaring Bitmap : June 2015 reportRoaring Bitmap : June 2015 report
Roaring Bitmap : June 2015 report
Daniel Lemire
 
La vectorisation des algorithmes de compression
La vectorisation des algorithmes de compression La vectorisation des algorithmes de compression
La vectorisation des algorithmes de compression
Daniel Lemire
 
Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization  Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization
Daniel Lemire
 
Extracting, Transforming and Archiving Scientific Data
Extracting, Transforming and Archiving Scientific DataExtracting, Transforming and Archiving Scientific Data
Extracting, Transforming and Archiving Scientific Data
Daniel Lemire
 
Innovation without permission: from Codd to NoSQL
Innovation without permission: from Codd to NoSQLInnovation without permission: from Codd to NoSQL
Innovation without permission: from Codd to NoSQL
Daniel Lemire
 
Write good papers
Write good papersWrite good papers
Write good papers
Daniel Lemire
 
Faster Column-Oriented Indexes
Faster Column-Oriented IndexesFaster Column-Oriented Indexes
Faster Column-Oriented Indexes
Daniel Lemire
 
Compressing column-oriented indexes
Compressing column-oriented indexesCompressing column-oriented indexes
Compressing column-oriented indexes
Daniel Lemire
 
All About Bitmap Indexes... And Sorting Them
All About Bitmap Indexes... And Sorting ThemAll About Bitmap Indexes... And Sorting Them
All About Bitmap Indexes... And Sorting Them
Daniel Lemire
 
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAPA Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
Daniel Lemire
 

More from Daniel Lemire (20)

Accurate and efficient software microbenchmarks
Accurate and efficient software microbenchmarksAccurate and efficient software microbenchmarks
Accurate and efficient software microbenchmarks
 
Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10 Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10
 
Parsing JSON Really Quickly: Lessons Learned
Parsing JSON Really Quickly: Lessons LearnedParsing JSON Really Quickly: Lessons Learned
Parsing JSON Really Quickly: Lessons Learned
 
Ingénierie de la performance au sein des mégadonnées
Ingénierie de la performance au sein des mégadonnéesIngénierie de la performance au sein des mégadonnées
Ingénierie de la performance au sein des mégadonnées
 
SIMD Compression and the Intersection of Sorted Integers
SIMD Compression and the Intersection of Sorted IntegersSIMD Compression and the Intersection of Sorted Integers
SIMD Compression and the Intersection of Sorted Integers
 
Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorizationDecoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization
 
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...
 
MaskedVByte: SIMD-accelerated VByte
MaskedVByte: SIMD-accelerated VByteMaskedVByte: SIMD-accelerated VByte
MaskedVByte: SIMD-accelerated VByte
 
Roaring Bitmaps (January 2016)
Roaring Bitmaps (January 2016)Roaring Bitmaps (January 2016)
Roaring Bitmaps (January 2016)
 
Roaring Bitmap : June 2015 report
Roaring Bitmap : June 2015 reportRoaring Bitmap : June 2015 report
Roaring Bitmap : June 2015 report
 
La vectorisation des algorithmes de compression
La vectorisation des algorithmes de compression La vectorisation des algorithmes de compression
La vectorisation des algorithmes de compression
 
OLAP and more
OLAP and moreOLAP and more
OLAP and more
 
Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization  Decoding billions of integers per second through vectorization
Decoding billions of integers per second through vectorization
 
Extracting, Transforming and Archiving Scientific Data
Extracting, Transforming and Archiving Scientific DataExtracting, Transforming and Archiving Scientific Data
Extracting, Transforming and Archiving Scientific Data
 
Innovation without permission: from Codd to NoSQL
Innovation without permission: from Codd to NoSQLInnovation without permission: from Codd to NoSQL
Innovation without permission: from Codd to NoSQL
 
Write good papers
Write good papersWrite good papers
Write good papers
 
Faster Column-Oriented Indexes
Faster Column-Oriented IndexesFaster Column-Oriented Indexes
Faster Column-Oriented Indexes
 
Compressing column-oriented indexes
Compressing column-oriented indexesCompressing column-oriented indexes
Compressing column-oriented indexes
 
All About Bitmap Indexes... And Sorting Them
All About Bitmap Indexes... And Sorting ThemAll About Bitmap Indexes... And Sorting Them
All About Bitmap Indexes... And Sorting Them
 
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAPA Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
 

Recently uploaded

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

Engineering fast indexes (Deepdive)

  • 1. ENGINEERING FAST INDEXES (DEEP DIVE) Daniel Lemire https://lemire.me Joint work with lots of super smart people
  • 2. Roaring : Hybrid Model A collection of containers... array: sorted arrays ({1,20,144}) of packed 16‑bit integers bitset: bitsets spanning 65536 bits or 1024 64‑bit words run: sequences of runs ([0,10],[15,20]) 2
  • 3. Keeping track E.g., a bitset with few 1s need to be converted back to array. → we need to keep track of the cardinality! In Roaring, we do it automagically 3
  • 4. Setting/Flipping/Clearing bits while keeping track Important : avoid mispredicted branches Pure C/Java: q = p / 64 ow = w[ q ]; nw = ow | (1 << (p % 64) ); cardinality += (ow ^ nw) >> (p % 64) ; // EXTRA w[ q ] = nw; 4
  • 5. In x64 assembly with BMI instructions: shrx %[6], %[p], %[q] // q = p / 64 mov (%[w],%[q],8), %[ow] // ow = w [q] bts %[p], %[ow] // ow |= ( 1<< (p % 64)) + flag sbb $-1, %[cardinality] // update card based on flag mov %[load], (%[w],%[q],8) // w[q] = ow  sbb is the extra work 5
  • 6. For each operation union intersection difference ... Must specialize by container type: array bitset run array ? ? ? bitset ? ? ? run ? ? ? 6
  • 7. High‑level API or Sipping Straw? 7
  • 8. Bitset vs. Bitset... Intersection: First compute the cardinality of the result. If low, use an array for the result (slow), otherwise generate a bitset (fast). Union: Always generate a bitset (fast). (Unless cardinality is high then maybe create a run!) We generally keep track of the cardinality of the result. 8
  • 9. Cardinality of the result How fast does this code run? int c = 0; for (int k = 0; k < 1024; ++k) { c += Long.bitCount(A[k] & B[k]); } We have 1024 calls to  Long.bitCount . This counts the number of 1s in a 64‑bit word. 9
  • 10. Population count in Java // Hacker`s Delight int bitCount(long i) { // HD, Figure 5-14 i = i - ((i >>> 1) & 0x5555555555555555L); i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L); i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL; i = i + (i >>> 8); i = i + (i >>> 16); i = i + (i >>> 32); return (int)i & 0x7f; } Sounds expensive? 10
  • 11. Population count in C How do you think that the C compiler  clang compiles this code? #include <stdint.h> int count(uint64_t x) { int v = 0; while(x != 0) { x &= x - 1; v++; } return v; } 11
  • 12. Compile with  -O1 -march=native on a recent x64 machine: popcnt rax, rdi 12
  • 13. Why care for  popcnt ?  popcnt : throughput of 1 instruction per cycle (recent Intel CPUs) Really fast. 13
  • 14. Population count in Java? // Hacker`s Delight int bitCount(long i) { // HD, Figure 5-14 i = i - ((i >>> 1) & 0x5555555555555555L); i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L); i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL; i = i + (i >>> 8); i = i + (i >>> 16); i = i + (i >>> 32); return (int)i & 0x7f; } 14
  • 15. Population count in Java! Also compiles to  popcnt if hardware supports it $ java -XX:+PrintFlagsFinal | grep UsePopCountInstruction bool UsePopCountInstruction = true But only if you call it from  Long.bitCount  15
  • 16. Java intrinsics  Long.bitCount ,  Integer.bitCount   Integer.reverseBytes ,  Long.reverseBytes   Integer.numberOfLeadingZeros ,  Long.numberOfLeadingZeros   Integer.numberOfTrailingZeros ,  Long.numberOfTrailingZeros   System.arraycopy  ... 16
  • 17. Cardinality of the intersection How fast does this code run? int c = 0; for (int k = 0; k < 1024; ++k) { c += Long.bitCount(A[k] & B[k]); } A bit over ≈ 2 cycles per pair of 64‑bit words. load A, load B bitwise AND  popcnt  17
  • 18. Take away Bitset vs. Bitset operations are fast even if you need to track the cardinality. even in Java e.g.,  popcnt overhead might be negligible compared to other costs like cache misses. 18
  • 19. Array vs. Array intersection Always output an array. Use galloping O(m log n) if the sizes differs a lot. int intersect(A, B) { if (A.length * 25 < B.length) { return galloping(A,B); } else if (B.length * 25 < A.length) { return galloping(B,A); } else { return boring_intersection(A,B); } } 19
  • 20. Galloping intersection You have two arrays a small and a large one... while (true) { if (largeSet[k1] < smallSet[k2]) { find k1 by binary search such that largeSet[k1] >= smallSet[k2] } if (smallSet[k2] < largeSet[k1]) { ++k2; } else { // got a match! (smallSet[k2] == largeSet[k1]) } } If the small set is tiny, runs in O(log(size of big set)) 20
  • 21. Array vs. Array union Union: If sum of cardinalities is large, go for a bitset. Revert to an array if we got it wrong. union (A,B) { total = A.length + B.length; if (total > DEFAULT_MAX_SIZE) {// bitmap? create empty bitmap C and add both A and B to it if (C.cardinality <= DEFAULT_MAX_SIZE) { convert C to array } else if (C is full) { convert C to run } else { C is fine as a bitmap } } otherwise merge two arrays and output array } 21
  • 22. Array vs. Bitmap (Intersection)... Intersection: Always an array. Branchy (3 to 16 cycles per array value): answer = new array for value in array { if value in bitset { append value to answer } } 22
  • 23. Branchless (3 cycles per array value): answer = new array pos = 0 for value in array { answer[pos] = value pos += bit_value(bitset, value) } 23
  • 24. Array vs. Bitmap (Union)... Always a bitset. Very fast. Few cycles per value in array. answer = clone the bitset for value in array { // branchless set bit in answer at index value } Without tracking the cardinality ≈ 1.65 cycles per value Tracking the cardinality ≈ 2.2 cycles per value 24
  • 25. Parallelization is not just multicore + distributed In practice, all commodity processors support Single instruction, multiple data (SIMD) instructions. Raspberry Pi Your phone Your PC Working with words x × larger has the potential of multiplying the performance by x. No lock needed. Purely deterministic/testable. 25
  • 26. SIMD is not too hard conceptually Instead of working with x + y you do (x , x , x , x ) + (y , y , y , y ). Alas: it is messy in actual code. 1 2 3 4 1 2 3 4 26
  • 27. With SIMD small words help! With scalar code, working on 16‑bit integers is not 2 × faster than 32‑bit integers. But with SIMD instructions, going from 64‑bit integers to 16‑bit integers can mean 4 × gain. Roaring uses arrays of 16‑bit integers. 27
  • 28. Bitsets are vectorizable Logical ORs, ANDs, ANDNOTs, XORs can be computed fast with Single instruction, multiple data (SIMD) instructions. Intel Cannonlake (late 2017), AVX‑512 Operate on 64 bytes with ONE instruction → Several 512‑bit ops/cycle Java 9's Hotspot can use AVX 512 ARM v8‑A to get Scalable Vector Extension... up to 2048 bits!!! 28
  • 29. Java supports advanced SIMD instructions $ java -XX:+PrintFlagsFinal -version |grep "AVX" intx UseAVX = 2 29
  • 30. Vectorization matters! for(size_t i = 0; i < len; i++) { a[i] |= b[i]; } using scalar : 1.5 cycles per byte with AVX2 : 0.43 cycles per byte (3.5 × better) With AVX‑512, the performance gap exceeds 5 × Can also vectorize OR, AND, ANDNOT, XOR + population count (AVX2‑Harley‑Seal) 30
  • 31. Vectorization beats  popcnt  int count = 0; for(size_t i = 0; i < len; i++) { count += popcount(a[i]); } using fast scalar (popcnt): 1 cycle per input byte using AVX2 Harley‑Seal: 0.5 cycles per input byte even greater gain with AVX‑512 31
  • 32. Sorted arrays sorted arrays are vectorizable: array union array difference array symmetric difference array intersection sorted arrays can be compressed with SIMD 32
  • 33. Bitsets are vectorizable... sadly... Java's hotspot is limited in what it can autovectorize: 1. Copying arrays 2. String.indexOf 3. ... And it seems that  Unsafe effectively disables autovectorization! 33
  • 34. There is hope yet for Java One big reason, today, for binding closely to hardware is to process wider data flows in SIMD modes. (And IMO this is a long‑term trend towards right‑sizing data channel widths, as hardware grows wider in various ways.) AVX bindings are where we are experimenting, today (John Rose, Oracle) 34
  • 35. Fun things you can do with SIMD: Masked VByte Consider the ubiquitous VByte format: Use 1 byte to store all integers in [0, 2 ) Use 2 bytes to store all integers in [2 , 2 ) ... Decoding can become a bottleneck. Google developed Varint‑GB. What if you are stuck with the conventional format? (E.g., Lucene, LEB128, Protocol Buffers...) 7 7 14 35
  • 36. Masked VByte Joint work with J. Plaisance (Indeed.com) and N. Kurz. http://maskedvbyte.org/ 36
  • 37. Go try it out! Fully vectorized Roaring implementation (C/C++): https://github.com/RoaringBitmap/CRoaring Wrappers in Python, Go, Rust... 37