RedisConf18 - Implementing a New Data Structure for Redis
1. Implementing a New Data
Structure for Redis
as an Inexperienced C Programmer
2. Hi, I’m Loris
● BSc in CS (Bioinformatics)
● Some experience in Data Analysis
(Bio & Fin) and Distributed Systems
● Python, JS, lately Go, Swift, and C#
Not very proficient in C.
3. Goals of this presentation
● I want to give you a hopefully good picture of
what writing a Redis module is about
● I had a couple of good ideas during the
development, and I hope to pass them on
I will start with some concepts and the context
I’m coming from.
5. Probabilistic Data Structures: Filters
Probabilistic filters are high-speed, space-
efficient data structures that support set-
membership tests with a tweakable probability
of false positives.
“Probably yes”, “Definitely Not”
6. Why bother?
88 bytes * 600 million ≈ 53 GB
That’s the size of a set without counting any
overhead.
A corresponding filter, with a 3% error rate,
would need approximately 600MB.
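The figures above can be checked with a few lines of arithmetic. This is a rough sketch: the bits-per-item formulas are the usual textbook estimates for an optimally tuned Bloom filter and for a cuckoo filter with 4-slot buckets, and the 95% load factor is an assumption, not a number from the slides.

```python
import math

ITEMS = 600_000_000
BYTES_PER_ITEM = 88
ERROR_RATE = 0.03

# Plain set: raw payload only, no per-entry overhead.
set_bytes = ITEMS * BYTES_PER_ITEM

# Bloom filter with an optimal number of hash functions:
# -ln(e) / (ln 2)^2 bits per item.
bloom_bits_per_item = -math.log(ERROR_RATE) / (math.log(2) ** 2)
bloom_bytes = ITEMS * bloom_bits_per_item / 8

# Cuckoo filter with 4-slot buckets:
# (log2(1/e) + 3) / load_factor bits per item.
LOAD_FACTOR = 0.95  # assumed occupancy, not taken from the slides
cuckoo_bits_per_item = (math.log2(1 / ERROR_RATE) + 3) / LOAD_FACTOR
cuckoo_bytes = ITEMS * cuckoo_bits_per_item / 8

print(f"set:    {set_bytes / 1e9:.1f} GB")  # set:    52.8 GB
print(f"bloom:  {bloom_bytes / 1e6:.0f} MB")
print(f"cuckoo: {cuckoo_bytes / 1e6:.0f} MB")
```

Under these assumptions both filters land within a few tens of megabytes of the 600 MB quoted above, roughly two orders of magnitude below the plain set.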
13. Why not just use Rebloom?
RedisLabs’ Rebloom is an implementation of
Bloom filters for Redis
Unfortunately, we also needed to delete
items, which standard Bloom filters do not support.
16. Good Ideas
1. You’re probably going to build the Redis
version of something that already exists:
stand on the shoulders of giants
17. Reading a Data Structure Paper
● Don’t expect to understand everything the
first time you read it
● Start with a general overview, skipping the
tough formulas and reasoning
● End by crunching the paragraphs one by
one and finding an existing implementation
● Not all papers are good (relevant & comprehensible)
18. The Cuckoo Filters Paper
● Very Clear
● Has relevant formulas & benchmarks
● Links a C++ Implementation
● In my opinion, a great paper overall
19. Starting A New Redis Module
● gcc / clang
● Makefile
● redismodule.h
I won’t get too much into the practical details.
RedisLabs has a RedisModulesSDK repository on
GitHub which contains lots of goodies.
20. Module Architecture
● A function for each command you want to
implement in your module
● RedisModule_OnLoad()
RedisModule_OnLoad() is the entry point for your
module; its main purpose is to bind your functions
to the actual command strings (e.g. cf.add)
21. Inside a Module Function
● Read and reply to the request
● Use Redis Commands
● Access the memory directly
● Create your own data types
● Launch background threads
To do all these things there is a plethora of
babysitting functions in redismodule.h
22. Good Ideas
1. You’re probably going to build the Redis
version of something that already exists:
stand on the shoulders of giants
2. You might have the opportunity to come up
with new and innovative APIs:
question all design choices
23. Design Choices
The distributed version of a Data Structure
might have new interesting properties.
While there is huge value in keeping things
simple, sometimes it might be better to
leverage those properties with a new API
design.
24. An Example: Cuckoo Filters
Normally libraries opt to give a “Bloomy”
interface to Cuckoo Filters.
But internally Cuckoo filters work pretty
differently to Bloom Filters.
25. Cuckoo Filters: Usage
● A Cuckoo Filter is a hash table that stores only
a part of the original item: the fingerprint
● A fingerprint is usually 1, 2, or 4 bytes taken
from the original item (e.g. the first or last byte)
● Cuckoo filters rely on a single hashing* of the
original item, used to choose a slot in the hash
table in which to store the fingerprint
* I’m simplifying it a little bit
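To make the description above concrete, here is a minimal in-memory sketch. It is not the redis-cuckoofilter code: the class name, SHA-256 as the hash, the 1-byte fingerprint derived from a second hash, and the bucket sizes are all assumptions made for illustration. The part the footnote glosses over is that each item has two candidate buckets, with the second derived from the first and the fingerprint alone, so an evicted fingerprint can be moved without knowing the original item.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    """64-bit integer from SHA-256 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class TinyCuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count keeps the XOR trick below reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: bytes) -> int:
        # 1-byte fingerprint in 1..255 (assumed scheme, not the module's).
        return _h(b"fp:" + item) % 255 + 1

    def _index(self, item: bytes) -> int:
        return _h(item) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Applying this twice returns the original index.
        return (index ^ _h(bytes([fp]))) % self.num_buckets

    def add(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Gave up: the filter is treated as full. The last evicted
        # fingerprint is lost, which a real implementation must handle.
        return False

    def check(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


fruits = TinyCuckooFilter()
fruits.add(b"banana")
print(fruits.check(b"banana"))  # True
fruits.delete(b"banana")
print(fruits.check(b"banana"))  # False
```

Deletion works because the fingerprint itself is stored and can be removed again, which is what a Bloom filter's shared bits do not allow.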
26. Cuckoo Filter: Bloomy Interface
fruits.add("banana")
1. You give the whole item to the filter
2. The filter computes hash and fingerprint
(same applies to the checking function)
Works as a drop-in replacement for Bloom filters
27. The redis-cuckoofilter API
CF.ADD <key> <hash> <fp>
CF.ADD fruits 5366164415461 98
The user must choose hash() and fp(), and
compute the values on the client.
The values are encoded as integers because byte
arrays are hard to handle in some Redis clients.
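A client-side helper for this API could look like the sketch below. The function name, the use of SHA-256, and the choice of "first 8 bytes as hash, last byte as fingerprint" are all assumptions made for illustration. The slides leave the choice of hash() and fp() to the user and do not state which integer ranges the module accepts, so check the module's documentation before reusing this.

```python
import hashlib


def cf_add_args(key, item):
    """Build the argument list for CF.ADD <key> <hash> <fp> (hypothetical helper)."""
    digest = hashlib.sha256(item).digest()
    # Assumed hash(): the first 8 bytes of the digest as an unsigned integer.
    h = int.from_bytes(digest[:8], "big")
    # Assumed fp(): the last byte of the digest, giving a 1-byte fingerprint.
    fp = digest[-1]
    # Both values travel as decimal integers, as in the example above.
    return ["CF.ADD", key, str(h), str(fp)]


print(" ".join(cf_add_args("fruits", b"banana")))
```

Whatever the item's size, the client sends the same short command, and every client that agrees on hash() and fp() can share the same filter.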
28. Why bother?
1. Clients send a fixed number of bytes per
element over the wire.
2. Cuckoo Filters require a good choice of
fingerprint, which depends on the use case.
3. The filter becomes hashing-function
agnostic, which is good for performance
and interoperability.
30. Good Ideas
1. You’re probably going to build the Redis
version of something that already exists:
stand on the shoulders of giants
2. You might have the opportunity to come up
with new and innovative APIs:
question all design choices
3. Write fast tests, write slow tests
32. Testing a Redis Module
➔ The C compiler will betray you
◆ You will need to try out your module often
➔ You can’t just add a couple of items by hand
and be confident that everything is ok
◆ Your suite will need lots of cases
➔ “Lots” is not necessarily enough, especially if
your module has a probabilistic behaviour
◆ You can’t just test your module as a black box
35. Fast Tests for redis-cuckoofilter
1. Create 3 filters: 64k, 128k, 256k
2. Add 62k items and cf.check them
3. cf.check 124k items that were not inserted
4. Delete half of the 62k and cf.check them
5. Add the deleted items back in
6. cf.check everything again
1.2M operations: 4 s on an rMBP (Retina MacBook Pro)
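The six steps above can be sketched as a routine that runs against anything exposing add, check and delete. Everything here is illustrative: ExactStandIn is a set-backed placeholder rather than the module, "k" is read as thousands, the size is passed through opaquely, and the 3% bound is borrowed from the earlier sizing slide instead of the real test suite. If step 6 re-checks the 62k inserted items, the operation count comes out at the 1.2M quoted above.

```python
class ExactStandIn:
    """Set-backed placeholder with the filter's surface (never a false positive)."""

    def __init__(self, size):
        self.size = size  # kept only to mirror the three filter sizes
        self.items = set()

    def add(self, item):
        self.items.add(item)
        return True

    def check(self, item):
        return item in self.items

    def delete(self, item):
        self.items.discard(item)
        return True


def fast_test(make_filter, sizes=(64_000, 128_000, 256_000),
              n_items=62_000, max_fp_rate=0.03):
    present = [f"in-{i}" for i in range(n_items)]
    absent = [f"out-{i}" for i in range(2 * n_items)]
    half = present[: n_items // 2]
    ops = 0
    for size in sizes:
        f = make_filter(size)
        # Steps 1-2: add the items; none may be reported missing.
        for x in present:
            f.add(x)
        assert all(f.check(x) for x in present)
        # Step 3: items never inserted may only show up as rare false positives.
        false_positives = sum(f.check(x) for x in absent)
        assert false_positives / len(absent) <= max_fp_rate
        # Step 4: deleted items should be gone, again up to false positives.
        for x in half:
            f.delete(x)
        leftovers = sum(f.check(x) for x in half)
        assert leftovers / len(half) <= max_fp_rate
        # Steps 5-6: add them back and re-check the inserted items.
        for x in half:
            f.add(x)
        assert all(f.check(x) for x in present)
        ops += 3 * len(present) + len(absent) + 3 * len(half)
    return ops


print(fast_test(ExactStandIn))  # 1209000
```

Swapping ExactStandIn for a thin wrapper around a Redis client would turn the same routine into a test of the actual module.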
36. Slow Tests for redis-cuckoofilter
● Testing the result of cf.check is not good
enough proof that the filter can be trusted.
When you get a false positive, is it really
because of the properties of the filter?
● I don’t trust myself to have full control
over all the macroexpanded bit fiddling
that happens in my code
39. Slow Tests for redis-cuckoofilter
● The Python implementation and the C
module go through the same testing routine
● Commands that modify the state of the filter
are executed in lockstep: both
implementations execute the same
command and then test the full filter state
before proceeding to the next
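The lockstep idea can be sketched with two toy implementations of the same slot logic: a readable one, and a flat-buffer one standing in for the C module. All names here are hypothetical, eviction is left out to keep the sketch short, and how the real module exposes its full state is not shown in the slides, so state() is an assumption.

```python
import random

NUM_BUCKETS, SLOTS = 16, 4  # power-of-two bucket count, tiny on purpose


def alt(i, fp):
    # Second candidate bucket, derived from the first one and the fingerprint.
    return (i ^ (fp * 0x5BD1E995)) % NUM_BUCKETS


class ReferenceFilter:
    """Readable version: one list per bucket, 0 marks an empty slot."""

    def __init__(self):
        self.buckets = [[0] * SLOTS for _ in range(NUM_BUCKETS)]

    def _swap(self, h, fp, old, new):
        i1 = h % NUM_BUCKETS
        for i in (i1, alt(i1, fp)):
            if old in self.buckets[i]:
                self.buckets[i][self.buckets[i].index(old)] = new
                return True
        return False

    def add(self, h, fp):
        return self._swap(h, fp, 0, fp)

    def delete(self, h, fp):
        return self._swap(h, fp, fp, 0)

    def state(self):
        return [slot for bucket in self.buckets for slot in bucket]


class FlatFilter:
    """Stand-in for the C module: same logic over one flat byte buffer."""

    def __init__(self):
        self.mem = bytearray(NUM_BUCKETS * SLOTS)

    def _swap(self, h, fp, old, new):
        i1 = h % NUM_BUCKETS
        for i in (i1, alt(i1, fp)):
            for off in range(i * SLOTS, (i + 1) * SLOTS):
                if self.mem[off] == old:
                    self.mem[off] = new
                    return True
        return False

    def add(self, h, fp):
        return self._swap(h, fp, 0, fp)

    def delete(self, h, fp):
        return self._swap(h, fp, fp, 0)

    def state(self):
        return list(self.mem)


def lockstep(commands):
    ref, impl = ReferenceFilter(), FlatFilter()
    for step, (op, h, fp) in enumerate(commands):
        # Same command on both sides, then compare reply and full state.
        same_reply = getattr(ref, op)(h, fp) == getattr(impl, op)(h, fp)
        assert same_reply, f"step {step}: replies differ for {op}"
        assert ref.state() == impl.state(), f"step {step}: state diverged after {op}"
    return len(commands)


rng = random.Random(42)
commands, added = [], []
for _ in range(2000):
    if added and rng.random() < 0.4:
        commands.append(("delete", *added.pop(rng.randrange(len(added)))))
    else:
        pair = (rng.getrandbits(32), rng.randrange(1, 256))
        added.append(pair)
        commands.append(("add", *pair))

print(lockstep(commands))  # 2000
```

Comparing the whole state after every mutation pins a divergence to the exact command that caused it, instead of surfacing later as an unexplained false positive.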
40. Slow Tests for redis-cuckoofilter
● This test shows that the C implementation
behaves consistently with the Python one
● Since the Python implementation is much
easier to understand, I feel confident that
everything is working as intended
● This is not a formal proof though, and I might
have made a mistake in the Python code
41. To Recap
1. Good API design should be your main
concern. This is where you can shine.
2. Work smart, not hard: make use of what
already exists and keep fast tests at hand.
3. Don’t trust the compiler; don’t trust your
future self, but do find a way to build a
productive relationship with both.