Leveraging Probabilistic Data Structures for
Real Time Analytics with Redis Modules
Itamar Haber, Redis Labs
2
Who We Are
The open source home and commercial provider
of Redis
Open source. The leading in-memory database
platform, supporting any high performance
OLTP or OLAP use case.
Chief Developer Advocate at Redis Labs
itamar@redislabs.com
@itamarhaber
3
There are three kinds
of people in the world;
those who can count
and those who can’t.
About 10 Things About Redis
5
1. Redis: REmote DIctionary Server
2. /rɛdɪs/: “red-iss”
3. OSS: http://github.com/antirez/redis
4. 3-clause BSD license: http://redis.io
5. In-memory: (always) read from RAM
6. A database for: 5 data structures
7. And: 4 more specialized ones
6
8. Developed & maintained: (mostly)
Salvatore Sanfilippo (a.k.a. @antirez)
and his OSS team at @RedisLabs
9. Short history: v1.0 August 9th, 2009
… v3.2 May 6th, 2016
10. “The Leatherman™ of Databases”:
mostly used as a DB, cache & broker
7
11. A couple or so of extra features:
(a) atomicity; (b) transactions;
(c) configurable persistence;
(d) expiration; (e) eviction;
(f) PubSub; (g) Lua scripts;
(h) high availability; (i) clustering
12. Next version (v4.0): MODULES!
8
2..4 Reasons Why Redis Is A Must For Any
Data-Driven Real-Time Analytical Process
Simplicity Versatility Performance
“it is very fast”
Next 3 slides
+ ‘demo’
while(!eof)
9
Redis 101
1. Redis is “NoSQL”
0. No (explicit) schema, access by key
1. Key -> structure -> data
SIMPL-ICI-TY: simple, I see it, thank you
10
psql> SELECT * FROM redis.data_structures;
+-------------+------------------------------------+
| name        | description                        |
+-------------+------------------------------------+
| Strings | Any 0..512MB payload or number |
| Hashes | Unordered field-value String pairs |
| Lists | A doubly linked list of Strings |
| Sets | An unordered collection of Strings |
| Sorted Sets | Set members have a float score |
| Geo Sets | Sorted Set with geohash scores |
| Bit Arrays | String representing bits by offset |
| Bit Fields | Many 1..64bit integers in a String |
| HyperLogLog | PDS for estimating cardinality |
+-------------+------------------------------------+
11
Redis for Data Scientists steps:
1. 147 OSS clients (49 languages), e.g.:
Python, Java, Spark, R, Julia, Matlab
2. Make a request with the client, e.g.:
PING
3. Server sends back the reply, e.g.:
PONG
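The request/reply exchange above can be sketched at the wire level. This is a minimal illustration of the Redis serialization protocol (RESP) framing for a command and a simple reply, not a full client; the helper names `encode_command` and `decode_simple_reply` are hypothetical and exist only for this sketch.

```python
def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def decode_simple_reply(raw):
    """Decode a RESP simple string (+), integer (:) or error (-) reply."""
    kind, body = raw[:1], raw[1:].rstrip(b"\r\n").decode()
    if kind == b"+":
        return body
    if kind == b":":
        return int(body)
    if kind == b"-":
        raise RuntimeError(body)
    raise ValueError("reply type not handled by this sketch")


request = encode_command("PING")          # bytes a client would send
reply = decode_simple_reply(b"+PONG\r\n")  # bytes a server would send back
```

In practice one of the OSS clients handles this framing, so application code only sees the command and the decoded reply.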
12
~$ redis-cli
127.0.0.1:6379> SET counter 1
OK
127.0.0.1:6379> GET counter
"1"
127.0.0.1:6379> INCR counter
(integer) 2
127.0.0.1:6379> APPEND counter b||!2b
(integer) 7
127.0.0.1:6379> GETSET counter "\x00Hello\xffWorld"
"2b||!2b"
127.0.0.1:6379>
The Evolution of Versatility
14
Flexibility (v0.0.1): model (almost)
anything with basic “building blocks”
and simple rules
Composability: transactions (v1.2) and
server-embedded scripted logic (v2.6)
Extendibility: modules (v4) for adding
custom data structures and commands
Redis Modules
16
Redis Modules are:
1. Dynamically loaded libraries
2. Future-compatible binaries
3. (will be mostly) written in C
4. Use an API that the server provides
5. (nearly) as fast as core commands
6. Planned for public release Q3 2016
17
3 layers of the Modules API:
1. Operational layer: admin, memory,
disk, call arguments, replies…
2. High-level layer: client-like access to
core and modules’ commands
3. Low-level layer: (almost) native access to
core data structures’ memory
18
Benchmark: Sum Of 1M Scores In A Sorted Set
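+-------------------------+------------+
| Methodology             | Time (sec) |
+-------------------------+------------+
| Local client (Python)   | 1.2        |
| Embedded script (Lua)   | 1.25       |
| High-level API          | 1.05       |
| Low-level Iterators API | 0.1        |
+-------------------------+------------+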
19
On average
about 63.79%
of all statistics
are made up
Probabilistic Data Structures (PDSs)
21
There are three kinds
of data structures…
…and those who both
can count and can’t.
22
Data Structures of the 3rd kind
• Why: accuracy is (in theory) possible
but scale makes it (nearly) impossible
• Example: number of unique visitors
• Alternative: estimate the answer
• Data structure: the HyperLogLog
• Ergo: modules as models for PDSs
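To make the unique-visitors example concrete, here is a minimal HyperLogLog sketch. It illustrates the general technique only and is not how Redis implements its HyperLogLog; the class name, the SHA-1 hash and the precision `p=12` are assumptions chosen for the illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: 2**p registers, each holding a maximum rank."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:       # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

    def merge(self, other):
        """Union of two sketches: keep the larger rank per register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"visitor-{i}")
    hll.add(f"visitor-{i}")  # repeat visits do not change the estimate
estimate = hll.count()
```

The sketch uses 4096 small registers regardless of how many visitors are added, which is the space-for-accuracy trade the slide describes.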
23
The “good” PDSs are
1. Efficient: sublinear space-time
2. Accurate: within their parameters
3. Scalable: by merging et al.
4. Suspiciously not unlike: the Infinite
Improbability Drive (The Hitchhiker’s
Guide to the Galaxy, Adams D.)
24
Top-K - k most frequent samples
The entire algorithm:
1. Maintain set S of k counters
2. For each sample x:
2.1 If x exists in S, increment S(x)
2.2 Otherwise, if there’s space add x
to S, else decrement all counters
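The steps above can be sketched in a few lines. This is a minimal illustration of the algorithm as listed, with one assumption that the slide does not spell out: a counter that reaches zero is removed, which is what frees space for later samples. The function name `top_k_counters` is hypothetical.

```python
def top_k_counters(stream, k):
    """Track up to k counters over a stream of samples."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1              # step 2.1: known sample
        elif len(counters) < k:
            counters[x] = 1               # step 2.2: there is space
        else:
            # No space: decrement all counters; drop those that hit zero
            # (assumption of this sketch, see the lead-in).
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


result = top_k_counters("aaaaabbbc", 2)
```

The counts are lower bounds rather than exact frequencies: each "decrement all" step subtracts one from every tracked sample.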
25
Modelling Top-K with Redis
1. Sorted Set -> unique members
2. Member -> element and score
3. ZSCORE: O(1) membership
4. ZADD: O(log(N)) write
5. ZRANGEBYSCORE: O(log(N)) seek
26
redis> TOPK.ADD tk 2 a
(integer) 1
redis> TOPK.ADD tk 2 b
(integer) 1
redis> TOPK.ADD tk 2 b
(integer) 0
redis> ZRANGE tk 1 -1 WITHSCORES
1) "a"
2) "1"
3) "b"
4) "2"
redis> TOPK.ADD tk 2 c
(integer) -1
Callouts: TOPK.ADD <key> <k> <sample>; ZRANGE shows each sample with its score
Reply 1 means added
Reply 0 is freq. incr.
Reply -1 indicates eviction
27
redis> ZRANGE tk 1 -1 WITHSCORES
1) "b"
2) "2"
3) "c"
4) "2"
redis> TOPK.ADD tk 2 c
(integer) 0
redis> ZRANGE tk 1 -1 WITHSCORES
1) "b"
2) "2"
3) "c"
4) "3"
a evicted, c added
b’s and c’s score = 2
(global offset = -1)
28
topk Redis Module
1. Optimization: a global score offset
2. Eviction: reservoir sampling
3. TOPK.PRANK: percentile rank
4. TOPK.PRANGE: percentile range
5. Where: Redis Module Hub/topk
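The global score offset can be illustrated with a small sketch. Instead of rewriting every stored score when all counters are decremented, a single offset is lowered and the effective score is computed as stored score plus offset. This is an assumption-laden illustration that reproduces the numbers in the session above; it is not the module's source code, the class `OffsetCounters` is hypothetical, and the module's reservoir-sampling eviction is not modelled.

```python
class OffsetCounters:
    """Counters with an O(1) 'decrement all' via one shared offset."""

    def __init__(self):
        self.raw = {}      # stored scores, as a sorted set would hold them
        self.offset = 0    # shared offset applied to every stored score

    def effective(self, member):
        return self.raw[member] + self.offset

    def add(self, member):
        # A new member starts with an effective score of 1.
        self.raw[member] = 1 - self.offset

    def increment(self, member):
        self.raw[member] += 1

    def decrement_all(self):
        self.offset -= 1   # one write instead of one per member
        # This sketch scans for members whose effective score dropped to
        # zero; a sorted set could find them by score range instead.
        evicted = [m for m in self.raw if self.raw[m] + self.offset <= 0]
        for m in evicted:
            del self.raw[m]
        return evicted


tk = OffsetCounters()
tk.add("a")
tk.add("b")
tk.increment("b")           # stored scores: a=1, b=2
evicted = tk.decrement_all()  # offset becomes -1, a is evicted
tk.add("c")                 # stored score 2, effective score 1
```

After these steps the stored scores of b and c are both 2 with an offset of -1, matching the slide's annotation.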
29
Bloom filter – set membership
1. Answers: “have I seen this?”
2. Good for: avoiding hard work
3. Promises: no false negatives
4. Sometimes: false positives (error)
5. Gist: hash values of the samples are
indexes in an array of counters
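The gist can be sketched as a small counting Bloom filter. This is a minimal illustration under stated assumptions: the class name, the array size of 1024, the two SHA-256-derived hash functions and the unbounded Python integer counters are all choices made for this sketch, not the behaviour of any particular module.

```python
import hashlib


class CountingBloomFilter:
    """Array of counters indexed by several hashes of each sample."""

    def __init__(self, size=1024, hashes=2):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive one index per hash function by salting the input.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item):
        # Only meaningful for samples that were added earlier.
        for idx in self._indexes(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def check(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


bloom = CountingBloomFilter()
bloom.add("a")
bloom.add("b")
seen_a, seen_b = bloom.check("a"), bloom.check("b")
```

A sample that was added always checks as seen, which is the "no false negatives" promise; an unseen sample can still check as seen when all of its indexes collide with those of other samples.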
30
redis> CBF.ADD bloom a
(integer) 1
redis> CBF.ADD bloom b
(integer) 2
redis> CBF.CHECK bloom a
(integer) 1
redis> CBF.CHECK bloom b
(integer) 1
redis> CBF.CHECK bloom c
(integer) 0
[Figure: counter array 0 1 0 2 1 0, with cells pointed to by h1(a), h2(a), h1(b), h2(b), h1(c), h2(c)]
31
redablooms Redis Module
1. Error rate: defaults to 5%
2. Counting: 4-bit registers, allows
removing samples, default capacity is
100,00 samples
3. Scalable: multiple filters combined
4. Where: Redis Module Hub/redablooms
32
Count Min Sketch - item counts
1. Unlike Top-K:
answers about any sample
2. WRT Bloom filters -
Like: hashes as indexes to counters
Unlike: array per hash function,
returns the minimum of counters
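The comparison above can be sketched as a small Count-Min Sketch: one counter array per hash function, with a query returning the minimum across arrays. The class name, the width of 64, the depth of 2 and the SHA-256-derived hashes are assumptions made for this illustration.

```python
import hashlib


class CountMinSketch:
    """One row of counters per hash function; query takes the minimum."""

    def __init__(self, width=64, depth=2):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the input with the row number to get a per-row hash.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def incrby(self, item, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += amount

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available estimate.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch()
cms.incrby("a", 1)
cms.incrby("b", 2)
count_b = cms.query("b")
```

Because collisions only add to counters, the estimate never falls below the true count; it can exceed it when a sample collides in every row.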
33
redis> CMS.INCRBY count a 1 b 2
OK
redis> CMS.QUERY count b
(integer) 2
h1: 0 1 0 0 0 2
h2: 0 0 0 0 3 0 (collision)
CMS.QUERY returns min[h1(b), h2(b)] = min(2, 3) = 2
34
countminsketch Redis Module
1. Registers width: 16-bit
2. Default maximum error: 0.01%
3. Default error probability: 0.01%
4. Where: Redis Module Hub/countminsketch
35
redismodules.com: Redis Module Hub
36
What Is The Hub
1. Modules developed by: anyone
2. Certified by: Redis Labs
3. Licenses: Open Source & Commercial
4. Distributed with: Redis Cloud and
Redis Labs Enterprise Cluster
5. Where: redismodules.com
Thank you
Further Reading
39
1. The Redis Open Source Project Website – http://redis.io
2. Redis source on GitHub – http://github.com/antirez/redis
3. Redis commands documentation – http://redis.io/commands
4. Infinite Improbability Drive –
https://en.wikipedia.org/wiki/Technology_in_The_Hitchhiker%27s_Guide_to_the_Galaxy#Infinite_Improbability_Drive
5. Streaming Algorithms: Frequent Items –
https://people.eecs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming-two.pdf
6. Space/Time Trade-offs in Hash Coding with Allowable Errors –
http://dmod.eu/deca/ft_gateway.cfm.pdf
7. Approximating Data with the Count-Min Data Structure –
http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
