Leveraging Probabilistic Data Structures for Real Time Analytics with Redis Modules

Leveraging Probabilistic Data Structures for
Real Time Analytics with Redis Modules
Itamar Haber, Redis Labs

2
Who We Are
The open source home and commercial provider
of Redis
Open source. The leading in-memory database
platform, supporting any high performance
OLTP or OLAP use case.
Chief Developer Advocate at Redis Labs
itamar@redislabs.com
@itamarhaber

3
There are three kinds
of people in the world;
those who can count
and those who can’t.

5
1.Redis: REmote DIctionary Server
2./ rɛdɪs/: “red-iss”
3.OSS: http://github.com/antirez/redis
4.3-clause BSD-license: http://redis.io
5.In-memory: (always) read from RAM
6.A database for: 5 data structures
7.And: 4 more specialized ones

6
8.Developed & maintained: (mostly)
Salvatore Sanfilippo (a.k.a. @antirez)
and his OSS team at @RedisLabs
9.Short history:v1.0 August 9th, 2009
… v3.2 May 6th, 2016
10.“The Leatherman™ of Databases”:
mostly used as a DB, cache & broker

7
11.A couple or so of extra features:
(a) atomicity; (b) transactions;
(c) configurable persistence;
(d) expiration; (e) eviction;
(f) PubSub; (g) Lua scripts;
(h) high availability; (i) clustering
12.Next version (v4.0): MODULES!

8
2..4 Reasons Why Redis Is A Must For Any
Data-Driven Real-Time Analytical Process
Simplicity VersatilityPerformance
“it is very fast”
Next 3 slides
+ ‘demo’
while(!eof)

9
Redis 101
1. Redis is “NoSQL”
0. No (explicit) schema, access by key
1. Key -> structure -> data
SIMPL-ICI-TY: simple, I see it, thank you

11
Redis for Data Scientists steps:
1. 147 OSS clients (49 languages), e.g.:
Python, Java, Spark, R, Julia, Matlab
2. Make a request with the client, i.e.:
PING
3. Server sends back the reply, i.e.g:
PONG

12
~$ redis-cli
127.0.0.1:6379> SET counter 1
OK
127.0.0.1:6379> GET counter
"1"
127.0.0.1:6379> INCR counter
(integer) 2
127.0.0.1:6379> APPEND counter b||!2b
(integer) 7
127.0.0.1:6379> GETSET counter "x00HelloxffWorld"
"2b||!2b"
127.0.0.1:6379>

14
Flexibility (v0.0.1): model (almost)
anything with basic “building blocks”
and simple rules
Composability: transactions (v1.2) and
server-embedded scripted logic (v2.6)
Extendibility: modules (v4) for adding
custom data structures and commands

16
Redis Modules are:
1. Dynamically loaded libraries
2. Future-compatible binaries
3. (will be mostly) written in C
4. Use an API that the server provides
5. (nearly) as fast as core commands
6. Planned for public release Q3 2016

17
3 layers of the Modules API:
1.Operational layer: admin, memory,
disk, call arguments, replies…
2.High-level layer: client-like access to
core and modules’ commands
3.Low-level: (almost) native access to
core data structures memory

18
Benchmark: Sum Of 1M Scores In A Sorted Set
Methodology Time (sec)
Local client (Python) 1.2
Embedded script (Lua) 1.25
High-level API 1.05
Low-level Iterators API 0.1

19
On average
about 63.79%
of all statistics
are made up

Probabilistic Data Structures (PDSs)

21
There are three kinds
of data structures…
…and those who both
can count and can’t.

22
Data Structures of the 3rd kind
• Why: accuracy is (in theory) possible
but scale makes it (nearly) impossible
• Example: number of unique visitors
• Alternative: estimate the answer
• Data structure: the HyperLogLog
• Ergo: modules as models for PDSs

23
The “good” PDSs are
1. Efficient: sublinear space-time
2. Accurate: within their parameters
3. Scalable: by merging et al.
4. Suspiciously not unlike: the Infinite
Improbability Drive (The Hitch Hiker
Guide to the Galaxy, Adams D.)

24
Top-K - k most frequent samples
The entire algorithm:
1. Maintain set S of k counters
2. For each sample s:
2.1 If s exists in S, increment S(x)
2.1 Otherwise, if there’s space add x
to S , else decrement all counters

25
Modelling Top-K with Redis
1. Sorted Set -> unique members
2. Member -> element and score
3. ZSCORE: O(1) membership
4. ZADD: O(Log(N)) write
5. ZRANGEBYSCORE: O(Log(N)) seek

26
redis> TOPK.ADD tk 2 a
(integer) 1
redis> TOPK.ADD tk 2 b
(integer) 1
redis> TOPK.ADD tk 2 b
(integer) 0
redis> ZRANGE tk 1 -1 WITHSCORES
1) "a"
2) "1"
3) "b"
4) "2"
redis> TOPK.ADD tk 2 c
(integer) -1
k
sample
score
1 means added
0 is freq. incr.
indicates eviction

27
1) "b"
2) "2"
3) "c"
4) "2"
redis> TOPK.ADD tk 2 c
(integer) 0
1) "b"
2) "2"
3) "c"
4) "3"
a evicted, c added
b’s and c’s score = 2
(global offset = -1)

28
topk Redis Module
1.Optimization: a global score offset
2.Eviction: reservoir sampling
3.TOPK.PRANK: percentile rank
4.TOPK.PRANGE: percentile range
5.Where: Redis Module Hub/topk

29
Bloom filter – set membership
1.Answers: “have I seen this?”
2.Good for: avoiding hard work
3.Promises: no false negatives
4.Sometimes: false positives (error)
5.Gist: hash values of the samples are
indexes in an array of counters

30
redis> CBF.ADD bloom a
(integer) 1
redis> CBF.ADD bloom b
(integer) 2
redis> CBF.CHECK bloom a
(integer) 1
redis> CBF.CHECK bloom b
(integer) 1
redis> CBF.CHECK bloom c
(integer) 0
0 1 0 21 0
h1(a), h2(a)
h1(b), h2(b)h1(c), h2(c)

31
redablooms Redis Module
1.Error rate: defaults to %5
2.Counting: 4-bit registers, allows
removing samples, default capacity is
100,00 samples
3.Scalable: multiple filters combined
4.Redis Module Hub/redablooms

32
Count Min Sketch - item counts
1.Unlike Top-K:
answers about any sample
2.WRT Bloom filters -
Like: hashes as indexes to counters
Unlike: array per hash function,
returns the minimum of counters

33
redis> CMS.INCRBY count a 1 b 2
OK
redis> CMS.QUERY count b
(integer) 2
0 1 0 00 2 h1
0 0 0 03 0 h2
collision
min[h1(b), h2(b)] hi(b) hi(b)

34
countminsketch Redis Module
1.Registers width: 16-bit
2.Default maximum error: %0.01
3.Default error probability: %0.01
4.Redis Module Hub/countminsketch

35
redismodules.com: Redis Module Hub

36
What Is The Hub
1.Modules developed by: anyone
2.Certified by: Redis Labs
3.Licenses: Open Source & Commercial
4.Distributed with: Redis Cloud and
Redis Labs Enterprise Cluster
5.Where: redismodules.com

39
1. The Redis Open Source Project Website – http://redis.io
2. Redis source on GitHub – http://github.com/antirez/redis
3. Redis commands documentation – http://redis.io/commands
4. Infinite Improbability Drive –
https://en.wikipedia.org/wiki/Technology_in_The_Hitchhiker%27s_G
uide_to_the_Galaxy#Infinite_Improbability_Drive
5. Streaming Algorithms: Frequent Items –
https://people.eecs.berkeley.edu/~satishr/cs270/sp11/rough-
notes/Streaming-two.pdf
6. Space/Time Trade-offs in Hash Coding with Allowable Errors –
http://dmod.eu/deca/ft_gateway.cfm.pdf
7. Approximating Data with the Count-Min Data Structure –
http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf

Leveraging Probabilistic Data Structures for Real Time Analytics with Redis Modules

More Related Content

What's hot

Viewers also liked

Similar to Leveraging Probabilistic Data Structures for Real Time Analytics with Redis Modules

More from Itamar Haber

Recently uploaded

Leveraging Probabilistic Data Structures for Real Time Analytics with Redis Modules