Redis for duplicate detection on a real-time stream
whoami(1) 
15 years of experience, proud to be a programmer 
Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords 
Implements scalable architectures 
Member of the JUG-Torino coordination team 
ro.franchini@gmail.com github.com/robfrank 
twitter.com/robfrankie linkedin.com/in/robfrank 
http://www.celi.it http://www.blogmeter.it
Agenda 
What is it? 
Main features 
Caching 
Counters 
Scripting 
How we use it
From the site 
Redis is an open source, BSD licensed, advanced 
key-value cache and store. It is often referred to 
as a data structure server since keys can contain 
strings, hashes, lists, sets, sorted sets, bitmaps 
and hyperloglogs.
Who uses it 
Twitter 
Github 
Youporn 
Pinterest 
Groupon 
...
Ecosystem 
Clients in every known language 
Articles, books, presentations 
On High Scalability every other day
Architecture 
Single-threaded server 
Yes: a single-threaded server 
Remember that when you need to scale 
A single Linux server can handle 500k req/s
Main features 
In-memory K/V store 
But with durable persistence 
Master-slave async replication 
Transactions 
Pub/Sub 
Server-side Lua scripting
Main features 
Keys with TTL 
LRU eviction 
Keys can contain strings, hashes, lists, sets, sorted sets, 
bitmaps and hyperloglogs 
Redis Cluster on the way (3.0.0-rc1)
K/V store 
Key-value (KV) stores use the associative array (also 
known as a map or dictionary) as their fundamental data 
model. In this model, data is represented as a collection of 
key-value pairs, such that each possible key appears at 
most once in the collection. (wikipedia)
K/V store 
Keys are plain text, e.g. name, surname 
Values can be strings/blobs/bitmaps (e.g. rob, frank), hash tables (objects), linked lists, sets
Persistence 
Configurable, two flavors 
RDB: perfect for backup 
AOF: append-only log, replayed at startup 
Use AOF + RDB for rock solid persistence 
Automatic cache warm-up at startup!! 
Only RAM: switch off persistence
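A minimal redis.conf sketch for the AOF + RDB combination described above (the thresholds are illustrative, not the values used in production): 
# RDB: snapshot after 900s if at least 1 key changed, or after 60s if at least 10000 changed 
save 900 1 
save 60 10000 
# AOF: log every write operation, fsync once per second 
appendonly yes 
appendfsync everysec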
Common use cases 
Cache 
Queue 
Session replication 
In memory indexes 
Centralized ID generation
Basics 
SET user:1 frank 
GET user:1 → frank 
EXISTS user:1 → 1 
EXPIRE user:1 3600 
INCR count:1 
GET count:1 → 1
Basics 
KEYS user:* → user:1, user:2 
MSET user:1 frank user:2 coder 
MGET user:1 user:2 → frank, coder 
HMSET userdetail:3 name rob surname frank 
HGETALL userdetail:3 → name: rob, surname: frank
Transactions 
MULTI 
INCR counter:1 
INCR counter:2 
EXEC 
> 1 
> 1 
Check-and-set with optimistic locking: 
WATCH counter:3 
val = GET counter:3 
val = val + 1 
MULTI 
SET counter:3 $val 
EXEC
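If another client modifies counter:3 between WATCH and EXEC, the transaction is discarded and EXEC returns nil, so the read-modify-write can simply be retried. A minimal illustration (the values are made up): 
WATCH counter:3 
GET counter:3 → 41 
(meanwhile, another client runs INCR counter:3) 
MULTI 
SET counter:3 42 
EXEC → (nil) ← aborted, retry the whole sequence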
Atomic counters 
Commands for key increment 
INCR counter:1 
GET counter:1 → 1 
INCRBY counter:1 9 
GET counter:1 → 10
Lua scripting 
Server-side Lua scripting 
A “sort of” stored procedure 
Scripts are sandboxed 
Atomic execution ← bear in mind
Lua scripting 
SCRIPT LOAD "return {KEYS[1],KEYS[2]}" 
"3905aac1828a8f75707b48e446988eaaeb173f13" 
EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2 
1) "user:1" 
2) "user:2"
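For one-off executions the same script can also be sent inline with EVAL, without loading it first (same result): 
EVAL "return {KEYS[1],KEYS[2]}" 2 user:1 user:2 
1) "user:1" 
2) "user:2"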
Caching: server level 
Configure Redis as a cache 
maxmemory 1024mb 
maxmemory-policy allkeys-lru 
When maxmemory is reached, any key can be evicted, using an approximated LRU algorithm
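The same settings can also be applied at runtime, without a restart (a sketch using the values from this slide): 
CONFIG SET maxmemory 1024mb 
CONFIG SET maxmemory-policy allkeys-lru 
CONFIG GET maxmemory 
1) "maxmemory" 
2) "1073741824"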
Caching: TTL on key 
Set a timeout on a key 
SET doc:1 "mydoc.txt" 
EXPIRE doc:1 10 
Or, in a single command: 
SETEX doc:1 10 "mydoc.txt"
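The remaining lifetime can be inspected with TTL, and once it expires the key simply disappears (values are illustrative): 
TTL doc:1 → 7 
(10 seconds later) 
GET doc:1 → (nil) 
EXISTS doc:1 → 0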
Demo
Caching 
+ 
Atomic Counters 
+ 
Atomic Lua scripting
Duplicate detection 
Real-time stream of documents from 
the Internet 
20% to 50% of documents are duplicated 
DUPLICATES ARE EVIL 
And customers don’t pay for that :(
Basic Scenario 
[Pipeline diagram] Three producers (5M docs) → Duplicates detector → NLP (3M) → Storage (3M)
Avoid duplicated documents 
Acting on the producers was 
TOO HARD 
Filter them out before heavy document analysis (NLP)
Documents 
“Documents” are from: 
twitter 
facebook 
gplus 
instagram 
forums 
blogs
Documents 
Each kind of document has its own natural id 
twitter: status id 
facebook: post id 
forum: URL 
blog: URL 
We don’t want these IDs inside our system
Duplicate and id generation 
[Pipeline diagram] Three producers (5M docs total) → two Duplicate detector + ID generation instances → Analysis → Storage
Map external keys to internal UID 
Generate an ID for each document 
IDs are generated using daily named counters: 
INCR day:20141028 → 12576 
INCR day:20141010 → 23412576 
Cache the generated ID (see the sketch below) 
tw_1234578688 → day:20141028;12576
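With plain commands this mapping takes two steps (a sketch; the 24h TTL is an assumption, and in practice both steps are wrapped in the Lua script shown later so they run atomically): 
INCR day:20141028 → 12576 
SET tw_1234578688 day:20141028;12576 EX 86400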
Map external keys to internal UID 
Documents are internally stored on different storage 
systems with their generated id 
globalId → 20141028:3456789
Operations 
Natural keys are cached with a TTL 
Documents that arrive outside the time window are parked in a staging area 
Duplicated documents are usually dropped
LRU cache, counters and Lua 
Lua scripts are executed atomically 
We wrote a simple script to: 
return the previously mapped ID, 
or generate an ID and cache the key → ID mapping (see the sketch below) 
EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 
GET tw_1234566 → 20141028:123
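A minimal sketch of such a script, following the EVALSHA example above (KEYS[1] is the daily counter, KEYS[2] the natural key; the 24-hour TTL is an assumption, not necessarily the production value): 
local cached = redis.call('GET', KEYS[2])   -- already mapped: it is a duplicate 
if cached then return cached end 
local seq = redis.call('INCR', KEYS[1])     -- daily counter, e.g. 20141028 
local id = KEYS[1] .. ':' .. seq            -- compose the internal ID, e.g. 20141028:123 
redis.call('SET', KEYS[2], id, 'EX', 86400) -- cache natural key → ID with a TTL (assumed 24h) 
return id 
Because the whole script runs atomically, two detector instances can never hand out different IDs for the same natural key.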
Demo
Deployment 
Pre-production phase 
Single server 
70M keys in 10GB of RAM 
In production: a simple master/slave (M/S) setup
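For the master/slave setup, the slave needs a single directive in its redis.conf (hostname and port are placeholders): 
slaveof redis-master 6379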
Alternatives 
PostgreSQL 
sequence(s) 
table OR hstore 
Hazelcast (we are Java-based) 
in memory 
write your own persistence
Q/A
References 
http://redis.io/ 
http://redis.io/commands 
http://stackoverflow.com/questions/tagged/redis 
http://try.redis.io/
