Redis for duplicate detection on a real-time stream
whoami(1) 
15 years of experience, proud to be a programmer 
Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords 
Implements scalable architectures 
Member of the JUG-Torino coordination team 
ro.franchini@gmail.com github.com/robfrank 
twitter.com/robfrankie linkedin.com/in/robfrank 
http://www.celi.it http://www.blogmeter.it
Agenda 
What is it? 
Main features 
Caching 
Counters 
Scripting 
How we use it
From the site 
Redis is an open source, BSD licensed, advanced 
key-value cache and store. It is often referred to 
as a data structure server since keys can contain 
strings, hashes, lists, sets, sorted sets, bitmaps 
and hyperloglogs.
Who uses it 
Twitter 
Github 
Youporn 
Pinterest 
Groupon 
...
Ecosystem 
Clients in every known language 
Articles, books, presentations 
On High Scalability every other day
Architecture 
Single-threaded server 
Yes: a single-threaded server 
Remember that when you need to scale 
A single Linux server can handle 500k req/s
Main features 
In-memory K/V store 
But with durable persistence 
Master-slave async replication 
Transactions 
Pub/Sub 
Server-side Lua scripting
Main features 
Keys with TTL 
LRU eviction 
Keys can contain strings, hashes, lists, sets, sorted sets, 
bitmaps and hyperloglogs 
Redis Cluster on the way (3.0.0-rc1)
K/V store 
Key-value (KV) stores use the associative array (also 
known as a map or dictionary) as their fundamental data 
model. In this model, data is represented as a collection of 
key-value pairs, such that each possible key appears at 
most once in the collection. (wikipedia)
K/V store 
Keys are plain text, e.g. name, surname 
Values can be strings/blobs/bitmaps (e.g. rob, frank), hash tables (objects), linked lists, sets
Persistence 
Configurable, two flavors 
RDB: perfect for backup 
AOF: append-only log, replayed at startup 
Use AOF + RDB for rock solid persistence 
Automatic cache warm-up at startup!! 
Only RAM: switch off persistence
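A minimal redis.conf sketch for the AOF + RDB combination described above (the thresholds are illustrative, not the values used in production): 
# RDB: snapshot after 900s if at least 1 key changed, or after 60s if at least 10000 changed 
save 900 1 
save 60 10000 
# AOF: log every write operation, fsync once per second 
appendonly yes 
appendfsync everysec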
Common use cases 
Cache 
Queue 
Session replication 
In memory indexes 
Centralized ID generation
Basics 
SET user:1 frank 
GET user:1 → frank 
EXISTS user:1 → 1 
EXPIRE user:1 3600 
INCR count:1 
GET count:1 → 1
Basics 
KEYS user:* → user:1, user:2 
MSET user:1 frank user:2 coder 
MGET user:1 user:2 → frank, coder 
HMSET userdetail:3 name rob surname frank 
HGETALL userdetail:3 → name: rob, surname: frank
Transactions 
MULTI 
INCR counter:1 
INCR counter:2 
EXEC 
> 1 
> 1 
Check-and-set with optimistic locking: 
WATCH counter:3 
val = GET counter:3 
val = val + 1 
MULTI 
SET counter:3 $val 
EXEC
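If another client modifies counter:3 between WATCH and EXEC, the transaction is discarded and EXEC returns nil, so the read-modify-write can simply be retried. A minimal illustration (the values are made up): 
WATCH counter:3 
GET counter:3 → 41 
(meanwhile, another client runs INCR counter:3) 
MULTI 
SET counter:3 42 
EXEC → (nil) ← aborted, retry the whole sequence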
Atomic counters 
Commands for key increment 
INCR counter:1 
GET counter:1 → 1 
INCRBY counter:1 9 
GET counter:1 → 10
Lua scripting 
Server-side Lua scripting 
A “sort of” stored procedure 
Scripts are sandboxed 
Atomic execution ← bear in mind
Lua scripting 
SCRIPT LOAD "return {KEYS[1],KEYS[2]}" 
"3905aac1828a8f75707b48e446988eaaeb173f13" 
EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2 
1) "user:1" 
2) "user:2"
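For one-off executions the same script can also be sent inline with EVAL, without loading it first (same result): 
EVAL "return {KEYS[1],KEYS[2]}" 2 user:1 user:2 
1) "user:1" 
2) "user:2"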
Caching: server level 
Configure Redis as a cache 
maxmemory 1024mb 
maxmemory-policy allkeys-lru 
When maxmemory is reached, any key can be evicted, using an approximated LRU algorithm
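The same settings can also be applied at runtime, without a restart (a sketch using the values from this slide): 
CONFIG SET maxmemory 1024mb 
CONFIG SET maxmemory-policy allkeys-lru 
CONFIG GET maxmemory 
1) "maxmemory" 
2) "1073741824"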
Caching: TTL on key 
Set a timeout on a key 
SET doc:1 "mydoc.txt" 
EXPIRE doc:1 10 
Or, in a single command: 
SETEX doc:1 10 "mydoc.txt"
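The remaining lifetime can be inspected with TTL, and once it expires the key simply disappears (values are illustrative): 
TTL doc:1 → 7 
(10 seconds later) 
GET doc:1 → (nil) 
EXISTS doc:1 → 0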
Demo
Caching 
+ 
Atomic Counters 
+ 
Atomic Lua scripting
Duplicate detection 
Real-time stream of documents from 
the Internet 
20% to 50% of documents are duplicated 
DUPLICATES ARE EVIL 
And customers don’t pay for that :(
Basic Scenario 
[Pipeline diagram] Three producers (5M docs) → Duplicates detector → NLP (3M) → Storage (3M)
Avoid duplicated documents 
Acting on the producers was 
TOO HARD 
Filter them out before heavy document analysis (NLP)
Documents 
“Documents” are from: 
twitter 
facebook 
gplus 
instagram 
forums 
blogs
Documents 
Each kind of document has its own natural id 
twitter: status id 
facebook: post id 
forum: URL 
blog: URL 
We don’t want these IDs inside our system
Duplicate and id generation 
[Pipeline diagram] Three producers (5M docs total) → two Duplicate detector + ID generation instances → Analysis → Storage
Map external keys to internal UID 
Generate an ID for each document 
IDs are generated using daily named counters: 
INCR day:20141028 → 12576 
INCR day:20141010 → 23412576 
Cache the generated ID (see the sketch below) 
tw_1234578688 → day:20141028;12576
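With plain commands this mapping takes two steps (a sketch; the 24h TTL is an assumption, and in practice both steps are wrapped in the Lua script shown later so they run atomically): 
INCR day:20141028 → 12576 
SET tw_1234578688 day:20141028;12576 EX 86400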
Map external keys to internal UID 
Documents are internally stored on different storage 
systems with their generated id 
globalId → 20141028:3456789
Operations 
Natural keys are cached with a TTL 
Documents that arrive outside the time window are parked in a staging area 
Duplicated documents are usually dropped
LRU cache, counters and Lua 
Lua scripts are executed atomically 
We wrote a simple script to: 
return the previously mapped ID, 
or generate an ID and cache the key → ID mapping (see the sketch below) 
EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 
GET tw_1234566 → 20141028:123
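A minimal sketch of such a script, following the EVALSHA example above (KEYS[1] is the daily counter, KEYS[2] the natural key; the 24-hour TTL is an assumption, not necessarily the production value): 
local cached = redis.call('GET', KEYS[2])   -- already mapped: it is a duplicate 
if cached then return cached end 
local seq = redis.call('INCR', KEYS[1])     -- daily counter, e.g. 20141028 
local id = KEYS[1] .. ':' .. seq            -- compose the internal ID, e.g. 20141028:123 
redis.call('SET', KEYS[2], id, 'EX', 86400) -- cache natural key → ID with a TTL (assumed 24h) 
return id 
Because the whole script runs atomically, two detector instances can never hand out different IDs for the same natural key.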
Demo
Deployment 
Pre-production phase 
Single server 
70M keys in 10GB of RAM 
In production: a simple master/slave (M/S) setup
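For the master/slave setup, the slave needs a single directive in its redis.conf (hostname and port are placeholders): 
slaveof redis-master 6379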
Alternatives 
PostgreSQL 
sequence(s) 
table OR hstore 
Hazelcast (we are Java-based) 
in memory 
write your own persistence
Q/A
References 
http://redis.io/ 
http://redis.io/commands 
http://stackoverflow.com/questions/tagged/redis 
http://try.redis.io/
