SlideShare a Scribd company logo
1 of 38
Download to read offline
for duplicate detection on real time stream
whoami(1) 
15 years of experience, proud to be a programmer 
Writes software for information extraction, nlp, opinion mining (@scale ), and a 
lot of other buzzwords 
Implements scalable architectures 
Member of the JUG-Torino coordination team 
ro.franchini@gmail.com github.com/robfrank 
twitter.com/robfrankie linkedin.com/in/robfrank 
http://www.celi.it http://www.blogmeter.it
Agenda 
What is it? 
Main features 
Caching 
Counters 
Scripting 
How we use it
From the site 
Redis is an open source, BSD licensed, advanced 
key-value cache and store. It is often referred to 
as a data structure server since keys can contain 
strings, hashes, lists, sets, sorted sets, bitmaps 
and hyperloglogs.
Who use it 
Twitter 
Github 
Youporn 
Pinterest 
Groupon 
...
Ecosystem 
Clients in every known language 
Articles, books, presentations 
On High Scalability every other day
Architecture 
Single-threaded server 
Yes: single threaded server 
Remember that when you need to scale 
Single Linux server can handle 500k req/s
Main features 
In memory K/V store 
But with durable persistence 
Master-slave async replica 
Transactions 
Pub/Sub 
Server side LUA scripting
Main features 
Keys with TTL 
LRU eviction 
Keys can contain strings, hashes, lists, sets, sorted sets, 
bitmaps and hyperloglogs 
REDIS cluster on the go (3.0.0-rc1)
K/V store 
Key-value (KV) stores use the associative array (also 
known as a map or dictionary) as their fundamental data 
model. In this model, data is represented as a collection of 
key-value pairs, such that each possible key appears at 
most once in the collection. (wikipedia)
K/V store 
Key 
“plain text” 
name rob 
surname frank 
A C E D B F 
A B C D E F 
String/blobs/bitmaps 
HashTable: Objects 
Linked lists 
Sets
Persistence 
Configurable, two flavors 
RDB: perfect for backup 
AOF: append only log, replayed at startup 
Use AOF + RDB for rock solid persistence 
Automatic cache warm-up at startup!! 
Only RAM: switch off persistence
Common use cases 
Cache 
Queue 
Session replication 
In memory indexes 
Centralized ID generation
Basics 
SET user:1 frank 
GET user:1 → frank 
EXISTS user:2 → 1 
EXPIRE user:1 3600 
INCR count:1 
GET count:1 → 1
Basics 
KEYS user:* → user:1, user:2 
MSET user:1 frank user:2 coder 
MGET user:1 user:2 → frank, coder 
HMSET userdetail:3 name rob surname frank 
HGETALL userdetail:3 → name::rob, surname:: frank
Transactions 
MULTI 
INCR counter:1 
INCR counter:2 
EXEC 
> 1 
> 1 
WATCH counter:3 
val = GET counter:3 
val = val +1 
MULTI 
SET counter:3 $val 
EXEC
Atomic counters 
Operators for key increment 
INCR counter:1 
GET counter:1 → 1 
INCRBY counter:1 9 
GET counter:1 → 10
LUA scripting 
Server side LUA scripting 
A “sort of” stored procedure 
Scripts are sandboxed 
Atomic execution ← bear in mind
LUA scripting 
SCRIPT LOAD "return {KEYS[1],KEYS[2]}" 
"3905aac1828a8f75707b48e446988eaaeb173f13" 
EVALSHA 
3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 
user:2 
1) "user:1" 
2) "user:2"
Caching: server level 
Configure REDIS as a cache 
maxmemory 1024mb 
maxmemory-policy allkeys-lru 
all the keys will be evicted using an 
approximated LRU algorithm
Caching: TTL on key 
Set a timeout on a key 
SET doc:1 “mydoc.txt” 
EXIPRE doc:1 10 
Or 
SETEX doc:1 10 “mydoc.txt”
Demo
Caching 
+ 
Atomic Counters 
+ 
Atomic LUA scripting
Duplicate detection 
Real time stream of documents from 
the Internet 
20% to 50% of documents are duplicated 
DUPLICATES ARE EVIL 
And customers don’t pay for that :(
Basic Scenario 
5M 3M 3M 
Duplicates 
Producer 
Producer detector NLP Storage 
Producer
Avoid duplicated documents 
Act on producers was 
TOO HARD 
Filter-out them before heavy document analysis (NLP)
Documents 
“Documents” are from: 
twitter 
facebook 
gplus 
instagram 
forums 
blogs
Documents 
Each kind of document has its own natural id 
twitter: status id 
facebook: post id 
forum: URL 
blog: URL 
We don’t want this IDs inside our system
Duplicate and id generation 
Producer 
2M 
Producer 
Producer 
Duplicate 
detector - 
ID 
generatio 
n 
Analysis 
Storage 
3M 
3M 
Duplicate 
detector - 
ID 
generatio 
n 
1M Analysis 1M 
5M
Map external keys to internal UID 
Generate an ID for each document 
IDs are generated using daily named counters: 
INCR day:20141028 → 12576 
INCR day:20141010 → 23412576 
Cache generated ID 
tw_1234578688 → day:20141028;12576
Map external keys to internal UID 
Documents are internally stored on different storage 
systems with their generated id 
globalId→ 20141028:3456789
Operations 
Natural Keys are cached with TTL 
Documents out of time are parked in a staging area 
Duplicated documents are usually dropped
LRU cache, counters and LUA 
LUA scripts are executed atomically 
Wrote a simple script to: 
return previous mapped id 
or generate id and store key and id in cache 
EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 
GET tw_1234566 → 20141028:123
Demo
Deployment 
Pre-production phase 
Single server 
70M keys in 10GB of RAM 
In production with a simple M/S
Alternatives 
PostgreSQL 
sequence(s) 
table OR hstore 
Hazelcast (we are java based) 
in memory 
write your own persistence
Q/A
References 
http://redis.io/ 
http://redis.io/commands 
http://stackoverflow.com/questions/tagged/redis 
http://try.redis.io/

More Related Content

What's hot

事例で学ぶApache Cassandra
事例で学ぶApache Cassandra事例で学ぶApache Cassandra
事例で学ぶApache CassandraYuki Morishita
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMartin Etmajer
 
ELK in Security Analytics
ELK in Security Analytics ELK in Security Analytics
ELK in Security Analytics nullowaspmumbai
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?Anton Zadorozhniy
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentDataWorks Summit
 
Kerbernetes Robotics Distributed System Deep Dive
Kerbernetes Robotics Distributed System Deep DiveKerbernetes Robotics Distributed System Deep Dive
Kerbernetes Robotics Distributed System Deep DiveTomoya Fujita
 
Understanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraUnderstanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraDataStax
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashAmazon Web Services
 
Container Runtime Security with Falco
Container Runtime Security with FalcoContainer Runtime Security with Falco
Container Runtime Security with FalcoMichael Ducy
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaAmazon Web Services
 
コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線Motonori Shindo
 
Spannerに関する技術メモ
Spannerに関する技術メモSpannerに関する技術メモ
Spannerに関する技術メモEtsuji Nakai
 
Elastic stack Presentation
Elastic stack PresentationElastic stack Presentation
Elastic stack PresentationAmr Alaa Yassen
 
Elastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full pictureElastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full pictureElasticsearch
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBDocker, Inc.
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks EDB
 

What's hot (20)

事例で学ぶApache Cassandra
事例で学ぶApache Cassandra事例で学ぶApache Cassandra
事例で学ぶApache Cassandra
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on Kubernetes
 
ELK in Security Analytics
ELK in Security Analytics ELK in Security Analytics
ELK in Security Analytics
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
 
Kerbernetes Robotics Distributed System Deep Dive
Kerbernetes Robotics Distributed System Deep DiveKerbernetes Robotics Distributed System Deep Dive
Kerbernetes Robotics Distributed System Deep Dive
 
Understanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraUnderstanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache Cassandra
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 
Container Runtime Security with Falco
Container Runtime Security with FalcoContainer Runtime Security with Falco
Container Runtime Security with Falco
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
 
コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線
 
Spannerに関する技術メモ
Spannerに関する技術メモSpannerに関する技術メモ
Spannerに関する技術メモ
 
Elastic stack Presentation
Elastic stack PresentationElastic stack Presentation
Elastic stack Presentation
 
Elastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full pictureElastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full picture
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 

Viewers also liked

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamCodemotion
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsDominik Obermaier
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...DECK36
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSRoberto Franchini
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?Roberto Franchini
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Roberto Franchini
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataRoberto Franchini
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamDominik Obermaier
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysItamar Haber
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Redis Labs
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Maarten Balliauw
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaItamar Haber
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with RedisFaisal Akber
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellYoshifumi Kawai
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap주용 오
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProRedis Labs
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...Redis Labs
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulDynomiteDB
 

Viewers also liked (20)

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time stream
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of Things
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFS
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big data
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic Beam
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual Ways
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of Lua
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with Redis
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShell
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoPro
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
 

Similar to Stream Duplicate Detection with Redis Counters and LUA Scripting

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Joe Arnold
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsCeph Community
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyTim Bunce
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http acceleratorno no
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisRicard Clau
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleJérôme Petazzoni
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchVic Hargrave
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton insertsChris Adkin
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xrkr10
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQJérôme Petazzoni
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Jérôme Petazzoni
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesNETWAYS
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisLiviu Costea
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploitegypt
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018Xavier Mertens
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1Hajime Tazaki
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNoSuchCon
 

Similar to Stream Duplicate Detection with Redis Counters and LUA Scripting (20)

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.key
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http accelerator
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with Redis
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with Elasticsearch
 
Advances in Open Source Password Cracking
Advances in Open Source Password CrackingAdvances in Open Source Password Cracking
Advances in Open Source Password Cracking
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton inserts
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redis
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploit
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 

Recently uploaded

Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 

Stream Duplicate Detection with Redis Counters and LUA Scripting

  • 1. for duplicate detection on real time stream
  • 2. whoami(1) 15 years of experience, proud to be a programmer Writes software for information extraction, nlp, opinion mining (@scale ), and a lot of other buzzwords Implements scalable architectures Member of the JUG-Torino coordination team ro.franchini@gmail.com github.com/robfrank twitter.com/robfrankie linkedin.com/in/robfrank http://www.celi.it http://www.blogmeter.it
  • 3. Agenda What is it? Main features Caching Counters Scripting How we use it
  • 4. From the site Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.
  • 5. Who use it Twitter Github Youporn Pinterest Groupon ...
  • 6. Ecosystem Clients in every known language Articles, books, presentations On High Scalability every other day
  • 7. Architecture Single-threaded server Yes: single threaded server Remember that when you need to scale Single Linux server can handle 500k req/s
  • 8. Main features In memory K/V store But with durable persistence Master-slave async replica Transactions Pub/Sub Server side LUA scripting
  • 9. Main features Keys with TTL LRU eviction Keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs REDIS cluster on the go (3.0.0-rc1)
  • 10. K/V store Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. (wikipedia)
  • 11. K/V store Key “plain text” name rob surname frank A C E D B F A B C D E F String/blobs/bitmaps HashTable: Objects Linked lists Sets
  • 12. Persistence Configurable, two flavors RDB: perfect for backup AOF: append only log, replayed at startup Use AOF + RDB for rock solid persistence Automatic cache warm-up at startup!! Only RAM: switch off persistence
  • 13. Common use cases Cache Queue Session replication In memory indexes Centralized ID generation
  • 14. Basics SET user:1 frank GET user:1 → frank EXISTS user:2 → 1 EXPIRE user:1 3600 INCR count:1 GET count:1 → 1
  • 15. Basics KEYS user:* → user:1, user:2 MSET user:1 frank user:2 coder MGET user:1 user:2 → frank, coder HMSET userdetail:3 name rob surname frank HGETALL userdetail:3 → name::rob, surname:: frank
  • 16. Transactions MULTI INCR counter:1 INCR counter:2 EXEC > 1 > 1 WATCH counter:3 val = GET counter:3 val = val +1 MULTI SET counter:3 $val EXEC
  • 17. Atomic counters Operators for key increment INCR counter:1 GET counter:1 → 1 INCRBY counter:1 9 GET counter:1 → 10
  • 18. LUA scripting Server side LUA scripting A “sort of” stored procedure Scripts are sandboxed Atomic execution ← bear in mind
  • 19. LUA scripting SCRIPT LOAD "return {KEYS[1],KEYS[2]}" "3905aac1828a8f75707b48e446988eaaeb173f13" EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2 1) "user:1" 2) "user:2"
  • 20. Caching: server level Configure REDIS as a cache maxmemory 1024mb maxmemory-policy allkeys-lru all the keys will be evicted using an approximated LRU algorithm
  • 21. Caching: TTL on key Set a timeout on a key SET doc:1 “mydoc.txt” EXIPRE doc:1 10 Or SETEX doc:1 10 “mydoc.txt”
  • 22. Demo
  • 23. Caching + Atomic Counters + Atomic LUA scripting
  • 24. Duplicate detection Real time stream of documents from the Internet 20% to 50% of documents are duplicated DUPLICATES ARE EVIL And customers don’t pay for that :(
  • 25. Basic Scenario 5M 3M 3M Duplicates Producer Producer detector NLP Storage Producer
  • 26. Avoid duplicated documents Act on producers was TOO HARD Filter-out them before heavy document analysis (NLP)
  • 27. Documents “Documents” are from: twitter facebook gplus instagram forums blogs
  • 28. Documents Each kind of document has its own natural id twitter: status id facebook: post id forum: URL blog: URL We don’t want this IDs inside our system
  • 29. Duplicate and id generation Producer 2M Producer Producer Duplicate detector - ID generatio n Analysis Storage 3M 3M Duplicate detector - ID generatio n 1M Analysis 1M 5M
  • 30. Map external keys to internal UID Generate an ID for each document IDs are generated using daily named counters: INCR day:20141028 → 12576 INCR day:20141010 → 23412576 Cache generated ID tw_1234578688 → day:20141028;12576
  • 31. Map external keys to internal UID Documents are internally stored on different storage systems with their generated id globalId→ 20141028:3456789
  • 32. Operations Natural Keys are cached with TTL Documents out of time are parked in a staging area Duplicated documents are usually dropped
  • 33. LRU cache, counters and LUA LUA scripts are executed atomically Wrote a simple script to: return previous mapped id or generate id and store key and id in cache EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 GET tw_1234566 → 20141028:123
  • 34. Demo
  • 35. Deployment Pre-production phase Single server 70M keys in 10GB of RAM In production with a simple M/S
  • 36. Alternatives PostgreSQL sequence(s) table OR hstore Hazelcast (we are java based) in memory write your own persistence
  • 37. Q/A
  • 38. References http://redis.io/ http://redis.io/commands http://stackoverflow.com/questions/tagged/redis http://try.redis.io/