Failing Fast with Redis-backed Bloom Filters
• Christopher Curtin
• Head of Technical Research
• @ChrisCurtin
About Me
• 25+ years in technology
• Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)
• Built a SaaS platform before the term ‘SaaS’ was being used
• Prior to Silverpop: real-time control systems, factory automation and warehouse management
• Always looking for technologies and algorithms to help with our challenges
Silverpop Open Positions
• Technical Lead
• Senior Engineer
• Architect
• Automation Engineers
Agenda
• Redis
• Bloom Filters
• Failing Fast
Agenda
• Redis
– What it is
– Why we started looking at using it
– Basics
– Concurrency
– Operational Considerations
– Challenges
Redis – What is it? 
From redis.io: 
"Redis is an open source, BSD licensed, advanced key-value cache and store. 
It is often referred to as a data structure server since keys can contain strings, 
hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."
Hyper-what-what?
HyperLogLog
Approximation technique for counting distinct entries in a set.
Very small memory footprint for rough approximations (in Redis, at most 12 KB per key with a standard error of about 0.81%)
Nice – but too much loss for what we need
Features
• Unlike typical key-value stores, you can send commands that edit the value on the server, instead of reading it back to the client, updating it and pushing it to the server
• Pub/sub
• TTL on keys
• Clustering and automatic fail-over
• Lua scripting
• Client libraries for just about any language you can think of
So Why did we start looking at NoSQL? 
“For the cost of an Oracle Enterprise license I can give you 64 cores 
and 3 TB of memory”
Redis Basics
• In-memory-only key-value store
• Single threaded. Yes, single threaded
• No paging, no reading from disk
• CS 101 data structures and operations
• Tens of millions of keys isn't a big deal
• The amount of RAM defines how big the store can get
Basic Data Types
• Strings
• Hashes
• Lists
• Sets and Sorted Sets
CS 101 ...
Hashes
- collection of field-value pairs stored under a single key
- useful for storing related data under a common name
- values can only be strings or numbers. No hash of lists
http://redis.io/commands/hget
Sets and Sorted Sets
• Buckets of values with very fast membership look-up
• No duplicates allowed
• Sorted Sets have scores to make them sortable
– Automatically kept in order for fast 'top x' look-ups
http://redis.io/commands/zadd
http://redis.io/commands/zrange
Lists
• Most interesting due to how operations are applied to the remote store
• Unbounded (except by memory)
• Atomic operations between lists (pop from one, push to another)
• CS 101: lpush, rpush, lpop, lrange etc.
• Advanced: blocking pops
http://redis.io/commands/rpush
http://redis.io/commands/rpoplpush
Concurrency
• Single threaded
• Each operation can work on one or two keys, atomically
• Pipelines allow execution of commands in sequence in a single server request (for Redis to execute only the pipeline, with no other client's commands in between, wrap the commands in MULTI/EXEC)
• Pipelines do not allow for logic between commands
• Lua scripts allow for logic between commands
• BE CAREFUL with Lua, scripts block all clients!
Pipeline Java Example
• BloomFilterRedis.java line 43
Lua Example
• Lua-scripts example
Operational Information
• Persistence can be 'none', journal (AOF) or point in time (RDB)
• Optional Master/Slave replication
• Home-grown HA platform (Sentinel)
• Common deployment model is lots of instances per machine
• Millions of keys get hard to manage – build 'directory' hashes to make it easier for operations to find keys to look at
Challenges with Redis
• Key explosion – single name space
• Lua scripts can block all other users
• Pipelines can block all other users
• No nested data types (I want a hash of lists!)
• Without name spaces, be cautious of how you define key names
Concurrency Demo – JMS replacement
• Client submits a request to the queue (LPUSH)
• Consumer application polls for work when a worker is available (RPOPLPUSH)
• Worker executes the task assigned to it
• When the worker is done, its list is removed
• Lather, Rinse, Repeat
• (We provide a hash of workers for Operations to query for monitoring)
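The queue flow above can be sketched without a Redis server. The class below is only an in-memory model of the list commands named on the slide (LPUSH, RPOPLPUSH, plus DEL and LLEN); it is not a Redis client, and the class name and the key names `queue:pending` and `worker:1` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a few Redis list commands (an assumption for illustration).
class ReliableQueueSketch {
    // Models the single Redis key space: key name -> list.
    private final Map<String, Deque<String>> lists = new HashMap<>();

    // LPUSH: add a value at the head of the list stored at key.
    synchronized void lpush(String key, String value) {
        lists.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    // RPOPLPUSH: move the tail of source to the head of destination in one step.
    // Returns null when source is empty, mirroring the nil reply from Redis.
    synchronized String rpoplpush(String source, String destination) {
        Deque<String> src = lists.get(source);
        if (src == null || src.isEmpty()) {
            return null;
        }
        String value = src.pollLast();
        lists.computeIfAbsent(destination, k -> new ArrayDeque<>()).addFirst(value);
        return value;
    }

    // DEL: remove the worker's list once the task is finished.
    synchronized void del(String key) {
        lists.remove(key);
    }

    // LLEN: number of entries in the list (0 when the key does not exist).
    synchronized int llen(String key) {
        Deque<String> list = lists.get(key);
        return list == null ? 0 : list.size();
    }

    public static void main(String[] args) {
        ReliableQueueSketch redis = new ReliableQueueSketch();
        redis.lpush("queue:pending", "job-1"); // client submits work
        redis.lpush("queue:pending", "job-2");

        // Consumer claims the oldest job by moving it to the worker's own list.
        String job = redis.rpoplpush("queue:pending", "worker:1");
        System.out.println("worker:1 claimed " + job);

        // ... the worker executes the task here ...
        redis.del("worker:1"); // done: the worker's list is removed
        System.out.println("pending = " + redis.llen("queue:pending"));
    }
}
```

Because the job sits in the worker's list until the worker finishes, a monitor can see what each worker holds, which is the property the slide relies on for Operations.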
Agenda
• Bloom Filters
– What they are
– Why we started looking at using them
– Basics
– False Positives
– Example Uses
– Why not do this in a database?
Bloom Filters
From Wikipedia (Don't tell my kid's teacher!)
"A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate"
Hashing
• Apply 'x' hash functions to the key to be stored/queried
• Each function returns the position of a bit to set in the bitset
• Mathematical equations determine how big to make the bitset and how many functions to use for your expected number of keys and acceptable error level
• http://hur.st/bloomfilter?n=4&p=1.0E-20
Example
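As a worked example of the sizing math mentioned on the Hashing slide, the sketch below uses the standard Bloom filter formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n expected keys and target false positive rate p. The class and method names are hypothetical, and the one million keys at 1% are illustrative numbers, not figures from the talk.

```java
// Sizing helpers for a Bloom filter, based on the standard formulas.
class BloomSizing {

    // Bits needed for n expected keys at false positive probability p:
    // m = -n * ln(p) / (ln 2)^2, rounded up.
    static long optimalBits(long n, double p) {
        if (n <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions for m bits and n keys: k = (m / n) * ln 2, at least 1.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Expected false positive rate once n keys are stored: (1 - e^(-k*n/m))^k.
    static double expectedFalsePositiveRate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // illustrative number of keys
        double p = 0.01;     // illustrative acceptable error level
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.printf("bits=%d (about %.2f MB), hashes=%d, expected fp=%.4f%n",
                m, m / 8.0 / 1024 / 1024, k, expectedFalsePositiveRate(n, m, k));
    }
}
```

The same functions can be used in reverse when planning capacity: fix the memory you can afford per filter and check what error rate that buys for the expected number of keys.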
False Positives
• Perfect hash functions aren't worth the cost to develop
• Sometimes all the bits for a key have already been set by other keys
• Make sure you understand the business impact of a false positive
• Remember, never a false negative
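To make the "false positives but never false negatives" behavior concrete, here is a minimal in-memory Bloom filter. It is a sketch, not the Orestes-Bloomfilter implementation: the class name is hypothetical, and deriving the k positions from two words of a SHA-256 digest (double hashing) is an assumption chosen to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Minimal in-memory Bloom filter over String keys.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the filter
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derives k bit positions from the first two 32-bit words of a SHA-256 digest.
    private int[] positions(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(digest);
            int h1 = buf.getInt();
            int h2 = buf.getInt();
            int[] out = new int[k];
            for (int i = 0; i < k; i++) {
                out[i] = Math.floorMod(h1 + i * h2, m); // always in [0, m)
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Sets every bit position for the key.
    void add(String key) {
        for (int p : positions(key)) {
            bits.set(p);
        }
    }

    // false means "definitely never added"; true means "probably added".
    boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(10_000, 7);
        filter.add("user-1");
        System.out.println(filter.mightContain("user-1")); // an added key is always reported
        System.out.println(filter.mightContain("user-2")); // usually false, occasionally a false positive
    }
}
```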
Creation
• Libraries are available for every language I looked up (even JavaScript)
• Some are built in memory, for a single process/JVM to use
• Read-only filters (ad networks) are built using Hadoop and loaded into memory
• In memory is great for lots of reads, single process/JVM etc.
• But ...
Updates
• Updating a 16 MB structure in memory and persisting it to disk is expensive
• 8 bits change and you write 16 MB!!!!!! (DBAs will love you …)
Deletes
• Not possible in a regular Bloom Filter – how would you know what bits are used by other keys?
• Counting Bloom Filters keep a few bits (3-4) per position in the bitmap as a counter. 'delete' decrements the counters for the key
• Not as space friendly any more …
• Instead, consider having bloom filters based around the lifetime of the data to be queried
– For a filter 'visited in the last 4 hours' have 4 filters and age the oldest out (TTL in Redis maybe ...)
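The counting variant described above can be sketched as follows. This is an illustration under assumptions, not code from the talk: the class name is hypothetical, each counter is capped at 15 to mimic a 4-bit counter (stored one per byte for readability, where a real implementation would pack them), and the positions again come from a SHA-256 digest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Counting Bloom filter: a small counter per position instead of a single bit.
class CountingBloomFilter {
    private static final int MAX = 15; // ceiling of a 4-bit counter
    private final byte[] counters;     // one counter per position
    private final int k;               // number of hash functions

    CountingBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.counters = new byte[m];
        this.k = k;
    }

    // Derives k positions from the first two 32-bit words of a SHA-256 digest.
    private int[] positions(String key) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8)));
            int h1 = buf.getInt();
            int h2 = buf.getInt();
            int[] out = new int[k];
            for (int i = 0; i < k; i++) {
                out[i] = Math.floorMod(h1 + i * h2, counters.length);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Increments each counter for the key; saturated counters stay at MAX.
    void add(String key) {
        for (int p : positions(key)) {
            if (counters[p] < MAX) {
                counters[p]++;
            }
        }
    }

    boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (counters[p] == 0) {
                return false;
            }
        }
        return true;
    }

    // Decrements the counters for the key. Returns false (and changes nothing)
    // when the filter says the key was never added. Saturated counters are left
    // alone because their true count is unknown.
    boolean remove(String key) {
        if (!mightContain(key)) {
            return false;
        }
        for (int p : positions(key)) {
            if (counters[p] > 0 && counters[p] < MAX) {
                counters[p]--;
            }
        }
        return true;
    }
}
```

Even with packed 4-bit counters this costs four times the memory of a plain filter, which is why the slide suggests time-scoped filters as the alternative.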
Issue: Persistence
• Load a 16 MB filter from the database to check 6 bits?
• Worse: update 6 bits in a 16 MB filter
• DBAs will not be happy
– Undo/redo
– SGA misses, page faults
– Backups, replication traffic etc.
Why were we interested in Bloom Filters?
• Found a lot of places where we went to the database only to find the data didn't exist
• Found lots of places where we want to know if a user DIDN'T do something
Persistent Bloom Filters
• We needed persistent Bloom Filters for lots of user stories
• Found Orestes-BloomFilter on GitHub, which uses Redis as a store, and enhanced it
• Added population filters
• Fixed a few bugs
• Did a pull request and it was accepted!
Benefits
• Filters are stored in Redis
– Only SETBIT/GETBIT calls to the server
• Reads and updates of the filter from a set of application servers
• Persistence has a cost, but a fraction of the RDBMS costs
• Can load a BF created offline and begin using it
Remember “For the cost of an Oracle License”
• Thousands of filters
• Dozens of Redis instances
• TTL on a Redis key makes cleanup of old filters trivial
Population Bloom Filters
• Unique need we had
• Users access the system frequently, but I really only need to count them once per month for billing
• Tens of thousands of clients, Finance wants the monthly report in seconds
• Logic is simple: if any bits weren't set for the key (user id), increment the counter
• Note: there are mathematical methods of estimating a BF population but we needed a better error rate
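The counting rule above ("if any bits weren't set for the key, increment the counter") can be sketched in memory like this. The class name is hypothetical, and the SHA-256 based hashing is an assumption; in the Redis-backed version described in the talk the bits and the counter would live on the server instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Bloom filter that also counts how many distinct keys it has (probably) seen.
class PopulationBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions
    private long population;

    PopulationBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derives k positions from the first two 32-bit words of a SHA-256 digest.
    private int[] positions(String key) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8)));
            int h1 = buf.getInt();
            int h2 = buf.getInt();
            int[] out = new int[k];
            for (int i = 0; i < k; i++) {
                out[i] = Math.floorMod(h1 + i * h2, m);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Sets the key's bits. If at least one bit was not set before, the key is new
    // and the population is incremented. Returns true when the key was counted.
    boolean add(String key) {
        boolean anyNewBit = false;
        for (int p : positions(key)) {
            if (!bits.get(p)) {
                bits.set(p);
                anyNewBit = true;
            }
        }
        if (anyNewBit) {
            population++;
        }
        return anyNewBit;
    }

    // A false positive on add means a new key is not counted, so this can
    // undercount slightly but never overcounts.
    long population() {
        return population;
    }

    public static void main(String[] args) {
        PopulationBloomFilter monthly = new PopulationBloomFilter(1 << 16, 5);
        monthly.add("user-42");
        monthly.add("user-42"); // repeat visit, not counted again
        monthly.add("user-43");
        System.out.println("billable users this month: " + monthly.population());
    }
}
```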
Example Uses of Bloom Filters
• Web cache – what URLs are already in the cache on another server?
• P2P networks – what node contains which part of the file?
• Databases
– Do keys exist in this page? If not, don't load the page
– HBase uses them to detect which blocks do not have the data (HDFS is write-once)
– Many RDBMS use them internally to 'fail fast' and not load pages into memory
– Sadly, no RDBMS or NoSQL I know of offers them as user data types
Example Uses of Bloom Filters
• Ad networks (old way ...)
– Big Hadoop job hourly/nightly to determine which ads to show based on prior behavior
– Load the filter into a common storage (disk usually)
– Ad servers load all the filters into memory and query for your cookie id to see what to show you
Examples of Redis-backed BloomFilters
• Has the user been here this month? If not, show them a message. A false positive doesn't matter
• White vs. Black list for IP
– Known bad IPs are in the filter
– Upon login check the filter. Not found: login. Found: check the DB to validate the bad IP.
– A false positive leads to a query that returns false, but should be rare
• Ad Networks (real time BF updates based on what you searched on)
Client side Joins
• Most NoSQL stores don't support joins
• Architecture may have data across multiple stores
• Keep a Population Bloom Filter by day of unique users in a data source
• When needing to join, load the smallest data source as the driver and query other sources in order of size
• If queries are time based and filters are available for the time, looking up key matches can be very fast
Agenda
• Fail Fast
– What it is
– Redis-backed BloomFilters
– Examples
Fail Fast
• The ability to quickly know to NOT do something expensive
• Example: Black-list of IPs
• Think about ways to NOT do some work
• Cost of Redis servers is much less than an RDBMS license or the cost of a good DB server with storage!
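The black-list check can be sketched as a two-step lookup: ask the filter first and only go to the database when the filter says "maybe". Everything named here is hypothetical. The filter and the database are passed in as plain predicates so the sketch runs on its own, and the two counters illustrate the kind of hit metrics worth collecting to see whether the filter pays for itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Fail-fast lookup: a cheap "might contain" check guards an expensive one.
class FailFastLookup {
    private final Predicate<String> filter;   // may give false positives, never false negatives
    private final Predicate<String> database; // authoritative but expensive
    private long avoided;                     // lookups answered by the filter alone
    private long databaseCalls;               // lookups that still reached the database

    FailFastLookup(Predicate<String> filter, Predicate<String> database) {
        this.filter = filter;
        this.database = database;
    }

    boolean isBlacklisted(String ip) {
        if (!filter.test(ip)) {
            avoided++;        // definitely not black-listed, skip the database
            return false;
        }
        databaseCalls++;      // "maybe": confirm against the database
        return database.test(ip);
    }

    long avoided() {
        return avoided;
    }

    long databaseCalls() {
        return databaseCalls;
    }

    public static void main(String[] args) {
        // Stand-ins: a set plays the filter (10.0.0.2 acts as a false positive)
        // and another set plays the database of known bad IPs.
        Set<String> filterContents = new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2"));
        Set<String> badIps = new HashSet<>(Arrays.asList("10.0.0.1"));
        FailFastLookup lookup = new FailFastLookup(filterContents::contains, badIps::contains);

        System.out.println(lookup.isBlacklisted("10.0.0.1"));    // confirmed by the database
        System.out.println(lookup.isBlacklisted("10.0.0.2"));    // false positive, database says no
        System.out.println(lookup.isBlacklisted("192.168.1.5")); // filter says no, database untouched
        System.out.println("avoided=" + lookup.avoided() + " dbCalls=" + lookup.databaseCalls());
    }
}
```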
Hammer Time
Be careful
• Sometimes the cost of building and maintaining the structures outweighs the benefit
• Convoluted designs to avoid the database
• Collect metrics on 'hits' to see if there is any benefit (CodaHale)
Example (naive)
• Build a BF for ads shown to a user (hash on user id and ad id)
• When the user visits, hash their user id and the top ad to display this hour and set the bits in the BF
• If any were not set, the Population count is incremented and you display the ad
• If already set, move to the next most important ad.
• Now you know total unique views by ad by hour
• Can do total gross with a Redis Hash too!
Example – smarter
• Hash the top 10 ad ids with the user id and request them in parallel (Pipeline)
• Check the return to see which ones aren't set, submit an update request and set the population
• 2 round trips to check 10 ads.
• (Can also do this in Lua in 1 round trip)
Example – part 2
• Same idea as before, but build the bloom filter for each hour
• When the user visits, query the last 6 filters in parallel (pipeline!) to see if they've seen the ad(s).
• Redis TTL on the hourly filter will drop it automatically when it becomes too old
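The per-hour idea can be sketched in memory as one filter per hour bucket, with old buckets dropped on write. This is only a model: the class name, the `user-1:ad-7` style key and the SHA-256 hashing are assumptions, and the explicit expiry step stands in for what a TTL on each hourly Redis key would do automatically.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// One Bloom filter per hour; lookups consult only the most recent hours.
class HourlyBloomFilters {
    private static final long HOUR_MILLIS = 3_600_000L;
    private final Map<Long, BitSet> byHour = new HashMap<>();
    private final int m;           // bits per hourly filter
    private final int k;           // hash functions
    private final int hoursToKeep; // e.g. 6 for "seen in the last 6 hours"

    HourlyBloomFilters(int m, int k, int hoursToKeep) {
        if (m <= 0 || k <= 0 || hoursToKeep <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        this.m = m;
        this.k = k;
        this.hoursToKeep = hoursToKeep;
    }

    // Derives k positions from the first two 32-bit words of a SHA-256 digest.
    private int[] positions(String key) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8)));
            int h1 = buf.getInt();
            int h2 = buf.getInt();
            int[] out = new int[k];
            for (int i = 0; i < k; i++) {
                out[i] = Math.floorMod(h1 + i * h2, m);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Records the key in the current hour's filter and drops expired filters.
    void record(String key, long nowMillis) {
        long hour = nowMillis / HOUR_MILLIS;
        BitSet bits = byHour.computeIfAbsent(hour, h -> new BitSet(m));
        for (int p : positions(key)) {
            bits.set(p);
        }
        byHour.keySet().removeIf(h -> h <= hour - hoursToKeep); // stand-in for a Redis TTL
    }

    // True if any of the last hoursToKeep hourly filters might contain the key.
    boolean seenRecently(String key, long nowMillis) {
        long current = nowMillis / HOUR_MILLIS;
        int[] pos = positions(key);
        for (long h = current; h > current - hoursToKeep; h--) {
            BitSet bits = byHour.get(h);
            if (bits != null && allSet(bits, pos)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allSet(BitSet bits, int[] pos) {
        for (int p : pos) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, after `record("user-1:ad-7", t)` a call to `seenRecently("user-1:ad-7", t + 5 hours)` still answers true with a six-hour window, while the same call one hour later falls outside the window.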
Example 3
• Collect lots of data about users (such as virtual cows, farm land, chickens etc.)
• Run a predictive model on the data and identify which special offers to show when the user visits again. Store user ids in a Bloom Filter
• Load the BF into Redis
• Query each time the user logs in and display the appropriate offer
• No massive database insert/updates to flag who should see it
• False positive isn't too bad
Example 4 – Query optimization
• Client-side joins
• Ask the Bloom Filter if the user has performed the action (filters for hour, day, week of year etc.)
• If not, don't even call the data source
• May need to read some extra data due to 'in the last 11 days', but asking the BF and being told 'no' prevents ANY data source resources from being used
• What if the BF is lost? Rebuild it from the base events (Hadoop!)
Conclusion
• Redis is a very fast, very simple and very powerful name-value store (“data structure server”)
• Bloom Filters have lots of applications when you want to quickly look up if one of millions of 'things' happened
• Redis-backed BloomFilters make updatable bloom filters trivial to use
• Think about what you need to know to NOT do an expensive operation
• Fail fast
References
• redis.io
• http://en.wikipedia.org/wiki/Bloom_filter
• http://hur.st/bloomfilter?n=4&p=1.0E-20
• https://github.com/Baqend/Orestes-Bloomfilter
• http://www.slideshare.net/chriscurtin
• @ChrisCurtin on Twitter
• github.com/chriscurtin
Questions?

Redis and Bloom Filters - Atlanta Java Users Group 9/2014
