This document discusses using Redis to enable failing fast with Bloom filters. It provides an overview of Redis as a data structure server and key-value cache, as well as Bloom filters as a space-efficient probabilistic data structure for membership testing. By storing Bloom filters in Redis, they can be made updatable and persistent while avoiding the high costs of updating and querying large Bloom filters stored in databases. Examples are given where Redis-backed Bloom filters can be used to optimize queries and decide whether expensive operations need to be performed.
Redis and Bloom Filters - Atlanta Java Users Group 9/2014
1. Failing Fast with Redis-backed Bloom Filters
• Christopher Curtin
• Head of Technical Research
• @ChrisCurtin
2. About Me
25+ years in technology
Head of Technical Research at Silverpop, an IBM Company (14+ years at
Silverpop)
Built a SaaS platform before the term ‘SaaS’ was being used
Prior to Silverpop: real-time control systems, factory automation and
warehouse management
Always looking for technologies and algorithms to help with our
challenges
5. Agenda
Redis
What it is
Why we started looking at using it
Basics
Concurrency
Operational Considerations
Challenges
6. Redis – What is it?
From redis.io:
"Redis is an open source, BSD licensed, advanced key-value cache and store.
It is often referred to as a data structure server since keys can contain strings,
hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."
7. Hyper-what-what?
HyperLogLog
Approximation technique for counting distinct entries in a set.
Very small memory footprint for rough approximations (Redis uses at most
12 KB per key, with a standard error of 0.81%)
Nice – but too much loss for what we need
8. Features
• Unlike typical key-value stores, you can send commands to edit the
value on the server vs. reading back to the client, updating and
pushing to the server
• pub/sub
• TTL on keys
• Clustering and automatic fail-over
• Lua scripting
• Client libraries for just about any language you can think of
9. So Why did we start looking at NoSQL?
“For the cost of an Oracle Enterprise license I can give you 64 cores
and 3 TB of memory”
11. Redis Basics
In Memory-only key-value store
Single Threaded. Yes, Single Threaded
No Paging, no reading from disk
CS 101 data structures and operations
Tens of millions of keys isn't a big deal
The amount of RAM determines how big the store can get
13. Hashes
Hashes
- collection of key-value pairs with a single name
- useful for storing data under a common name
- values can only be strings or numeric. No hash of lists
http://redis.io/commands/hget
14. Sets and Sorted Sets
Buckets of values with very fast membership look-up
No duplicates allowed
Sorted Sets have scores to make them sortable
– Automatically keeps them in order for fast 'top x' look ups
http://redis.io/commands/zadd
http://redis.io/commands/zrange
15. Lists
Most interesting due to how operations are applied to the remote
store
Unbounded (except by memory)
Atomic operations between lists (pop from one, push to another)
CS 101: lpush, rpush, lpop, lrange etc.
Advanced: blocking pops
http://redis.io/commands/rpush
http://redis.io/commands/rpoplpush
16. Concurrency
Single threaded
Each operation can work on one or two keys, atomically
Pipelines send a batch of commands in a single server request and run
them in order (wrap the batch in MULTI/EXEC if commands from other
clients must not interleave)
Pipelines do not allow for logic between commands
Lua scripts allow for logic between commands
BE CAREFUL with Lua: a running script blocks all other clients!
19. Operational Information
Persistence can be 'none', journal (AOF) or point in time (RDB)
Optional Master/Slave replication
Home-grown HA platform (Sentinel)
Common deployment model is lots of instances per machine
Millions of keys get hard to manage – build 'directory' hashes to
make it easier for operations to find keys to look at
20. Challenges with Redis
Key Explosion – single name space
Lua scripts can block all other users
Pipelines can block all other users
No nested data types (I want a hash of lists!)
Without name spaces be cautious of how you define key names
21. Concurrency Demo – JMS replacement
Client submits a request to the queue (LPUSH)
Consumer application polls for work when worker is available
(RPOPLPUSH)
Worker executes the task assigned to it
When worker is done, its list is removed
Lather, Rinse, Repeat
(We provide a hash of workers for Operations to query for
monitoring)
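
The flow above can be sketched without a server. This is a minimal in-memory stand-in, not the code from the demo: `InMemoryLists` is a hypothetical class that mimics only the list semantics of LPUSH, RPOPLPUSH and DEL, and the key names `queue` and `worker:1` are assumptions.

```python
from collections import defaultdict, deque


class InMemoryLists:
    """Hypothetical stand-in mimicking Redis list semantics for LPUSH, RPOPLPUSH and DEL."""

    def __init__(self):
        self.lists = defaultdict(deque)

    def lpush(self, key, value):
        self.lists[key].appendleft(value)    # push on the head (left)
        return len(self.lists[key])

    def rpoplpush(self, src, dst):
        if not self.lists[src]:
            return None                      # nothing queued
        value = self.lists[src].pop()        # pop from the tail (right) ...
        self.lists[dst].appendleft(value)    # ... and push on the destination in the same step
        return value

    def delete(self, key):
        self.lists.pop(key, None)


store = InMemoryLists()
store.lpush("queue", "job-1")                # client submits requests
store.lpush("queue", "job-2")

task = store.rpoplpush("queue", "worker:1")  # consumer claims the oldest job for worker 1
# ... the worker executes the task; the job stays visible in "worker:1" until it finishes ...
store.delete("worker:1")                     # worker is done, so its list is removed
```

Because the claimed job sits in the worker's own list while it runs, Operations can see what each worker is doing, and a crashed worker leaves its job behind rather than losing it.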
22. Agenda
Bloom Filters
What they are
Why we started looking at using them
Basics
False Positives
Example Uses
Why not do this in a database?
23. Bloom Filters
From Wikipedia (don't tell my kid's teacher!)
"A Bloom filter is a space-efficient probabilistic data structure,
conceived by Burton Howard Bloom in 1970, that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not, thus a Bloom filter has a 100%
recall rate"
24. Hashing
Apply 'x' hash functions to the key to be stored or queried
Each function returns the index of a bit to set (or check) in the bitset
Mathematical equations determine how big to make the bitset and how
many functions to use for your expected key count and acceptable error level
http://hur.st/bloomfilter?n=4&p=1.0E-20
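
The sizing math and the hashing step can be sketched together. This is a minimal in-memory illustration, not any particular library's implementation: the formulas are the standard ones for an optimally sized filter, while the class name, method names and the SHA-256 double-hashing scheme are assumptions chosen for brevity.

```python
import hashlib
import math


def optimal_size(n: int, p: float) -> tuple[int, int]:
    """Return (bits m, hash count k) for n expected keys and false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


class BloomFilter:
    def __init__(self, n: int, p: float):
        self.m, self.k = optimal_size(n, p)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str) -> list[int]:
        # Assumption: derive all k indexes from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter(n=1_000_000, p=0.01)
bf.add("user:42")
```

For one million keys at a 1% false-positive rate, `optimal_size` gives 9,585,059 bits (about 1.2 MB) and 7 hash functions.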
26. False Positives
Perfect hash functions aren't worth the cost to develop
Sometimes all the bits for a key have already been set by other keys
Make sure you understand the business impact of a false positive
Remember, never a false negative
27. Creation
Libraries are available for every language I looked up (even
JavaScript)
Some are built in memory, for a single process/JVM to use
Read-only filters (ad networks) are built using Hadoop and loaded into
memory
In memory is great for lots of reads, single process/JVM etc.
But ...
28. Updates
Updating a 16 MB structure in memory and persisting to disk is
expensive
8 bits change and you write 16 MB!!!!!! (DBAs will love you …)
29. Deletes
Not possible in a regular Bloom Filter – how would you know what bits
are used by other keys?
Counting Bloom Filters keep a few bits (3-4) per bit in the bitmap as a
counter. 'delete' decrements the counters for the key
Not as space friendly any more …
Instead, consider having bloom filters based around the lifetime of the
data to be queried
– For a filter 'visited in the last 4 hours' have 4 filters and age the oldest
out (TTL in Redis maybe ...)
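
To make the counting variant concrete, here is a minimal sketch. The class and method names, the 4-bit saturation limit and the hashing scheme are all assumptions, and the counters are kept in a plain list for readability where a real implementation would pack them.

```python
import hashlib


class CountingBloomFilter:
    """Counting variant: one small counter per slot instead of one bit."""

    def __init__(self, m: int, k: int, counter_bits: int = 4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1    # 4-bit counters saturate at 15
        self.counters = [0] * m

    def _positions(self, key: str) -> list[int]:
        # Assumption: derive all k indexes from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def might_contain(self, key: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key: str) -> bool:
        if not self.might_contain(key):
            return False                             # definitely never added
        for pos in self._positions(key):
            if self.counters[pos] < self.max_count:  # a saturated counter can't be trusted
                self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter(m=1 << 16, k=4)
cbf.add("a")
cbf.add("b")
removed = cbf.remove("a")
```

With 4 bits per slot the structure is four times the size of the plain bitset, which is the space cost the slide refers to.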
30. Issue: Persistence
Load a 16 MB filter from database to check 6 bits?
Worse: update 6 bits in a 16 MB filter
DBAs will not be happy
– Undo/redo
– SGA misses, page faults
– Backups, replication traffic etc.
31. Why were we interested in Bloom Filters?
Found a lot of places where we went to the database only to find the
data didn't exist
Found lots of places where we want to know if a user DIDN'T do
something
32. Persistent Bloom Filters
We needed persistent Bloom Filters for lots of user stories
Found Orestes-BloomFilter on GitHub that used Redis as a store
and enhanced it
Added population filters
Fixed a few bugs
Did a pull request and it was accepted!
33. Benefits
Filters are stored in Redis
• Only SETBIT/GETBIT calls to the server
Reads and updates of the filter from set of application servers
Persistence has a cost, but a fraction of the RDBMS costs
Can load a BF created offline and begin using it
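
A rough sketch of the idea follows; it is not the Orestes-BloomFilter API. `RedisBloomFilter` is a hypothetical class that touches the store only through `setbit` and `getbit`, and `InMemoryBits` is a hypothetical stand-in so the sketch runs without a server. The key name `bf:seen` and the hashing scheme are also assumptions.

```python
import hashlib


class RedisBloomFilter:
    """Bloom filter whose bits live in one Redis string, touched only via SETBIT/GETBIT."""

    def __init__(self, client, name: str, m: int, k: int):
        self.client, self.name, self.m, self.k = client, name, m, k

    def _positions(self, key: str) -> list[int]:
        # Assumption: derive all k indexes from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.client.setbit(self.name, pos, 1)    # only k bits travel, not the whole filter

    def might_contain(self, key: str) -> bool:
        return all(self.client.getbit(self.name, pos) for pos in self._positions(key))


class InMemoryBits:
    """Hypothetical stand-in for a Redis client: setbit/getbit with the same return values."""

    def __init__(self):
        self.bits = {}

    def setbit(self, name, offset, value):
        old = self.bits.get((name, offset), 0)
        self.bits[(name, offset)] = value
        return old                                   # SETBIT returns the previous bit

    def getbit(self, name, offset):
        return self.bits.get((name, offset), 0)


client = InMemoryBits()    # against a real server this could be a redis-py client instead
seen = RedisBloomFilter(client, "bf:seen", m=1 << 20, k=7)
seen.add("user:42")
```

Every application server that points at the same key shares the same filter, which is what makes it updatable from a set of servers.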
34. Remember “For the cost of an Oracle License”
Thousands of filters
Dozens of Redis instances
TTL on a Redis key makes cleanup of old filters trivial
35. Population Bloom Filters
Unique need we had
Users access the system frequently, but I really only need to count
them once per month for billing
Tens of thousands of clients, and Finance wants the monthly report in
seconds
Logic is simple: if any bits weren't set for the key (user id),
increment the counter
Note: there are mathematical methods of estimating a BF
population but we needed a better error rate
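
The counting logic can be sketched as below, assuming it relies on SETBIT returning the bit's previous value. The class name, the `:population` key suffix and the stand-in client are hypothetical, and this sketch is not atomic across commands; a concurrent deployment would need a Lua script or transaction around the check and increment.

```python
import hashlib


class PopulationBloomFilter:
    def __init__(self, client, name: str, m: int, k: int):
        self.client, self.name, self.m, self.k = client, name, m, k

    def _positions(self, key: str) -> list[int]:
        # Assumption: derive all k indexes from two halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> bool:
        """Set the key's bits; count the key only if at least one bit was not set before."""
        previous = [self.client.setbit(self.name, pos, 1) for pos in self._positions(key)]
        is_new = not all(previous)
        if is_new:
            self.client.incr(self.name + ":population")
        return is_new

    def population(self) -> int:
        return int(self.client.get(self.name + ":population") or 0)


class InMemoryRedis:
    """Hypothetical stand-in mimicking the return values of SETBIT, INCR and GET."""

    def __init__(self):
        self.bits, self.counters = set(), {}

    def setbit(self, name, offset, value):
        old = 1 if (name, offset) in self.bits else 0
        if value:
            self.bits.add((name, offset))
        else:
            self.bits.discard((name, offset))
        return old

    def incr(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1
        return self.counters[name]

    def get(self, name):
        return self.counters.get(name)


monthly = PopulationBloomFilter(InMemoryRedis(), "bf:billing:2014-09", m=1 << 20, k=7)
results = [monthly.add("user:1"), monthly.add("user:1"), monthly.add("user:2")]
```

A false positive makes the count slightly low (a new user looks already counted), never high, and the monthly report is a single counter read.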
36. Example Uses of Bloom Filters
Webcache – what URLs are already in the cache on another
server?
P2P networks – what node contains which part of the file?
Databases
– Do keys exist in this page? If not, don't load the page
– HBase uses them to detect which blocks do not have the data (HDFS is write-once)
– Many RDBMS use them internally to 'fail fast' and not load pages into memory
– Sadly, no RDBMS or NoSQL I know of offers them as user data types
37. Example Uses of Bloom Filters
Ad networks (old way ...)
– Big Hadoop job hourly/nightly to determine which ads to show
based on prior behavior
– Load the filter into a common storage (disk usually)
– Ad servers load all the filters into memory and query for your
cookie id to see what to show you
38. Examples of Redis-backed BloomFilters
Has the user been here this month? If not, show them a message. A false
positive doesn't matter
White vs. Black list for IP
– Known bad IP in the filter
– Upon login, check the filter. Not found: log in. Found: check the DB to
validate the bad IP.
– A false positive leads to a DB query that returns false, but should be rare
• Ad Networks (real time BF updates based on what you searched on)
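
The IP black-list check can be sketched as follows. Everything here is an in-memory stand-in: the `bad_ip_bits` set plays the role of the filter's bitset, `known_bad_ips` plays the role of the database table, and the function names and hashing scheme are assumptions.

```python
import hashlib

M, K = 1 << 20, 7
bad_ip_bits = set()                   # stand-in for the Bloom filter's bitset
known_bad_ips = {"203.0.113.7"}       # stand-in for the authoritative database table
stats = {"db_queries": 0}


def positions(key: str) -> list[int]:
    # Assumption: derive all K indexes from two halves of one SHA-256 digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def filter_might_contain(ip: str) -> bool:
    return all(pos in bad_ip_bits for pos in positions(ip))


def db_is_bad_ip(ip: str) -> bool:
    stats["db_queries"] += 1          # the expensive call we want to avoid
    return ip in known_bad_ips


def allow_login(ip: str) -> bool:
    if not filter_might_contain(ip):
        return True                   # definitely not black-listed: no database call at all
    return not db_is_bad_ip(ip)       # possible match: confirm, since it may be a false positive


for bad_ip in known_bad_ips:
    bad_ip_bits.update(positions(bad_ip))

good_allowed = allow_login("198.51.100.1")
queries_after_good = stats["db_queries"]
bad_allowed = allow_login("203.0.113.7")
```

Only addresses that hit the filter ever reach the database, so the common case of a clean IP costs a handful of bit reads.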
39. Client side Joins
Most NoSQL stores don't support joins
Architecture may have data across multiple stores
Keep a Population Bloom Filter by day of unique users in a data
source
When needing to join, load smallest data source as the driver and
query other sources in order of size
If queries are time based and filters are available for the time,
looking up key matches can be very fast
40. Agenda
Fail Fast
What it is
Redis-backed BloomFilters
Examples
41. Fail Fast
The ability to quickly know to NOT do something expensive
Example: Black-list of IPs
Think about ways to NOT do some work
Cost of Redis servers is much less than an RDBMS license or the
cost of a good DB server with storage!
43. Be careful
Sometimes the cost of building and maintaining the structures
outweighs the benefit
Convoluted designs to avoid the database
Collect metrics on 'hits' to see if the filters are any benefit (CodaHale Metrics)
44. Example (naive)
Build a BF for ads shown to a user (hash on user id and ad id)
When the user visits, hash their user id and the top ad to display
this hour and set the bits in the BF
If any were not set, the Population count is incremented and you
display the ad
If already set, move to the next most important ad.
You now know total unique views by ad by hour
Can do total gross with a Redis Hash too!
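
A minimal in-memory sketch of the naive flow follows. The set `shown_bits` stands in for this hour's filter, the `Counter` stands in for the population counts, and the function names, key format and hashing scheme are assumptions. It checks the bits and then sets them, which has the same effect as inspecting SETBIT return values when there is a single caller.

```python
import hashlib
from collections import Counter

M, K = 1 << 20, 7
shown_bits = set()            # stand-in for this hour's Bloom filter bitset
unique_views = Counter()      # unique views per ad (the population counts)


def positions(key: str) -> list[int]:
    # Assumption: derive all K indexes from two halves of one SHA-256 digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def pick_ad(user_id: str, ranked_ads: list[str]):
    """Return the most important ad this user has not yet seen this hour, or None."""
    for ad_id in ranked_ads:
        wanted = positions(f"{user_id}:{ad_id}")
        if all(pos in shown_bits for pos in wanted):
            continue                      # probably shown already: move to the next ad
        shown_bits.update(wanted)         # some bit was unset: record the view ...
        unique_views[ad_id] += 1          # ... and bump the population for this ad
        return ad_id
    return None


ads = ["adA", "adB"]
picks = [pick_ad("u1", ads), pick_ad("u1", ads), pick_ad("u1", ads), pick_ad("u2", ads)]
```

Each ad tried costs one round trip in this form, which is the inefficiency the next slide addresses.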
45. Example – smarter
Hash the top 10 ad ids to the user id and parallel request (Pipeline)
Check the return to see which ones aren't set, submit an update
request and set the population
2 round trips to check 10 ads.
(Can also do this in Lua in 1 round trip)
46. Example – part 2
Same idea as before, but build the bloom filter for each hour
When user visits, query last 6 filters in parallel (pipeline!) to see if
they've seen the ad(s).
Redis TTL on the hourly filter will drop it automatically when it
becomes too old
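
The hourly scheme can be sketched in memory as below. The dictionary keyed by hour stands in for one Redis key per hour, `expire_old` emulates what a TTL on each key would do, and the names, window size and hashing scheme are assumptions.

```python
import hashlib

M, K, WINDOW_HOURS = 1 << 20, 7, 6
hourly_bits = {}    # hour number -> set of bits; stand-in for one Redis key per hour


def positions(key: str) -> list[int]:
    # Assumption: derive all K indexes from two halves of one SHA-256 digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def record_view(user_id: str, ad_id: str, hour: int) -> None:
    hourly_bits.setdefault(hour, set()).update(positions(f"{user_id}:{ad_id}"))


def expire_old(now_hour: int) -> None:
    """Emulate the Redis TTL: drop filters that have fallen out of the window."""
    for hour in [h for h in hourly_bits if h <= now_hour - WINDOW_HOURS]:
        del hourly_bits[hour]


def seen_recently(user_id: str, ad_id: str, now_hour: int) -> bool:
    expire_old(now_hour)
    wanted = positions(f"{user_id}:{ad_id}")
    recent = range(now_hour - WINDOW_HOURS + 1, now_hour + 1)
    # With Redis, these six lookups could be sent together in one pipeline.
    return any(all(p in hourly_bits.get(h, ()) for p in wanted) for h in recent)


record_view("u1", "adA", hour=100)
```

Nothing is ever deleted from a filter; old data disappears because the whole hourly filter does.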
47. Example 3
Collect lots of data about users (such as virtual cows, farm land,
chickens etc.)
Run a predictive model on the data and identify which special
offers to show when the user visits again. Store user ids in a Bloom Filter
Load the BF into Redis
Query each time the user logs in and display appropriate offer
No massive database insert/updates to flag who should see it
False positive isn't too bad
48. Example 4 – Query optimization
Client-side joins
Ask the Bloom Filter if the user has performed the action (filters
for hour, day, week of year etc.)
If not, don't even call the data source
May need to read some extra data due to 'in the last 11 days' but
asking the BF and being told 'no' prevents ANY data source
resources from being used
What if the BF is lost? Rebuild it from the base events (Hadoop!)
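
A minimal sketch of the idea, using one filter per day: `daily_bits` stands in for the filters, `events` stands in for the expensive data source and the base events, and all names and the hashing scheme are assumptions.

```python
import hashlib

M, K = 1 << 20, 7
daily_bits = {}                     # day -> set of bits; stand-in for one Bloom filter per day
events = {}                         # day -> list of (user_id, action); stand-in for the data source
stats = {"source_queries": 0}


def positions(key: str) -> list[int]:
    # Assumption: derive all K indexes from two halves of one SHA-256 digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def record(day: int, user_id: str, action: str) -> None:
    events.setdefault(day, []).append((user_id, action))
    daily_bits.setdefault(day, set()).update(positions(f"{user_id}:{action}"))


def query_source(day: int, user_id: str, action: str) -> list:
    stats["source_queries"] += 1    # the expensive call
    return [e for e in events.get(day, []) if e == (user_id, action)]


def find_events(user_id: str, action: str, days) -> list:
    wanted = positions(f"{user_id}:{action}")
    hits = []
    for day in days:
        if all(p in daily_bits.get(day, ()) for p in wanted):   # 'no' means skip the source
            hits.extend(query_source(day, user_id, action))
    return hits


def rebuild_filters() -> None:
    """If the filters are lost, replay the base events to recreate them."""
    daily_bits.clear()
    for day, rows in events.items():
        for user_id, action in rows:
            daily_bits.setdefault(day, set()).update(positions(f"{user_id}:{action}"))


record(1, "u1", "purchase")
record(3, "u2", "purchase")
last_11_days = range(1, 12)
```

Asking about eleven days touches the data source only for the days whose filter says 'maybe'; a user who never performed the action costs no source queries at all.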
49. Conclusion
Redis is a very fast, very simple and very powerful key-value
store (a "data structure server")
Bloom Filters have lots of applications when you want to quickly
look up if one of millions of 'things' happened
Redis-backed BloomFilters make updatable bloom filters trivial to
use
Think about what you need to know to NOT do an expensive
operation
Fail fast