Redis and Bloom Filters - Atlanta Java Users Group 9/2014

Failing Fast with Redis-backed Bloom Filters
• Christopher Curtin
• Head of Technical Research
• @ChrisCurtin

About Me
 25+ years in technology
 Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)
 Built a SaaS platform before the term ‘SaaS’ was being used
 Prior to Silverpop: real-time control systems, factory automation and
warehouse management
 Always looking for technologies and algorithms to help with our
challenges

Silverpop Open Positions
 Technical Lead
 Senior Engineer
 Architect
 Automation Engineers

Agenda
 Redis
 Bloom Filters
 Failing Fast

Agenda
 Redis
 What it is
 Why we started looking at using it
 Basics
 Concurrency
 Operational Considerations
 Challenges

Redis – What is it?
From redis.io:
"Redis is an open source, BSD licensed, advanced key-value cache and store.
It is often referred to as a data structure server since keys can contain strings,
hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."

Hyper-what-what?
HyperLogLog
Approximation technique for counting distinct entries in a set.
Very small memory footprint for rough approximations (about 12 KB per key in Redis, with a standard error under 1%)
Nice – but too much loss for what we need

Features
• Unlike typical key-value stores, you can send commands to edit the
value on the server vs. reading back to the client, updating and
pushing to the server
• pub/sub
• TTL on keys
• Clustering and automatic fail-over
• Lua scripting
• Client libraries for just about any language you can think of

So Why did we start looking at NoSQL?
“For the cost of an Oracle Enterprise license I can give you 64 cores
and 3 TB of memory”

Redis Basics
 In Memory-only key-value store
 Single Threaded. Yes, Single Threaded
 No Paging, no reading from disk
 CS 101 data structures and operations
 Tens of millions of keys isn't a big deal
 The amount of RAM determines how big the store can get

Basic DataTypes
 String
 Hashes
 Lists
 Sets and Sorted Sets
CS 101 ...

Hashes
- collection of key-value pairs with a single name
- useful for storing data under a common name
- values can only be strings or numbers; no hash of lists
http://redis.io/commands/hget

Sets and Sorted Sets
 Buckets of values with very fast membership look-up
 No duplicates allowed
 Sorted Sets have scores to make them sortable
– Automatically keeps them in order for fast 'top x' look ups
http://redis.io/commands/zadd
http://redis.io/commands/zrange

Lists
 Most interesting due to how operations are applied to the remote
store
 Unbounded (except by memory)
 Atomic operations between lists (pop from one, push to another)
 CS 101: lpush, rpush, lpop, lrange etc.
 Advanced: blocking pops
http://redis.io/commands/rpush
http://redis.io/commands/rpoplpush

Concurrency
 Single threaded
 Each operation can work on one or two keys, atomically
 Pipelines send a batch of commands in a single server request and run them in sequence (wrap the batch in MULTI/EXEC if no other client's commands may interleave)
 Pipelines do not allow for logic between commands
 Lua scripts allow for logic between commands
 BE CAREFUL with Lua, scripts block all clients!

Pipeline Java Example
 BloomFilterRedis.java line 43

Lua Example
 Lua-scripts example

Operational Information
 Persistence can be 'none', journal (AOF) or point in time (RDB)
 Optional Master/Slave replication
 Home-grown HA platform (Sentinel)
 Common deployment model is lots of instances per machine
 Millions of keys gets hard to manage – build 'directory' hashes to
make it easier for operations to find keys to look at

Challenges with Redis
 Key Explosion – single name space
 Lua scripts can block all other users
 Pipelines can block all other users
 No nested data types (I want a hash of lists!)
 Without name spaces be cautious of how you define key names

Concurrency Demo – JMS replacement
 Client submits a request to the queue (LPUSH)
 Consumer application polls for work when worker is available
(RPOPLPUSH)
 Worker executes the task assigned to it
 When worker is done, its list is removed
 Lather, Rinse, Repeat
 (We provide a hash of workers for Operations to query for
monitoring)
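
The queue flow above can be sketched without a Redis server. This is a minimal in-memory stand-in, assuming nothing about the demo's actual code: the class name, method names and key names ("queue", "worker:1") are hypothetical, and the synchronized methods only imitate the atomicity Redis gives LPUSH and RPOPLPUSH.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** In-memory imitation of the Redis list commands used by the queue demo (illustrative only). */
final class ListQueueSketch {
    private final Map<String, Deque<String>> lists = new HashMap<>();

    private Deque<String> list(String key) {
        return lists.computeIfAbsent(key, k -> new ArrayDeque<>());
    }

    /** Like LPUSH: add a value at the head of the list. */
    synchronized void lpush(String key, String value) {
        list(key).addFirst(value);
    }

    /** Like RPOPLPUSH: move the tail of source to the head of destination in one step. */
    synchronized String rpoplpush(String source, String destination) {
        String value = list(source).pollLast();
        if (value != null) {
            list(destination).addFirst(value);
        }
        return value; // null when the source list is empty
    }

    /** Like DEL: the worker's list is removed once the task is done. */
    synchronized void del(String key) {
        lists.remove(key);
    }

    /** Like LLEN: number of entries in the list, 0 if it does not exist. */
    synchronized int llen(String key) {
        Deque<String> d = lists.get(key);
        return d == null ? 0 : d.size();
    }
}
```

Because the task moves to the worker's own list in the same step that removes it from the queue, a task that is in flight is always visible in exactly one list, which is what makes monitoring and recovery straightforward.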

Agenda
 Bloom Filters
 What they are
 Why we started looking at using them
 Basics
 False Positives
 Example Uses
 Why not do this in a database?

Bloom Filters
From Wikipedia (Don't tell my kid's teacher!)
"A Bloom filter is a space-efficient probabilistic data structure,
conceived by Burton Howard Bloom in 1970, that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not, thus a Bloom filter has a 100%
recall rate"

Hashing
 Apply 'x' hash functions to the key to be stored/queried
 Each function returns the position of a bit to set in the bitset
 Mathematical equations to determine how big to make the bitset,
how many functions to use and your acceptable error level
 http://hur.st/bloomfilter?n=4&p=1.0E-20
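
The sizing equations mentioned above are the standard Bloom filter formulas: for n expected keys and a target false-positive probability p, the bitset needs m = -n ln(p) / (ln 2)^2 bits and works best with k = (m / n) ln 2 hash functions. A minimal sketch of that arithmetic follows; the class and method names are hypothetical.

```java
/** Standard Bloom filter sizing arithmetic (illustrative helper, not from any library). */
final class BloomSizing {
    private BloomSizing() {}

    /** Bits needed (m) for n expected keys at false-positive probability p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions (k) that minimises false positives for m bits and n keys. */
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Expected false-positive rate once n keys are stored: (1 - e^(-k*n/m))^k. */
    static double expectedFalsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }
}
```

For one million keys at a 1% error level this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.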

False Positives
 Perfect hash functions aren't worth the cost to develop
 Sometimes existing bits for a key are set by many other keys
 Make sure you understand the business impact of a false positive
 Remember, never a false negative
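
To make the hashing and false-positive behaviour concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration, not the talk's code: the class name is hypothetical, and the k hash functions are simulated by combining CRC32 with String.hashCode (double hashing), which is convenient but not a production-quality choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Illustrative in-memory Bloom filter; names and hash choice are assumptions. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;   // number of bits in the bitset (m)
    private final int hashes; // number of hash functions (k)

    SimpleBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** Position chosen by the i-th simulated hash function. */
    private int position(String key, int i) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long h1 = crc.getValue();                      // 0 .. 2^32-1
        long h2 = (key.hashCode() & 0xffffffffL) | 1L; // odd and non-negative
        return (int) ((h1 + i * h2) % size);
    }

    /** Sets the k bits for the key. */
    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** False means "definitely not added"; true means "probably added". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A key that was added always finds its own bits set, so a false negative cannot happen. A false positive happens when every bit for an unseen key was already set by other keys.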

Creation
 Libraries are available for every language I looked up (even
JavaScript)
 Some are built in memory, for a single process/JVM to use
 Read-only (ad networks) are built using Hadoop and loaded into
memory
 In memory is great for lots of reads, single process/JVM etc.
 But ...

Updates
 Updating a 16 MB structure in memory and persisting to disk is
expensive
 8 bits change and you write 16 MB!!!!!! (DBAs will love you …)

Deletes
 Not possible in a regular Bloom Filter – how would you know what bits
are used by other keys?
 Counting Bloom Filters keep a few bits (3-4) per position in the bitmap as a counter. 'delete' decrements the counters for the key
 Not as space friendly any more …
 Instead, consider having bloom filters based around the lifetime of the
data to be queried
– For a filter 'visited in the last 4 hours' have 4 filters and age the oldest
out (TTL in Redis maybe ...)
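
A counting filter can be sketched as follows. This is a simplified illustration under stated assumptions: it spends a full int per position instead of packing 3-4 bits, the class name is hypothetical, and the hash mixing is an arbitrary choice made for brevity.

```java
/** Illustrative counting Bloom filter; uses one int per position for simplicity. */
final class CountingBloomFilter {
    private final int[] counters;
    private final int hashes;

    CountingBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.counters = new int[size];
        this.hashes = hashes;
    }

    /** Position chosen by the i-th simulated hash function (arbitrary mixing, an assumption). */
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (Integer.rotateLeft(h1, 15) * 0x5bd1e995) | 1;
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (counters[position(key, i)] <= 0) {
                return false;
            }
        }
        return true;
    }

    /** Increments every counter the key maps to. */
    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            counters[position(key, i)]++;
        }
    }

    /** Decrements the key's counters; only safe for keys that were really added. */
    boolean remove(String key) {
        if (!mightContain(key)) {
            return false;
        }
        for (int i = 0; i < hashes; i++) {
            counters[position(key, i)]--;
        }
        return true;
    }
}
```

Removing a key that was never added (a false positive) would corrupt the counters of other keys, which is one more reason to prefer filters scoped to the lifetime of the data.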

Issue: Persistence
 Load a 16 MB filter from database to check 6 bits?
 Worse: update 6 bits in a 16 MB filter
 DBAs will not be happy
– Undo/redo
– SGA misses, page faults
– Backups, replication traffic etc.

Why were we interested in Bloom Filters?
 Found a lot of places where we went to the database only to find the data didn't exist
 Found lots of places where we want to know if a user DIDN'T do
something

Persistent Bloom Filters
 We needed persistent Bloom Filters for lots of user stories
 Found Orestes-BloomFilter on GitHub that used Redis as a store
and enhanced it
 Added population filters
 Fixed a few bugs
 Did a pull request and it was accepted!

Benefits
 Filters are stored in Redis
• Only SETBIT/GETBIT calls to the server
 Reads and updates of the filter from a set of application servers
 Persistence has a cost, but a fraction of the RDBMS costs
 Can load a BF created offline and begin using it

Remember “For the cost of an Oracle License”
 Thousands of filters
 Dozens of Redis instances
 TTL on a Redis key makes cleanup of old filters trivial

Population Bloom Filters
 Unique need we had
 Users access the system frequently, but I really only need to count
them once per month for billing
 Tens of thousands of clients, and Finance wants the monthly report in seconds
 Logic is simple: if any bits weren't set for the key (user id),
increment the counter
 Note: there are mathematical methods of estimating a BF population but we needed a better error rate
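
The counting rule above can be sketched in memory as follows. The class name and hash mixing are hypothetical. Against Redis the same check is cheap because SETBIT returns the bit's previous value, so setting the bits and detecting a new key happen in the same calls.

```java
import java.util.BitSet;

/** Illustrative population-counting Bloom filter; names and hash choice are assumptions. */
final class PopulationBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;
    private long population; // number of keys judged to be new when added

    PopulationBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** Position chosen by the i-th simulated hash function (arbitrary mixing, an assumption). */
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (Integer.rotateLeft(h1, 15) * 0x5bd1e995) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Sets the key's bits; counts the key once if any bit was previously unset. */
    boolean add(String key) {
        boolean isNew = false;
        for (int i = 0; i < hashes; i++) {
            int pos = position(key, i);
            if (!bits.get(pos)) {
                isNew = true;
                bits.set(pos);
            }
        }
        if (isNew) {
            population++;
        }
        return isNew;
    }

    long population() {
        return population;
    }
}
```

A false positive makes the count slightly low, never high, so the filter should be sized for the error rate the billing report can tolerate.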

Example Uses of Bloom Filters
 Webcache – what URLs are already in the cache on another
server?
 P2P networks – what node contains which part of the file?
 Databases
– Do keys exist in this page? If not, don't load the page
– HBase uses them to detect which blocks do not have the data (HDFS is write-once)
– Many RDBMS use them internally to 'fail fast' and not load pages into memory
– Sadly, no RDBMS or NoSQL I know of offers them as user data types

Example Uses of Bloom Filters
 Ad networks (old way ...)
– Big Hadoop job hourly/nightly to determine which ads to show
based on prior behavior
– Load the filter into a common storage (disk usually)
– Ad servers load all the filters into memory and query for your
cookie id to see what to show you

Examples of Redis-backed BloomFilters
 Has the user been here this month? If not, show them a message. A false positive doesn't matter
 White vs. Black list for IP
– Known bad IP in the filter
– Upon login check the filter. Not found, login. Found – check DB to
validate bad IP.
– A false positive leads to a query that returns false, but should be rare
 Ad networks (real-time BF updates based on what you searched on)
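
The IP check above can be sketched like this. The two predicates are stand-ins, one for the Redis-backed filter and one for the database lookup; the class and field names are hypothetical.

```java
import java.util.function.Predicate;

/** Illustrative fail-fast login check; both predicates are stand-ins for real lookups. */
final class FailFastLogin {
    private final Predicate<String> filterMightContain; // cheap Bloom filter query
    private final Predicate<String> databaseSaysBad;    // expensive authoritative query
    int databaseCalls = 0;                               // exposed to show how often we fall through

    FailFastLogin(Predicate<String> filterMightContain, Predicate<String> databaseSaysBad) {
        this.filterMightContain = filterMightContain;
        this.databaseSaysBad = databaseSaysBad;
    }

    /** Returns true when the login may proceed. */
    boolean allowLogin(String ip) {
        if (!filterMightContain.test(ip)) {
            return true; // never a false negative, so the IP is definitely not listed
        }
        databaseCalls++; // possible hit: confirm against the database
        return !databaseSaysBad.test(ip);
    }
}
```

Only IPs the filter flags ever reach the database, so the expensive query runs for known bad IPs plus the occasional false positive.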

Client side Joins
 Most NoSQL stores don't support joins
 Architecture may have data across multiple stores
 Keep a Population Bloom Filter by day of unique users in a data
source
 When needing to join, load smallest data source as the driver and
query other sources in order of size
 If queries are time based and filters are available for the time,
looking up key matches can be very fast

Agenda
 Fail Fast
 What it is
 Redis-backed BloomFilters
 Examples

Fail Fast
 The ability to quickly know to NOT do something expensive
 Example: Black-list of IPs
 Think about ways to NOT do some work
 Cost of Redis servers is much less than an RDBMS license or the
cost of a good DB server with storage!

Be careful
 Sometimes the cost of building and maintaining the structures
outweighs the benefit
 Convoluted designs to avoid the database
 Collect metrics on 'hits' to see if the filters are of any benefit (CodaHale Metrics)

Example (naive)
 Build a BF for ads shown to a user (hash on user id and ad id)
 When the user visits, hash their user id and the top ad to display
this hour and set the bits in the BF
 If any were not set, the Population count is incremented and you
display the ad
 If already set, move to the next most important ad.
 You now know total unique views by ad by hour
 Can do total gross with a Redis Hash too!

Example – smarter
 Hash the top 10 ad ids with the user id and request them in parallel (Pipeline)
 Check the return to see which ones aren't set, submit an update
request and set the population
 2 round trips to check 10 ads.
 (Can also do this in Lua in 1 round trip)

Example – part 2
 Same idea as before, but build the bloom filter for each hour
 When user visits, query last 6 filters in parallel (pipeline!) to see if
they've seen the ad(s).
 Redis TTL on the hourly filter will drop it automatically when it
becomes too old
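
The hourly-filter idea can be sketched in memory as below. Everything here is an assumption made for illustration: the class name, the hash mixing, and the explicit cleanup in record(), which stands in for the Redis TTL that would drop an old filter automatically.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Illustrative set of per-hour Bloom filters with a sliding window. */
final class HourlyFilters {
    private final int size;
    private final int hashes;
    private final int hoursToKeep;
    private final TreeMap<Long, BitSet> byHour = new TreeMap<>();

    HourlyFilters(int size, int hashes, int hoursToKeep) {
        this.size = size;
        this.hashes = hashes;
        this.hoursToKeep = hoursToKeep;
    }

    /** Position chosen by the i-th simulated hash function (arbitrary mixing, an assumption). */
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (Integer.rotateLeft(h1, 15) * 0x5bd1e995) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Sets the key's bits in the filter for the given hour. */
    void record(String key, long hour) {
        BitSet bits = byHour.computeIfAbsent(hour, h -> new BitSet(size));
        for (int i = 0; i < hashes; i++) {
            bits.set(position(key, i));
        }
        // Stand-in for the Redis TTL: drop filters that fell out of the window.
        byHour.headMap(hour - hoursToKeep + 1).clear();
    }

    /** Checks every filter inside the window ending at currentHour. */
    boolean seenRecently(String key, long currentHour) {
        long oldest = currentHour - hoursToKeep + 1;
        for (BitSet bits : byHour.subMap(oldest, true, currentHour, true).values()) {
            boolean allSet = true;
            for (int i = 0; i < hashes && allSet; i++) {
                allSet = bits.get(position(key, i));
            }
            if (allSet) {
                return true;
            }
        }
        return false;
    }

    int filterCount() {
        return byHour.size();
    }
}
```

With Redis the per-filter loop in seenRecently would become one pipelined batch of GETBIT calls across the hourly keys.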

Example 3
 Collect lots of data about users (such as virtual cows, farm land,
chickens etc.)
 Run a predictive model on the data and identify which special offers to show when the user visits again. Store user ids in a Bloom Filter
 Load the BF into Redis
 Query each time the user logs in and display appropriate offer
 No massive database insert/updates to flag who should see it
 False positive isn't too bad

Example 4 – Query optimization
 Client-side joins
 Ask the Bloom Filter if the user has performed the action (filters
for hour, day, week of year etc.)
 If not, don't even call the data source
 May need to read some extra data due to 'in the last 11 days' but asking the BF and being told 'no' prevents ANY data source resources from being used
 What if the BF is lost? Rebuild it from the base events (Hadoop!)

Conclusion
 Redis is a very fast, very simple and very powerful key-value store (a “data structure server”)
 Bloom Filters have lots of applications when you want to quickly
look up if one of millions of 'things' happened
 Redis-backed BloomFilters make updatable bloom filters trivial to
use
 Think about what you need to know to NOT do an expensive
operation
 Fail fast

References
 http://redis.io
 http://en.wikipedia.org/wiki/Bloom_filter
 http://hur.st/bloomfilter?n=4&p=1.0E-20
 https://github.com/Baqend/Orestes-Bloomfilter
 http://www.slideshare.net/chriscurtin
 @ChrisCurtin on Twitter
 https://github.com/chriscurtin
