Notes:
• Not limiting ourselves to current tooling
• Reasonable variations of existing tooling
are acceptable
• Interested in what’s fundamentally possible
Approach #1
• Use Key->Set database
• Key = [URL, hour bucket]
• Value = Set of UserIDs
Approach #1
• Queries:
• Get all sets for all hours in range of
query
• Union sets together
• Compute count of merged set
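A minimal sketch of the query steps above, with an in-memory dict standing in for the Key->Set database (the function names are illustrative, not from the talk):

```python
from collections import defaultdict

# Key = (url, hour_bucket), Value = set of UserIDs.
# A plain dict stands in for the Key->Set database.
pageviews = defaultdict(set)

def record_pageview(url, hour_bucket, user_id):
    # Write path: add the visitor to the set for that URL and hour.
    pageviews[(url, hour_bucket)].add(user_id)

def uniques_over_range(url, start_hour, end_hour):
    # Query path: fetch every hourly set in the range, union them,
    # and count the merged set.
    merged = set()
    for hour in range(start_hour, end_hour + 1):
        merged |= pageviews[(url, hour)]
    return len(merged)
```

Every query touches one stored set per hour in the range, which is where the lookup and merge costs called out on the next slide come from.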
Approach #1
• Lots of database lookups for large ranges
• Potentially a lot of items in sets, so lots of
work to merge/count
• Database will use a lot of space
Approach #2
• Use Key->HyperLogLog database
• Key = [URL, hour bucket]
• Value = HyperLogLog structure
Approach #2
• Queries:
• Get all HyperLogLog structures for all
hours in range of query
• Merge structures together
• Retrieve count from merged structure
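A sketch of the same flow with HyperLogLog values, using the datasketch Python package as a stand-in for a database that stores HLL structures natively (as the notes later point out, such a database is not a stretch to imagine):

```python
from collections import defaultdict
from datasketch import HyperLogLog

# Key = (url, hour_bucket), Value = HyperLogLog sketch.
sketches = defaultdict(HyperLogLog)

def record_pageview(url, hour_bucket, user_id):
    # Each hourly value is a small, fixed-size sketch rather than a full set.
    sketches[(url, hour_bucket)].update(user_id.encode("utf8"))

def uniques_over_range(url, start_hour, end_hour):
    # Merge the sketches for every hour in the range and read the
    # estimated distinct count off the merged sketch.
    merged = HyperLogLog()
    for hour in range(start_hour, end_hour + 1):
        merged.merge(sketches[(url, hour)])
    return merged.count()
```

The result is an estimate rather than an exact count, which is the mild accuracy tradeoff on the next slide.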
Approach #2
• Much more efficient use of storage
• Less work at query time
• Mild accuracy tradeoff
Approach #3
• Use Key->HyperLogLog database
• Key = [URL, bucket, granularity]
• Value = HyperLogLog structure
Approach #3
• Queries:
• Compute minimal number of database
lookups to satisfy range
• Get all HyperLogLog structures in range
• Merge structures together
• Retrieve count from merged structure
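A sketch of the read-time planning step, simplified to just two granularities (hour and day); the greedy helper below is an assumption for illustration, not something prescribed by the talk:

```python
HOURS_PER_DAY = 24

def plan_lookups(start_hour, end_hour):
    """Cover [start_hour, end_hour) with as few buckets as possible,
    preferring day buckets over hour buckets."""
    lookups = []  # list of (granularity, bucket) keys to fetch
    h = start_hour
    while h < end_hour:
        if h % HOURS_PER_DAY == 0 and h + HOURS_PER_DAY <= end_hour:
            # An aligned, fully contained day: one lookup covers 24 hours.
            lookups.append(("day", h // HOURS_PER_DAY))
            h += HOURS_PER_DAY
        else:
            lookups.append(("hour", h))
            h += 1
    return lookups
```

The sketches fetched for these keys are then merged and counted exactly as in Approach #2; a day-aligned 30-day range becomes 30 lookups instead of 720, and week and month granularities shrink it further.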
Approach #3
• All benefits of #2
• Minimal number of lookups for any range,
so less variation in latency
• Minimal increase in storage
• Requires more work at write time
Approach #2
• [URL, bucket] -> Set of UserIDs
• Like Approach 1, incrementally normalize UserIDs
• UserID -> PersonID
Approach #2
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to
produce new set
• Get count of new set
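A sketch of this read path, with plain dicts standing in for the [URL, bucket] -> Set index and the UserID -> PersonID index:

```python
def unique_people_over_range(pageviews, user_to_person, url, start_hour, end_hour):
    # pageviews: dict (url, hour_bucket) -> set of UserIDs
    # user_to_person: dict UserID -> PersonID
    merged_users = set()
    for hour in range(start_hour, end_hour + 1):
        merged_users |= pageviews.get((url, hour), set())
    # Normalize so a person who browsed under several UserIDs counts once;
    # UserIDs with no recorded equiv normalize to themselves.
    people = {user_to_person.get(u, u) for u in merged_users}
    return len(people)
```

This pushes the normalization work to read time, which is why the notes describe it as offloading a lot of work to reads that is still expensive.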
Attempt 1:
• Maintain index from UserID -> PersonID
• When receive A <-> B:
• Find what they’re each normalized to,
and transitively normalize all reachable
IDs to “smallest” val
Attempt 2:
• UserID -> PersonID
• PersonID -> Set of UserIDs
• When receive A <-> B
• Find what they’re each normalized to, and
choose one for both to be normalized to
• Update all UserIDs in both normalized sets
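A sketch of Attempt 2 with plain dicts standing in for the two indexes, normalizing to the smallest ID as in Attempt 1:

```python
user_to_person = {}   # UserID -> PersonID
person_to_users = {}  # PersonID -> set of UserIDs normalized to it

def handle_equiv(a, b):
    # Find what each side is currently normalized to (itself if unseen).
    pa = user_to_person.get(a, a)
    pb = user_to_person.get(b, b)
    if pa == pb:
        return
    # Choose one PersonID for both groups to be normalized to.
    keep, drop = (pa, pb) if pa < pb else (pb, pa)
    members = person_to_users.setdefault(keep, {keep}) | person_to_users.pop(drop, {drop})
    # Update every UserID in both normalized sets to point at the survivor.
    for u in members:
        user_to_person[u] = keep
    person_to_users[keep] = members
```

As the notes mention, if equivs like 4<->3 and 3<->1 arrive at the same time, this read-modify-write sequence needs some form of locking so the updates don't step on each other.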
Normalization vs Denormalization

Normalized schema:

Users:
  ID | Name   | Location ID
   1 | Sally  | 3
   2 | George | 1
   3 | Bob    | 3

Locations:
  Location ID | City      | State | Population
            1 | New York  | NY    | 8.2M
            2 | San Diego | CA    | 1.3M
            3 | Chicago   | IL    | 2.7M

Denormalized schema:

Users:
  ID | Name   | Location ID | City     | State
   1 | Sally  | 3           | Chicago  | IL
   2 | George | 1           | New York | NY
   3 | Bob    | 3           | Chicago  | IL

Locations:
  Location ID | City      | State | Population
            1 | New York  | NY    | 8.2M
            2 | San Diego | CA    | 1.3M
            3 | Chicago   | IL    | 2.7M
Approach #1
• Use the exact same approach as we did in the fully incremental implementation
• Query performance only degraded for
recent buckets
• e.g., “last month” range computes vast
majority of query from efficient batch
indexes
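One way to picture the query path here, assuming a cutoff hour marking the newest bucket covered by the batch views; everything at or before the cutoff is served from the batch indexes, the rest from the realtime layer (the split logic and the names are assumptions, not from the talk):

```python
from datasketch import HyperLogLog

def uniques_over_range(batch_view, realtime_view, last_batch_hour,
                       url, start_hour, end_hour):
    # batch_view / realtime_view: dict (url, hour_bucket) -> HyperLogLog sketch
    merged = HyperLogLog()
    for hour in range(start_hour, end_hour + 1):
        source = batch_view if hour <= last_batch_hour else realtime_view
        sketch = source.get((url, hour))
        if sketch is not None:
            merged.merge(sketch)
    return merged.count()
```

For a "last month" query only the handful of recent hours past the cutoff ever touch the realtime layer; everything else comes from the precomputed batch indexes.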
Approach #1
• Relatively small number of buckets in
realtime layer
• So not that much effect on storage costs
Approach #1
• Complexity of realtime layer is softened by
existence of batch layer
• Batch layer continuously overrides realtime
layer, so mistakes are auto-fixed
Approach #1
• Still going to be a lot of work to implement
this realtime layer
• Recent buckets with lots of uniques will
still cause bad query performance
• No way to apply recent equivs to batch
views without restructuring batch views
Incremental compaction
• Databases write to write-ahead log before
modifying disk and memory indexes
• Need to occasionally compact the log and
indexes
Incremental compaction
• Notorious for causing huge, sudden
changes in performance
• Machines can seem locked up
• Necessitated by random writes
• Extremely complex to deal with
More Complexity
• Dealing with CAP / eventual consistency
• “Call Me Maybe” blog posts found data loss
problems in many popular databases
• Redis
• Cassandra
• Elasticsearch
Lambda Architecture
• This is the most basic form of it
• Many variants of it incorporating more
and/or different kinds of layers
Editor's Notes
clear up confusion around it. lambda architecture addresses a lot of nasty, fundamental complexities that aren’t talked about enough
most of the talk won’t even mention LA; we’ll work on an example problem and you’ll see LA naturally emerge
this isn’t even capable of solving the problem we’re going to look at
i want this talk to be interactive...
going deep into technical details
please do not hesitate to jump in with any questions
uniques for just hour 1 = 3
uniques for hours 1 and 2 = 3
uniques for hours 1 to 3 = 5
uniques for hours 2 to 4 = 4
synchronous
asynchronous
characterized by maintaining state incrementally as data comes in and serving queries off of that same state
1 KB is enough to estimate set sizes up to 1B with only 2% error
it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t require fetching the entire set
example: in 1 month there are ~720 hours, 30 days, 4 weeks, and 1 month... adding all granularities makes 755 stored values total instead of 720, only a ~4.8% increase in storage
except now userids should be normalized, so if there’s an equiv, that user only appears once even if they appear under multiple ids
an equiv can change ANY or ALL buckets in the past
will get back to incrementally updating userids
offload a lot of the work to read time
this is still a lot of work at read time
overall
if using a distributed database to store indexes and computing everything concurrently
when you receive equivs for 4<->3 and 3<->1 at the same time, you’ll need some sort of locking so they don’t step on each other
e.g. granularities, the 2 indexes for user id normalization... we know it’s a bad idea to store the same thing in multiple places... it opens up the possibility of them getting out of sync if you don’t handle every case perfectly
If you have a bug that accidentally sets the second value of all equivs to 1, you’re in trouble
even the version without equivs suffers from these problems
2 functions: produce water of a certain strength, and produce water of a certain temperature
faucet on left gives you “hot” and “cold” inputs which each affect BOTH outputs - complex to use
faucet on right gives you independent “heat” and “strength” inputs, so SIMPLE to use
neither is very complicated
so just a quick overview of denormalization, here’s a schema that stores user information and location information
each is in its own table, and a user’s location is a reference to a row in the location table
this is pretty standard relational database stuff
now let’s say a really common query is getting the city and state a person lives in
to do this you have to join the tables together as part of your query
you might find joins are too expensive, they use too many resources
so you denormalize the schema for performance
you redundantly store the city and state in the users table to make that query faster, cause now it doesn’t require a join
now obviously, this sucks. the same data is now stored in multiple places, which we all know is a bad idea
whenever you need to change something about a location you need to change it everywhere it’s stored
but since people make mistakes, inevitably things become inconsistent
but you have no choice, you want to normalize, but you have to denormalize for performance
i hope you are looking at this and asking the question...
still have to compute uniques over time and deal with the equivs problem
how are we better off than before?
options for taking different approaches to problem without having to sacrifice too much
people say it does “key/value”, so I can use it when I need key/value operations... and they stop there
can’t treat it as a black box, that doesn’t tell the full story
some of his tests saw over 30% data loss during partitions
major operational simplification to not require random writes
i’m not saying you can’t make a database that does incremental compaction and deals with the other complexities of random writes well, but it’s clearly a fundamental complexity, and i feel it’s better to not have to deal with it at all
remember, we’re talking about what’s POSSIBLE, not what currently exists
my experience with elephantdb
Does not avoid any of the complexities of massive distributed r/w databases
Does not avoid any of the complexities of massive distributed r/w databases or dealing with eventual consistency
everything i’ve talked about completely generalizes, applies to both AP and CP architectures