This was a three hour workshop given at the 2011 Web 2.0 Expo in San Francisco. Due to the length of the presentation and the number of presenters, portions of the slide deck may appear disjointed without the accompanying narrative.
Abstract: "The hype cycle is at a high for cloud computing, distributed “NoSQL” data storage, and high availability map-reducing eventually consistent distributed data processing frameworks everywhere. Back in the real world we know that these technologies aren’t a cure-all. But they’re not worthless, either. We’ll take a look behind the curtains and share some of our experiences working with these systems in production at SimpleGeo.
Our stack consists of Cassandra, HBase, Hadoop, Flume, node.js, rabbitmq, and Puppet. All running on Amazon EC2. Tying these technologies together has been a challenge, but the result really is worth the work. The rotten truth is that our ops guys still wake up in the middle of the night sometimes, and our engineers face new and novel challenges. Let us share what’s keeping us busy—the folks working in the wee hours of the morning—in the hopes that you won’t have to do so yourself."
Scalable Data Storage Getting You Down? To The Cloud!
1. SCALABLE DATA STORAGE
GETTING YOU DOWN?
TO THE CLOUD!
Web 2.0 Expo SF 2011
Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop
2. THE CAST
MIKE MALONE
INFRASTRUCTURE ENGINEER
@MJMALONE
MIKE PANCHENKO
INFRASTRUCTURE ENGINEER
@MIHASYA
DEREK SMITH
INFRASTRUCTURE ENGINEER
@DSMITTS
PAUL LATHROP
OPERATIONS
@GREYTALYN
3. SIMPLEGEO
We originally began as a mobile
gaming startup, but quickly
discovered that the location services
and infrastructure needed to support
our ideas didn’t exist. So we took
matters into our own hands and
began building it ourselves.
Matt Galligan, CSO & co-founder        Joe Stump, CTO & co-founder
4. THE STACK
[Architecture diagram: HTTP traffic from the web enters through AWS ELB and an auth/proxy layer to API servers spread across data centers; writes flow through queues to record storage and index storage (Apache Cassandra); reads are served by the geocoder, reverse geocoder, GeoIP, and pushpin services; RDS also sits in the AWS layer]
6. DATABASES
WHAT ARE THEY GOOD FOR?
DATA STORAGE
Durably persist system state
CONSTRAINT MANAGEMENT
Enforce data integrity constraints
EFFICIENT ACCESS
Organize data and implement access methods for efficient
retrieval and summarization
7. DATA INDEPENDENCE
Data independence shields clients from the details
of the storage system and its underlying data structures
LOGICAL DATA INDEPENDENCE
Clients that operate on a subset of the attributes in a data set should
not be affected later when new attributes are added
PHYSICAL DATA INDEPENDENCE
Clients that interact with a logical schema remain the same despite
physical data structure changes like
• File organization
• Compression
• Indexing strategy
8. TRANSACTIONAL RELATIONAL
DATABASE SYSTEMS
HIGH DEGREE OF DATA INDEPENDENCE
Logical structure: SQL Data Definition Language
Physical structure: Managed by the DBMS
OTHER GOODIES
They’re theoretically pure, well understood, and mostly
standardized behind a relatively clean abstraction
They provide robust contracts that make it easy to reason
about the structure and nature of the data they contain
They’re ubiquitous, battle hardened, robust, durable, etc.
9. ACID
These terms are not formally defined - they’re a
framework, not mathematical axioms
ATOMICITY
Either all of a transaction’s actions are visible to another transaction, or none are
CONSISTENCY
Application-specific constraints must be met for transaction to succeed
ISOLATION
Two concurrent transactions will not see one another’s changes while “in flight”
DURABILITY
The updates made to the database in a committed transaction will be visible to
future transactions
10. ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good
IT DOES SOMETHING HARD FOR YOU
With ACID, you’re guaranteed to maintain a persistent global
state as long as you’ve defined proper constraints and your
logical transactions result in a valid system state
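As a small, hedged illustration (not from the talk), here is how constraints and atomicity work together using Python's built-in sqlite3: the second update violates a CHECK constraint, so neither update in the transaction is committed.

```python
# Illustrative only: a transfer that violates a constraint rolls back atomically.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
           "balance INTEGER CHECK (balance >= 0))")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

try:
    with db:  # one transaction: commits on success, rolls back on exception
        db.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        db.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # CHECK constraint failed, so bob's credit is rolled back too

print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]  (all-or-nothing: the valid half did not survive)
```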
11. CAP THEOREM
At PODC 2000 Eric Brewer told us there were three
desirable DB characteristics. But we can only have two.
CONSISTENCY
Every node in the system contains the same data (e.g., replicas are
never out of date)
AVAILABILITY
Every request to a non-failing node in the system returns a response
PARTITION TOLERANCE
System properties (consistency and/or availability) hold even when
the system is partitioned and data is lost
14. CAP THEOREM IN 30 SECONDS
[Diagram: CLIENT sends a write to SERVER, which forwards it to a REPLICA]
15. CAP THEOREM IN 30 SECONDS
[Diagram: the REPLICA acknowledges the replicated write back to SERVER]
16. CAP THEOREM IN 30 SECONDS
[Diagram: SERVER accepts the write and acknowledges the CLIENT]
17. CAP THEOREM IN 30 SECONDS
[Diagram: replication to the REPLICA fails; if SERVER rejects the write, the system is UNAVAILABLE]
18. CAP THEOREM IN 30 SECONDS
[Diagram: replication to the REPLICA fails; if SERVER accepts the write anyway, the system is INCONSISTENT]
19. ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”
Unfortunately, ANSI SQL’s definition of isolation...
relies in subtle ways on an assumption that a locking scheme is
used for concurrency control, as opposed to an optimistic or
multi-version concurrency scheme. This implies that the
proposed semantics are ill-defined.
Joseph M. Hellerstein and Michael Stonebraker
Anatomy of a Database System
20. BALANCE
IT’S A QUESTION OF VALUES
For traditional databases CAP consistency is the holy grail: it’s
maximized at the expense of availability and partition
tolerance
At scale, failures happen: when you’re doing something a
million times a second a one-in-a-million failure happens every
second
We’re witnessing the birth of a new religion...
• CAP consistency is a luxury that must be sacrificed at scale in order to
maintain availability when faced with failures
21. NETWORK INDEPENDENCE
A distributed system must also manage the
network - if it doesn’t, the client has to
CLIENT APPLICATIONS ARE LEFT TO HANDLE
Partitioning data across multiple machines
Working with loosely defined replication semantics
Detecting, routing around, and correcting network and
hardware failures
22. WHAT’S WRONG
WITH MYSQL..?
TRADITIONAL RELATIONAL DATABASES
They are from an era (er, one of the eras) when Big Iron was
the answer to scaling up
In general, the network was not considered part of the system
NEXT GENERATION DATABASES
Deconstructing, and decoupling the beast
Trying to create a loosely coupled structured storage system
• Something that the current generation of database systems never
quite accomplished
24. APACHE CASSANDRA
A DISTRIBUTED STRUCTURED STORAGE SYSTEM
EMPHASIZING
Extremely large data sets
High transaction volumes
High value data that necessitates high availability
TO USE CASSANDRA EFFECTIVELY IT HELPS TO
UNDERSTAND WHAT’S GOING ON BEHIND THE SCENES
25. APACHE CASSANDRA
A DISTRIBUTED HASH TABLE WITH SOME TRICKS
Peer-to-peer architecture with no distinguished nodes, and
therefore no single points of failure
Gossip-based cluster management
Generic distributed data placement strategy maps data to nodes
• Pluggable partitioning
• Pluggable replication strategy
Quorum based consistency, tunable on a per-request basis
Keys map to sparse, multi-dimensional sorted maps
Append-only commit log and SSTables for efficient disk utilization
26. NETWORK MODEL
DYNAMO INSPIRED
CONSISTENT HASHING
Simple random partitioning mechanism for distribution
Low fuss online rebalancing when operational requirements
change
GOSSIP PROTOCOL
Simple decentralized cluster configuration and fault detection
Core protocol for determining cluster membership and
providing resilience to partial system failure
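A minimal consistent-hashing sketch, purely illustrative (the node addresses and MD5 token function are assumptions, not Cassandra's partitioner): keys hash onto the ring and are owned by the next N distinct nodes clockwise, so adding or removing a node only moves the keys on adjacent arcs.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: token -> node, keys walk clockwise."""

    def __init__(self, nodes):
        self.tokens = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key, n=3):
        """First n distinct nodes clockwise from the key's ring position."""
        start = bisect.bisect(self.tokens, (self._token(key), ""))
        owners = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.replicas("alice"))  # the three nodes responsible for key "alice"
```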
32. GOSSIP
DISSEMINATES CLUSTER MEMBERSHIP AND
RELATED CONTROL STATE
Gossip is initiated by an interval timer
At each gossip tick a node will
• Randomly select a live node in the cluster, sending it a gossip message
• Attempt to contact cluster members that were previously marked as
down
If the gossip message is unacknowledged for some period of
time (statistically adjusted based on the inter-arrival time of
previous messages) the remote node is marked as down
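A rough, illustrative sketch of that gossip tick (not Cassandra's failure detector, which uses phi-accrual with statistically adjusted timeouts; the fixed timeout and the send_gossip callback are assumptions):

```python
import random
import time

class Gossiper:
    def __init__(self, peers, timeout=2.0):
        self.live = set(peers)              # peers currently believed up
        self.down = set()                   # peers currently believed down
        self.last_ack = {p: time.time() for p in peers}
        self.timeout = timeout              # Cassandra adjusts this statistically

    def tick(self, send_gossip):
        if self.live:
            send_gossip(random.choice(sorted(self.live)))   # gossip with one live node
        for peer in list(self.down):
            send_gossip(peer)                               # keep probing downed nodes
        now = time.time()
        for peer in list(self.live):
            if now - self.last_ack[peer] > self.timeout:    # no ack in time: mark down
                self.live.discard(peer)
                self.down.add(peer)

    def on_ack(self, peer):
        self.last_ack[peer] = time.time()
        self.down.discard(peer)
        self.live.add(peer)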
35. TUNABLE CONSISTENCY
WRITES
ZERO DON’T BOTHER WAITING FOR A RESPONSE
ANY WAIT FOR SOME NODE (NOT NECESSARILY A
REPLICA) TO RESPOND
ONE WAIT FOR ONE REPLICA TO RESPOND
QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND
ALL WAIT FOR ALL N REPLICAS TO RESPOND
36. TUNABLE CONSISTENCY
READS
ONE WAIT FOR ONE REPLICA TO RESPOND
QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND
ALL WAIT FOR ALL N REPLICAS TO RESPOND
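Because the replica count is checked per request, choosing levels so that W + R > N (for example QUORUM writes plus QUORUM reads with N = 3 replicas) guarantees that every read overlaps the most recent successful write. A hedged example using the DataStax Python driver (a later client than the Thrift interfaces available in 2011; the keyspace and table names are made up):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("geo")  # hypothetical keyspace

# Wait for a quorum of replicas (N/2 + 1) to acknowledge the write...
write = SimpleStatement(
    "INSERT INTO users (key, city) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("alice", "St. Louis"))

# ...and read at ONE where lower latency matters more than freshness.
read = SimpleStatement(
    "SELECT city FROM users WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE)
row = session.execute(read, ("alice",)).one()
```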
38. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR Asynchronously checks replicas during
reads and repairs any inconsistencies
HINTED HANDOFF
ANTI-ENTROPY
[Diagram: a write at W=2 reaches two of three replicas; a later read notices the stale replica and fixes it]
39. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
[Diagram: a write destined for a downed replica is stored on another node along with a hint]
40. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
[Diagram: the failed node returns and the hinted write is replayed to repair it]
41. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF
ANTI-ENTROPY Manual repair process where nodes
generate Merkle trees (hash trees) to detect and
repair data inconsistencies
[Diagram: replicas exchange Merkle trees and repair the ranges that differ]
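An illustrative sketch of the Merkle-tree comparison behind anti-entropy (the bucketing and hashing here are simplifications, not Cassandra's implementation): replicas exchange compact hash trees and only stream the key ranges whose hashes disagree.

```python
import hashlib

def bucket_of(key, buckets):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def leaf_hashes(rows, buckets=8):
    """Hash a replica's rows (key -> value) into per-range digests."""
    ranges = [hashlib.sha1() for _ in range(buckets)]
    for key in sorted(rows):
        ranges[bucket_of(key, buckets)].update(f"{key}={rows[key]}".encode())
    return [h.hexdigest() for h in ranges]

def merkle_root(hashes):
    """Pairwise-hash leaves upward until a single root digest remains."""
    while len(hashes) > 1:
        hashes = [hashlib.sha1((a + b).encode()).hexdigest()
                  for a, b in zip(hashes[::2], hashes[1::2])]
    return hashes[0]

def ranges_to_repair(local_rows, remote_rows):
    local, remote = leaf_hashes(local_rows), leaf_hashes(remote_rows)
    if merkle_root(local) == merkle_root(remote):
        return []                          # roots match: the replicas agree
    return [i for i, (l, r) in enumerate(zip(local, remote)) if l != r]
```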
42. DATA MODEL
BIGTABLE INSPIRED
SPARSE MATRIX it’s a hash-map (associative array):
a simple, versatile data structure
SCHEMA-FREE data model, introduces new freedom
and new responsibilities
COLUMN FAMILIES blend row-oriented and column-
oriented structure, providing a high level mechanism
for clients to manage on-disk and inter-node data
locality
43. DATA MODEL
TERMINOLOGY
KEYSPACE A named collection of column families
(similar to a “database” in MySQL) you only need one and
you can mostly ignore it
COLUMN FAMILY A named mapping of keys to rows
ROW A named sorted map of columns or supercolumns
COLUMN A <name, value, timestamp> triple
SUPERCOLUMN A named collection of columns, for
people who want to get fancy
45. IT’S A DISTRIBUTED HASH TABLE
WITH A TWIST...
COLUMNS IN A ROW ARE STORED TOGETHER ON ONE NODE,
IDENTIFIED BY <keyspace, key>
{
  "users": {                                   ← column family
    "alice": {                                 ← key
      "city": ["St. Louis", 1287040737182],    ← columns: <name, value, timestamp>
      "name": ["Alice", 1287080340940],
    },
    ...
  },
}
[Ring diagram: keys such as "alice" and "bob" map to nodes around the ring]
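For flavor, here is what that structure looks like through pycassa, a Thrift-era Python client (the "geo" keyspace and the column values are invented for illustration):

```python
import pycassa

pool = pycassa.ConnectionPool("geo", server_list=["127.0.0.1:9160"])
users = pycassa.ColumnFamily(pool, "users")     # the "users" column family

# One row keyed by "alice", holding two columns; timestamps are added for us.
users.insert("alice", {"city": "St. Louis", "name": "Alice"})

print(users.get("alice"))
# -> OrderedDict([('city', 'St. Louis'), ('name', 'Alice')])
```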
48. LOG-STRUCTURED MERGE
MEMTABLES are in memory data structures that
contain newly written data
COMMIT LOGS are append only files where new
data is durably written
SSTABLES are serialized memtables, persisted to
disk
COMPACTION periodically merges multiple
SSTables to improve system performance
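A toy sketch of how those pieces fit together (purely illustrative; the real storage engine also handles tombstones, bloom filters, and on-disk formats that this ignores):

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.commit_log = []        # append-only, written first for durability
        self.memtable = {}          # newest data, held in memory
        self.sstables = []          # sorted, immutable (key, value) lists on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Serialize the memtable to a sorted SSTable and start a fresh one."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:              # later tables overwrite earlier ones
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]
```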
49. CASSANDRA
CONCEPTUAL SUMMARY...
IT’S A DISTRIBUTED HASH TABLE
Gossip based peer-to-peer “ring” with no distinguished nodes and no
single point of failure
Consistent hashing distributes workload and simple replication
strategy for fault tolerance and improved throughput
WITH TUNABLE CONSISTENCY
Based on quorum protocol to ensure consistency
And simple repair mechanisms to stay available during partial system
failures
AND A SIMPLE, SCHEMA-FREE DATA MODEL
It’s just a key-value store
Whose values are multi-dimensional sorted maps
51. A FIRST PASS
THE ORDER PRESERVING PARTITIONER
CASSANDRA’S PARTITIONING
STRATEGY IS PLUGGABLE
Partitioner maps keys to nodes
Random partitioner destroys locality by hashing
Order preserving partitioner retains locality, storing
keys in natural lexicographical order around the ring
[Ring diagram: keys alice, bob, and sam placed in lexicographic order around the ring]
57. GEOHASH
SIMPLE TO COMPUTE
Interleave the bits of decimal coordinates
(equivalent to binary encoding of pre-order
traversal!)
Base32 encode the result
AWESOME CHARACTERISTICS
Arbitrary precision
Human readable
Sorts lexicographically
[Example: the 5-bit group 01101 base32-encodes to the character "e"]
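A minimal geohash encoder sketch along those lines (illustrative, not SimpleGeo's code; the coordinates in the example are arbitrary): halve a bounding box around the point, emitting one bit per halving with longitude on even bit positions and latitude on odd ones, then base32-encode each group of five bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=9):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits, encoded = [], []
    use_lon = True                      # even bit positions encode longitude
    while len(encoded) < precision:
        coord, rng = (lon, lon_range) if use_lon else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:                # point is in the upper half: emit 1
            bits.append(1)
            rng[0] = mid
        else:                           # lower half: emit 0
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
        if len(bits) == 5:              # every 5 bits become one base32 character
            encoded.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(encoded)

print(geohash(37.7749, -122.4194, 5))   # -> "9q8yy", a cell covering San Francisco
```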
60. SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
Clients must
• Pre-process to compose multiple queries
• Post-process to filter and merge results
Degenerate cases can be bad, particularly for nearest-neighbor
queries
71. HELLO, DRAWING BOARD
SURVEY OF DISTRIBUTED P2P INDEXING
An overlay-dependent index works directly with nodes of the
peer-to-peer network, defining its own overlay
An over-DHT index overlays a more sophisticated data
structure on top of a peer-to-peer distributed hash table
72. ANOTHER LOOK AT POSTGIS
MIGHT WORK, BUT
The relational transaction management system (which we’d
want to change) and access methods (which we’d have to
change) are tightly coupled (necessarily?) to other parts of
the system
Could work at a higher level and treat PostGIS as a black box
• Now we’re back to implementing a peer-to-peer network with failure
recovery, fault detection, etc... and Cassandra already had all that.
• It’s probably clear by now that I think these problems are more
difficult than actually storing structured data on disk
82. SPLITTING
IT’S PRETTY MUCH JUST A CONCURRENT TREE
Splitting shouldn’t lock the tree for reads or writes and failures
shouldn’t cause corruption
• Splits are optimistic, idempotent, and fail-forward
• Instead of locking, writes are replicated to the splitting node and the
relevant child[ren] while a split operation is taking place
• Cleanup occurs after the split is completed and all interested nodes are
aware that the split has occurred
• Cassandra writes are idempotent, so splits are too - if a split fails, it is
simply retried
Split size: a tunable knob for balancing locality and distributedness
The other hard problem with concurrent trees is rebalancing - we
just don’t do it! (more on this later)
83. THE ROOT IS HOT
MIGHT BE A DEAL BREAKER
For a tree to be useful, it has to be traversed
• Typically, tree traversal starts at the root
• Root is the only discoverable node in our tree
Traversing through the root meant reading the root for every
read or write below it - unacceptable
• Lots of academic solutions - most promising was a skip graph, but
that required O(n log(n)) data - also unacceptable
• Minimum tree depth was proposed, but then you just get multiple hot-
spots at your minimum depth nodes
84. BACK TO THE BOOKS
LOTS OF ACADEMIC WORK ON THIS TOPIC
But academia is obsessed with provable, deterministic,
asymptotically optimal algorithms
And we only need something that is probably fast enough
most of the time (for some value of “probably” and “most of
the time”)
• And if the probably good enough algorithm is, you know... tractable...
one might even consider it qualitatively better!
87. THINKING HOLISTICALLY
WE OBSERVED THAT
Once a node in the tree exists, it doesn’t go away
Node state may change, but that state only really matters
locally - thinking a node is a leaf when it really has children is
not fatal
SO... WHAT IF WE JUST CACHED NODES THAT
WERE OBSERVED IN THE SYSTEM!?
88. CACHE IT
STUPID SIMPLE SOLUTION
Keep an LRU cache of nodes that have been traversed
Start traversals at the most selective relevant node
If that node doesn’t satisfy you, traverse up the tree
Along with your result set, return a list of nodes that were
traversed so the caller can add them to its cache
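A stripped-down sketch of that caching idea (a heavy simplification: it assumes tree nodes are identified by geohash-style prefixes, so walking "up" the tree is just shortening the prefix; this is not SimpleGeo's actual index code):

```python
from collections import OrderedDict

class TraversalCache:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.seen = OrderedDict()            # LRU set of node ids we've traversed

    def remember(self, node_ids):
        """Cache the node ids returned alongside a result set."""
        for node_id in node_ids:
            self.seen[node_id] = True
            self.seen.move_to_end(node_id)
        while len(self.seen) > self.capacity:
            self.seen.popitem(last=False)    # evict the least recently used node

    def start_node(self, query_prefix):
        """Most selective cached node covering the query; '' means the root."""
        for length in range(len(query_prefix), 0, -1):
            candidate = query_prefix[:length]
            if candidate in self.seen:
                return candidate
        return ""                            # nothing cached: start at the root
```

On the happy path the whole prefix is already cached and the traversal starts right at the relevant node; with a cold cache it falls back toward the root, which lines up with the zero-overhead and O(log n) bounds on the next slide.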
92. KEY CHARACTERISTICS
PERFORMANCE
Best case on the happy path (everything cached) has zero
read overhead
Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
Makes worst case more worser, but so far so good
96. THE BIRDS ‘N THE BEES
[Diagram: ELB fanning out to parallel stacks of gate → service → cass → worker pool, sharing the index]
97. THE BIRDS ‘N THE BEES
[Diagram: a single stack, ELB → gate → service → cass → worker pool → index]
98.–102. THE BIRDS ‘N THE BEES
[Progressive builds of the same stack, annotating one layer per slide]
103. THE BIRDS ‘N THE BEES
ELB           load balancing; an AWS service
gate          authentication; forwarding
service       business logic - basic validation
cass          record storage
worker pool   business logic - storage/indexing
index         awesome sauce for querying
104. ELB
•Traffic management
•Control which AZs are serving traffic
•Upgrades without downtime
•Able to remove an AZ, upgrade, test,
replace
•API-level failure scenarios
•Periodically runs healthchecks on nodes
•Removes nodes that fail
105. GATE
•Basic auth
•HTTP proxy to specific services
•Services are independent of one another
•Auth is decoupled from business logic
•First line of defense
•Very fast, very cheap
•Keeps services from being overwhelmed
by poorly authenticated requests
106. RABBITMQ
•Decouple accepting writes from
performing the heavy lifting
•Don’t block client while we write to db/
index
•Flexibility in the event of degradation
further down the stack
•Queues can hold a lot, and can keep
accepting writes throughout incident
•Heterogeneous consumers - pass the same
message through multiple code paths easily
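A hedged sketch of that write path using pika, the standard Python RabbitMQ client (the queue name, record fields, and broker address are invented for illustration):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="writes", durable=True)   # survive broker restarts

record = {"key": "alice", "lat": 38.63, "lon": -90.20}

# The API server can ack the client as soon as this publish succeeds; worker
# pools consume the message later and do the heavy lifting (storage, indexing).
channel.basic_publish(
    exchange="",
    routing_key="writes",
    body=json.dumps(record),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```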
119. GET ‘ER DONE
• Revision control
• Automate build process
• Automate testing process
• Automate deployment
Local code changes should result in production
deployments.
120. DON’T FORGET TO DEBIANIZE
• All codebases must be debianized
• If an open source project isn't debianized yet, fork the repo and
do it yourself!
• Take the time to teach others
• Debian directories can easily be reused after a simple search
and replace
123. MAINTAINING MULTIPLE
ENVIRONMENTS
• Run unit tests in a development environment
• Promote to staging
• Run system tests in a staging environment
• Run consumption tests in a staging environment
• Promote to production
Congratz, you have now just automated yourself
out of a job.
126. FLUME
Flume is a distributed, reliable and available
service for efficiently collecting, aggregating and
moving large amounts of log data.
syslog on steroids