This was a three hour workshop given at the 2011 Web 2.0 Expo in San Francisco. Due to the length of the presentation and the number of presenters, portions of the slide deck may appear disjointed without the accompanying narrative.
Abstract: "The hype cycle is at a high for cloud computing, distributed “NoSQL” data storage, and high availability map-reducing eventually consistent distributed data processing frameworks everywhere. Back in the real world we know that these technologies aren’t a cure-all. But they’re not worthless, either. We’ll take a look behind the curtains and share some of our experiences working with these systems in production at SimpleGeo.
Our stack consists of Cassandra, HBase, Hadoop, Flume, node.js, rabbitmq, and Puppet. All running on Amazon EC2. Tying these technologies together has been a challenge, but the result really is worth the work. The rotten truth is that our ops guys still wake up in the middle of the night sometimes, and our engineers face new and novel challenges. Let us share what’s keeping us busy—the folks working in the wee hours of the morning—in the hopes that you won’t have to do so yourself."
Scalable Data Storage Getting You Down? To The Cloud!
1. SCALABLE DATA STORAGE
GETTING YOU DOWN?
TO THE CLOUD!
Web 2.0 Expo SF 2011
Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop
2. THE CAST
MIKE MALONE
INFRASTRUCTURE ENGINEER
@MJMALONE
MIKE PANCHENKO
INFRASTRUCTURE ENGINEER
@MIHASYA
DEREK SMITH
INFRASTRUCTURE ENGINEER
@DSMITTS
PAUL LATHROP
OPERATIONS
@GREYTALYN
3. SIMPLEGEO
We originally began as a mobile
gaming startup, but quickly
discovered that the location services
and infrastructure needed to support
our ideas didn’t exist. So we took
matters into our own hands and
began building it ourselves.
Matt Galligan, CSO & co-founder        Joe Stump, CTO & co-founder
4. THE STACK
[Architecture diagram: HTTP traffic from the web enters through AWS ELB and an auth/proxy layer to API servers spread across data centers; writes flow through queues to record storage and index storage (Apache Cassandra); reads are served by the geocoder, reverse geocoder, GeoIP, and pushpin services; RDS also sits in the AWS layer]
6. DATABASES
WHAT ARE THEY GOOD FOR?
DATA STORAGE
Durably persist system state
CONSTRAINT MANAGEMENT
Enforce data integrity constraints
EFFICIENT ACCESS
Organize data and implement access methods for efficient
retrieval and summarization
7. DATA INDEPENDENCE
Data independence shields clients from the details
of the storage system and its underlying data structures
LOGICAL DATA INDEPENDENCE
Clients that operate on a subset of the attributes in a data set should
not be affected later when new attributes are added
PHYSICAL DATA INDEPENDENCE
Clients that interact with a logical schema remain the same despite
physical data structure changes like
• File organization
• Compression
• Indexing strategy
8. TRANSACTIONAL RELATIONAL
DATABASE SYSTEMS
HIGH DEGREE OF DATA INDEPENDENCE
Logical structure: SQL Data Definition Language
Physical structure: Managed by the DBMS
OTHER GOODIES
They’re theoretically pure, well understood, and mostly
standardized behind a relatively clean abstraction
They provide robust contracts that make it easy to reason
about the structure and nature of the data they contain
They’re ubiquitous, battle hardened, robust, durable, etc.
9. ACID
These terms are not formally defined - they’re a
framework, not mathematical axioms
ATOMICITY
Either all of a transaction’s actions are visible to another transaction, or none are
CONSISTENCY
Application-specific constraints must be met for transaction to succeed
ISOLATION
Two concurrent transactions will not see one another’s changes while “in flight”
DURABILITY
The updates made to the database in a committed transaction will be visible to
future transactions
10. ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good
IT DOES SOMETHING HARD FOR YOU
With ACID, you’re guaranteed to maintain a persistent global
state as long as you’ve defined proper constraints and your
logical transactions result in a valid system state
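As a small, hedged illustration (not from the talk), here is how constraints and atomicity work together using Python's built-in sqlite3: the second update violates a CHECK constraint, so neither update in the transaction is committed.

```python
# Illustrative only: a transfer that violates a constraint rolls back atomically.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
           "balance INTEGER CHECK (balance >= 0))")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

try:
    with db:  # one transaction: commits on success, rolls back on exception
        db.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        db.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # CHECK constraint failed, so bob's credit is rolled back too

print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]  (all-or-nothing: the valid half did not survive)
```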
11. CAP THEOREM
At PODC 2000 Eric Brewer told us there were three
desirable DB characteristics. But we can only have two.
CONSISTENCY
Every node in the system contains the same data (e.g., replicas are
never out of date)
AVAILABILITY
Every request to a non-failing node in the system returns a response
PARTITION TOLERANCE
System properties (consistency and/or availability) hold even when
the system is partitioned and data is lost
14. CAP THEOREM IN 30 SECONDS
[Diagram: CLIENT sends a write to SERVER, which forwards it to a REPLICA]
15. CAP THEOREM IN 30 SECONDS
[Diagram: the REPLICA acknowledges the replicated write back to SERVER]
16. CAP THEOREM IN 30 SECONDS
[Diagram: SERVER accepts the write and acknowledges the CLIENT]
17. CAP THEOREM IN 30 SECONDS
[Diagram: replication to the REPLICA fails; if SERVER rejects the write, the system is UNAVAILABLE]
18. CAP THEOREM IN 30 SECONDS
[Diagram: replication to the REPLICA fails; if SERVER accepts the write anyway, the system is INCONSISTENT]
19. ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”
Unfortunately, ANSI SQL’s definition of isolation...
relies in subtle ways on an assumption that a locking scheme is
used for concurrency control, as opposed to an optimistic or
multi-version concurrency scheme. This implies that the
proposed semantics are ill-defined.
Joseph M. Hellerstein and Michael Stonebraker
Anatomy of a Database System
20. BALANCE
IT’S A QUESTION OF VALUES
For traditional databases CAP consistency is the holy grail: it’s
maximized at the expense of availability and partition
tolerance
At scale, failures happen: when you’re doing something a
million times a second a one-in-a-million failure happens every
second
We’re witnessing the birth of a new religion...
• CAP consistency is a luxury that must be sacrificed at scale in order to
maintain availability when faced with failures
21. NETWORK INDEPENDENCE
A distributed system must also manage the
network - if it doesn’t, the client has to
CLIENT APPLICATIONS ARE LEFT TO HANDLE
Partitioning data across multiple machines
Working with loosely defined replication semantics
Detecting, routing around, and correcting network and
hardware failures
22. WHAT’S WRONG
WITH MYSQL..?
TRADITIONAL RELATIONAL DATABASES
They are from an era (er, one of the eras) when Big Iron was
the answer to scaling up
In general, the network was not considered part of the system
NEXT GENERATION DATABASES
Deconstructing, and decoupling the beast
Trying to create a loosely coupled structured storage system
• Something that the current generation of database systems never
quite accomplished
24. APACHE CASSANDRA
A DISTRIBUTED STRUCTURED STORAGE SYSTEM
EMPHASIZING
Extremely large data sets
High transaction volumes
High value data that necessitates high availability
TO USE CASSANDRA EFFECTIVELY IT HELPS TO
UNDERSTAND WHAT’S GOING ON BEHIND THE SCENES
25. APACHE CASSANDRA
A DISTRIBUTED HASH TABLE WITH SOME TRICKS
Peer-to-peer architecture with no distinguished nodes, and
therefore no single points of failure
Gossip-based cluster management
Generic distributed data placement strategy maps data to nodes
• Pluggable partitioning
• Pluggable replication strategy
Quorum based consistency, tunable on a per-request basis
Keys map to sparse, multi-dimensional sorted maps
Append-only commit log and SSTables for efficient disk utilization
26. NETWORK MODEL
DYNAMO INSPIRED
CONSISTENT HASHING
Simple random partitioning mechanism for distribution
Low fuss online rebalancing when operational requirements
change
GOSSIP PROTOCOL
Simple decentralized cluster configuration and fault detection
Core protocol for determining cluster membership and
providing resilience to partial system failure
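A minimal consistent-hashing sketch, purely illustrative (the node addresses and MD5 token function are assumptions, not Cassandra's partitioner): keys hash onto the ring and are owned by the next N distinct nodes clockwise, so adding or removing a node only moves the keys on adjacent arcs.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: token -> node, keys walk clockwise."""

    def __init__(self, nodes):
        self.tokens = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key, n=3):
        """First n distinct nodes clockwise from the key's ring position."""
        start = bisect.bisect(self.tokens, (self._token(key), ""))
        owners = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.replicas("alice"))  # the three nodes responsible for key "alice"
```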
32. GOSSIP
DISSEMINATES CLUSTER MEMBERSHIP AND
RELATED CONTROL STATE
Gossip is initiated by an interval timer
At each gossip tick a node will
• Randomly select a live node in the cluster, sending it a gossip message
• Attempt to contact cluster members that were previously marked as
down
If the gossip message is unacknowledged for some period of
time (statistically adjusted based on the inter-arrival time of
previous messages) the remote node is marked as down
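A rough, illustrative sketch of that gossip tick (not Cassandra's failure detector, which uses phi-accrual with statistically adjusted timeouts; the fixed timeout and the send_gossip callback are assumptions):

```python
import random
import time

class Gossiper:
    def __init__(self, peers, timeout=2.0):
        self.live = set(peers)              # peers currently believed up
        self.down = set()                   # peers currently believed down
        self.last_ack = {p: time.time() for p in peers}
        self.timeout = timeout              # Cassandra adjusts this statistically

    def tick(self, send_gossip):
        if self.live:
            send_gossip(random.choice(sorted(self.live)))   # gossip with one live node
        for peer in list(self.down):
            send_gossip(peer)                               # keep probing downed nodes
        now = time.time()
        for peer in list(self.live):
            if now - self.last_ack[peer] > self.timeout:    # no ack in time: mark down
                self.live.discard(peer)
                self.down.add(peer)

    def on_ack(self, peer):
        self.last_ack[peer] = time.time()
        self.down.discard(peer)
        self.live.add(peer)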
35. TUNABLE CONSISTENCY
WRITES
ZERO DON’T BOTHER WAITING FOR A RESPONSE
ANY WAIT FOR SOME NODE (NOT NECESSARILY A
REPLICA) TO RESPOND
ONE WAIT FOR ONE REPLICA TO RESPOND
QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND
ALL WAIT FOR ALL N REPLICAS TO RESPOND
36. TUNABLE CONSISTENCY
READS
ONE WAIT FOR ONE REPLICA TO RESPOND
QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND
ALL WAIT FOR ALL N REPLICAS TO RESPOND
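Because the replica count is checked per request, choosing levels so that W + R > N (for example QUORUM writes plus QUORUM reads with N = 3 replicas) guarantees that every read overlaps the most recent successful write. A hedged example using the DataStax Python driver (a later client than the Thrift interfaces available in 2011; the keyspace and table names are made up):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("geo")  # hypothetical keyspace

# Wait for a quorum of replicas (N/2 + 1) to acknowledge the write...
write = SimpleStatement(
    "INSERT INTO users (key, city) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("alice", "St. Louis"))

# ...and read at ONE where lower latency matters more than freshness.
read = SimpleStatement(
    "SELECT city FROM users WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE)
row = session.execute(read, ("alice",)).one()
```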
38. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR Asynchronously checks replicas during
reads and repairs any inconsistencies
HINTED HANDOFF
ANTI-ENTROPY
[Diagram: a write at W=2 reaches two of three replicas; a later read notices the stale replica and fixes it]
39. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
[Diagram: a write destined for a downed replica is stored on another node along with a hint]
40. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF Sends failed writes to another node
with a hint to re-replicate when the failed node returns
ANTI-ENTROPY
[Diagram: the failed node returns and the hinted write is replayed to repair it]
41. CONSISTENCY MODEL
DYNAMO INSPIRED
READ REPAIR
HINTED HANDOFF
ANTI-ENTROPY Manual repair process where nodes
generate Merkle trees (hash trees) to detect and
repair data inconsistencies
[Diagram: replicas exchange Merkle trees and repair the ranges that differ]
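An illustrative sketch of the Merkle-tree comparison behind anti-entropy (the bucketing and hashing here are simplifications, not Cassandra's implementation): replicas exchange compact hash trees and only stream the key ranges whose hashes disagree.

```python
import hashlib

def bucket_of(key, buckets):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def leaf_hashes(rows, buckets=8):
    """Hash a replica's rows (key -> value) into per-range digests."""
    ranges = [hashlib.sha1() for _ in range(buckets)]
    for key in sorted(rows):
        ranges[bucket_of(key, buckets)].update(f"{key}={rows[key]}".encode())
    return [h.hexdigest() for h in ranges]

def merkle_root(hashes):
    """Pairwise-hash leaves upward until a single root digest remains."""
    while len(hashes) > 1:
        hashes = [hashlib.sha1((a + b).encode()).hexdigest()
                  for a, b in zip(hashes[::2], hashes[1::2])]
    return hashes[0]

def ranges_to_repair(local_rows, remote_rows):
    local, remote = leaf_hashes(local_rows), leaf_hashes(remote_rows)
    if merkle_root(local) == merkle_root(remote):
        return []                          # roots match: the replicas agree
    return [i for i, (l, r) in enumerate(zip(local, remote)) if l != r]
```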
42. DATA MODEL
BIGTABLE INSPIRED
SPARSE MATRIX it’s a hash-map (associative array):
a simple, versatile data structure
SCHEMA-FREE data model, introduces new freedom
and new responsibilities
COLUMN FAMILIES blend row-oriented and column-
oriented structure, providing a high level mechanism
for clients to manage on-disk and inter-node data
locality
43. DATA MODEL
TERMINOLOGY
KEYSPACE A named collection of column families
(similar to a “database” in MySQL) you only need one and
you can mostly ignore it
COLUMN FAMILY A named mapping of keys to rows
ROW A named sorted map of columns or supercolumns
COLUMN A <name, value, timestamp> triple
SUPERCOLUMN A named collection of columns, for
people who want to get fancy
45. IT’S A DISTRIBUTED HASH TABLE
WITH A TWIST...
COLUMNS IN A ROW ARE STORED TOGETHER ON ONE NODE,
IDENTIFIED BY <keyspace, key>
{
  "users": {                                   ← column family
    "alice": {                                 ← key
      "city": ["St. Louis", 1287040737182],    ← columns: <name, value, timestamp>
      "name": ["Alice", 1287080340940],
    },
    ...
  },
}
[Ring diagram: keys such as "alice" and "bob" map to nodes around the ring]
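For flavor, here is what that structure looks like through pycassa, a Thrift-era Python client (the "geo" keyspace and the column values are invented for illustration):

```python
import pycassa

pool = pycassa.ConnectionPool("geo", server_list=["127.0.0.1:9160"])
users = pycassa.ColumnFamily(pool, "users")     # the "users" column family

# One row keyed by "alice", holding two columns; timestamps are added for us.
users.insert("alice", {"city": "St. Louis", "name": "Alice"})

print(users.get("alice"))
# -> OrderedDict([('city', 'St. Louis'), ('name', 'Alice')])
```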
48. LOG-STRUCTURED MERGE
MEMTABLES are in memory data structures that
contain newly written data
COMMIT LOGS are append only files where new
data is durably written
SSTABLES are serialized memtables, persisted to
disk
COMPACTION periodically merges multiple
SSTables to improve system performance
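A toy sketch of how those pieces fit together (purely illustrative; the real storage engine also handles tombstones, bloom filters, and on-disk formats that this ignores):

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.commit_log = []        # append-only, written first for durability
        self.memtable = {}          # newest data, held in memory
        self.sstables = []          # sorted, immutable (key, value) lists on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Serialize the memtable to a sorted SSTable and start a fresh one."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:              # later tables overwrite earlier ones
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]
```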
49. CASSANDRA
CONCEPTUAL SUMMARY...
IT’S A DISTRIBUTED HASH TABLE
Gossip based peer-to-peer “ring” with no distinguished nodes and no
single point of failure
Consistent hashing distributes workload and simple replication
strategy for fault tolerance and improved throughput
WITH TUNABLE CONSISTENCY
Based on quorum protocol to ensure consistency
And simple repair mechanisms to stay available during partial system
failures
AND A SIMPLE, SCHEMA-FREE DATA MODEL
It’s just a key-value store
Whose values are multi-dimensional sorted maps
51. A FIRST PASS
THE ORDER PRESERVING PARTITIONER
CASSANDRA’S PARTITIONING
STRATEGY IS PLUGGABLE
Partitioner maps keys to nodes
Random partitioner destroys locality by hashing
Order preserving partitioner retains locality, storing
keys in natural lexicographical order around the ring
[Ring diagram: keys alice, bob, and sam placed in lexicographic order around the ring]
57. GEOHASH
SIMPLE TO COMPUTE
Interleave the bits of decimal coordinates
(equivalent to binary encoding of pre-order
traversal!)
Base32 encode the result
AWESOME CHARACTERISTICS
Arbitrary precision
Human readable
Sorts lexicographically
[Example: the 5-bit group 01101 base32-encodes to the character "e"]
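A minimal geohash encoder sketch along those lines (illustrative, not SimpleGeo's code; the coordinates in the example are arbitrary): halve a bounding box around the point, emitting one bit per halving with longitude on even bit positions and latitude on odd ones, then base32-encode each group of five bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=9):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits, encoded = [], []
    use_lon = True                      # even bit positions encode longitude
    while len(encoded) < precision:
        coord, rng = (lon, lon_range) if use_lon else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:                # point is in the upper half: emit 1
            bits.append(1)
            rng[0] = mid
        else:                           # lower half: emit 0
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
        if len(bits) == 5:              # every 5 bits become one base32 character
            encoded.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(encoded)

print(geohash(37.7749, -122.4194, 5))   # -> "9q8yy", a cell covering San Francisco
```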
60. SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
Clients must
• Pre-process to compose multiple queries
• Post-process to filter and merge results
Degenerate cases can be bad, particularly for nearest-neighbor
queries
71. HELLO, DRAWING BOARD
SURVEY OF DISTRIBUTED P2P INDEXING
An overlay-dependent index works directly with nodes of the
peer-to-peer network, defining its own overlay
An over-DHT index overlays a more sophisticated data
structure on top of a peer-to-peer distributed hash table
72. ANOTHER LOOK AT POSTGIS
MIGHT WORK, BUT
The relational transaction management system (which we’d
want to change) and access methods (which we’d have to
change) are tightly coupled (necessarily?) to other parts of
the system
Could work at a higher level and treat PostGIS as a black box
• Now we’re back to implementing a peer-to-peer network with failure
recovery, fault detection, etc... and Cassandra already had all that.
• It’s probably clear by now that I think these problems are more
difficult than actually storing structured data on disk
82. SPLITTING
IT’S PRETTY MUCH JUST A CONCURRENT TREE
Splitting shouldn’t lock the tree for reads or writes and failures
shouldn’t cause corruption
• Splits are optimistic, idempotent, and fail-forward
• Instead of locking, writes are replicated to the splitting node and the
relevant child[ren] while a split operation is taking place
• Cleanup occurs after the split is completed and all interested nodes are
aware that the split has occurred
• Cassandra writes are idempotent, so splits are too - if a split fails, it is
simply retried
Split size: a tunable knob for balancing locality and distributedness
The other hard problem with concurrent trees is rebalancing - we
just don’t do it! (more on this later)
83. THE ROOT IS HOT
MIGHT BE A DEAL BREAKER
For a tree to be useful, it has to be traversed
• Typically, tree traversal starts at the root
• Root is the only discoverable node in our tree
Traversing through the root meant reading the root for every
read or write below it - unacceptable
• Lots of academic solutions - most promising was a skip graph, but
that required O(n log(n)) data - also unacceptable
• Minimum tree depth was proposed, but then you just get multiple hot-
spots at your minimum depth nodes
84. BACK TO THE BOOKS
LOTS OF ACADEMIC WORK ON THIS TOPIC
But academia is obsessed with provable, deterministic,
asymptotically optimal algorithms
And we only need something that is probably fast enough
most of the time (for some value of “probably” and “most of
the time”)
• And if the probably good enough algorithm is, you know... tractable...
one might even consider it qualitatively better!
87. THINKING HOLISTICALLY
WE OBSERVED THAT
Once a node in the tree exists, it doesn’t go away
Node state may change, but that state only really matters
locally - thinking a node is a leaf when it really has children is
not fatal
SO... WHAT IF WE JUST CACHED NODES THAT
WERE OBSERVED IN THE SYSTEM!?
88. CACHE IT
STUPID SIMPLE SOLUTION
Keep an LRU cache of nodes that have been traversed
Start traversals at the most selective relevant node
If that node doesn’t satisfy you, traverse up the tree
Along with your result set, return a list of nodes that were
traversed so the caller can add them to its cache
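A stripped-down sketch of that caching idea (a heavy simplification: it assumes tree nodes are identified by geohash-style prefixes, so walking "up" the tree is just shortening the prefix; this is not SimpleGeo's actual index code):

```python
from collections import OrderedDict

class TraversalCache:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.seen = OrderedDict()            # LRU set of node ids we've traversed

    def remember(self, node_ids):
        """Cache the node ids returned alongside a result set."""
        for node_id in node_ids:
            self.seen[node_id] = True
            self.seen.move_to_end(node_id)
        while len(self.seen) > self.capacity:
            self.seen.popitem(last=False)    # evict the least recently used node

    def start_node(self, query_prefix):
        """Most selective cached node covering the query; '' means the root."""
        for length in range(len(query_prefix), 0, -1):
            candidate = query_prefix[:length]
            if candidate in self.seen:
                return candidate
        return ""                            # nothing cached: start at the root
```

On the happy path the whole prefix is already cached and the traversal starts right at the relevant node; with a cold cache it falls back toward the root, which lines up with the zero-overhead and O(log n) bounds on the next slide.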
92. KEY CHARACTERISTICS
PERFORMANCE
Best case on the happy path (everything cached) has zero
read overhead
Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
Makes worst case more worser, but so far so good
96. THE BIRDS ‘N THE BEES
[Diagram: ELB fanning out to parallel stacks of gate → service → cass → worker pool, sharing the index]
97. THE BIRDS ‘N THE BEES
[Diagram: a single stack, ELB → gate → service → cass → worker pool → index]
98.–102. THE BIRDS ‘N THE BEES
[Progressive builds of the same stack, annotating one layer per slide]
103. THE BIRDS ‘N THE BEES
ELB           load balancing; an AWS service
gate          authentication; forwarding
service       business logic - basic validation
cass          record storage
worker pool   business logic - storage/indexing
index         awesome sauce for querying
104. ELB
•Traffic management
•Control which AZs are serving traffic
•Upgrades without downtime
•Able to remove an AZ, upgrade, test,
replace
•API-level failure scenarios
•Periodically runs healthchecks on nodes
•Removes nodes that fail
105. GATE
•Basic auth
•HTTP proxy to specific services
•Services are independent of one another
•Auth is decoupled from business logic
•First line of defense
•Very fast, very cheap
•Keeps services from being overwhelmed
by poorly authenticated requests
106. RABBITMQ
•Decouple accepting writes from
performing the heavy lifting
•Don’t block client while we write to db/
index
•Flexibility in the event of degradation
further down the stack
•Queues can hold a lot, and can keep
accepting writes throughout incident
•Heterogeneous consumers - pass the same
message through multiple code paths easily
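A hedged sketch of that write path using pika, the standard Python RabbitMQ client (the queue name, record fields, and broker address are invented for illustration):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="writes", durable=True)   # survive broker restarts

record = {"key": "alice", "lat": 38.63, "lon": -90.20}

# The API server can ack the client as soon as this publish succeeds; worker
# pools consume the message later and do the heavy lifting (storage, indexing).
channel.basic_publish(
    exchange="",
    routing_key="writes",
    body=json.dumps(record),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```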
119. GET ‘ER DONE
• Revision control
• Automate build process
• Automate testing process
• Automate deployment
Local code changes should result in production
deployments.
120. DON’T FORGET TO DEBIANIZE
• All codebases must be debianized
• If an open source project isn't debianized yet, fork the repo and
do it yourself!
• Take the time to teach others
• Debian directories can easily be reused after a simple search
and replace
123. MAINTAINING MULTIPLE
ENVIRONMENTS
• Run unit tests in a development environment
• Promote to staging
• Run system tests in a staging environment
• Run consumption tests in a staging environment
• Promote to production
Congratz, you have now just automated yourself
out of a job.
126. FLUME
Flume is a distributed, reliable and available
service for efficiently collecting, aggregating and
moving large amounts of log data.
syslog on steroids