A look at Dynamo-based systems: the architectural principles, use cases and requirements; where they differ from relational databases; and where they are going.
9. Global access
Multiple machines
Multiple datacenters
Scale to peak loads easily
Tolerance of continuous failure
11. Traditionally production systems store their
state in relational databases. For many of the
more common usage patterns of state
persistence, however, a relational database is
a solution that is far from ideal.
Most of these services only store and retrieve
data by primary key and do not require the
complex querying and management
functionality offered by an RDBMS.
This excess functionality requires expensive
hardware and highly skilled personnel for its
operation, making it a very inefficient solution.
In addition, the available replication
technologies are limited and typically choose
consistency over availability.
Although many advances have been made in
the recent years, it is still not easy to scale-out
databases or use smart partitioning schemes
for load balancing.
Dynamo: Amazon’s Highly Available Key-value Store
13. People tend to focus on consistency/availability as the sole driver of
emerging database models because it provides a simple and academic
explanation for more complex evolutionary factors. In fact, CAP
Theorem, according to its original author, “prohibits only a tiny part of
the design space: perfect availability and consistency in the presence of
partitions, which are rare… there is little reason to forfeit C or A when
the system is not partitioned.” In reality, a much larger range of
considerations and tradeoffs has informed the “NoSQL” movement…
15. Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database… It is the first system to distribute data at global scale and support externally-consistent distributed transactions...
Spanner is designed to scale up to millions of
machines across hundreds of datacenters and
trillions of database rows… Spanner’s main
focus is managing cross-datacenter replicated
data…
Spanner started… as part of a rewrite of
Google’s advertising backend called F1 [35].
This backend was originally based on a MySQL
database…
Resharding this revenue-critical database as it
grew in the number of customers and their
data was extremely costly. The last resharding
took over two years of intense effort…
Spanner: Google’s Globally-Distributed Database
17. Database design is driven
by a virtuous tension
between the requirements
of the app, the profile of
developer productivity,
and the limitations of the
operational scenario.
18. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
Stringent latency requirements measured at the 99.9th percentile
Highly available
Always writeable
Modeled as keys/values
19. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
Choice to manage conflict resolution themselves or on the data store level
Simple, primary-key-only interface
No need for relational data model
20. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
Functions on commodity hardware
Each object must be replicated across multiple DCs
Can scale out one node at a time with minimal impact on system and operators
21. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
22. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
1995: fewer than 40 million internet users; now: 2.4 billion
Latency perceived as unavailability
New types of applications
23. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
Much more data
Unstructured data
New kinds of business requirements
App scales gracefully without high development overhead
24. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
Scale-out design on less expensive hardware
Ability to easily meet peak loads
Run efficiently across multiple sites
Low operational burden
25. Aspects of the database:
• How to distribute data around the cluster
• Adding new nodes
• Replicating data
• Resolving data conflicts
• Dealing with failure scenarios
• Data model
26. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the operational scenario: how to distribute data around the cluster.
28. Bunny Names A-G
Bunny Names H-Q
Bunny Names R-Z
how to distribute data
around the cluster
29. Bunny Names Disproportionately Trend Towards Bunny, Cuddles, Fluffy,
Mr. Bunny, Peter Rabbit, Velveteen, Peter Cottontail, and Mitten
how to distribute data
around the cluster
50. “Every node in Dynamo should have the same
set of responsibilities as its peers; there should
be no distinguished node or nodes that take
special roles or extra set of responsibilities.”
replicating data
52. [Diagram: reads and writes arriving at every node in the cluster]
Clients can read / write to any node
All updates reach all replicas eventually
replicating data
60. Whether one object is a direct descendant of the other
Whether the objects are direct descendants of a common parent
Whether the objects are unrelated in recent heritage
vector clocks that show relationships between objects
resolving conflicts
61. vector clock is updated when objects are updated
last-write wins or conflicts can be resolved on client side
resolving conflicts
62. if stale responses are returned as part of the read,
those replicas are updated
resolving conflicts
68. “Most of these services only store and retrieve
data by primary key and do not require the
complex querying and management
functionality offered by an RDBMS.”
developing apps
69. “schema-less” more flexibility, agility
developing apps
75. Use case:  Key:                  Value:
Session        User/Session ID       Session Data
Advertising    Campaign ID           Ad Content
Logs           Date                  Log File
Sensor         Date, Date/Time       Updates
User Data      Login, Email, UUID    User Attributes
Content        Title, Integer, Etc.  Text, JSON, XML
developing apps
Basho makes a highly available, open source, distributed database called Riak that is based on the principles of the Dynamo paper.
The Dynamo paper was published about 5 years ago and outlines a set of properties for a highly available key/value store that provides an “always-on” experience for end users. Dynamo is an internal service at Amazon that powers parts of AWS, including S3, and the Amazon.com shopping cart.
As the Dynamo paper discusses, a single page request to Amazon can result in requests to over 150 services, many of which have their own dependencies. High availability and low latency must be maintained across the entire service-oriented architecture to present these properties in the user-facing experience.
Principles from the Dynamo paper have spawned many open-source and commercial projects including Cassandra (came out of Facebook, commercial players include Datastax and Acunu) and Project Voldemort (an open source project coming out of LinkedIn). Basho, launched in 2009, is based on many of Dynamo’s principles for distributed systems and extends the core functionality with full-text search, multi-datacenter replication, MapReduce and other features.
The Amazon shopping cart is a canonical use case of an application requiring the high availability and low-latency that Dynamo-based systems provide. Amazon itself serves as a classic example of both a new class of website/application with a different set of business requirements, as well as a new breed of infrastructure provided by its Web Services platform. For both scenarios, unavailability has a direct impact on revenue, as well as on the user trust required for a healthy business.
Beyond availability and low-latency, the Dynamo paper cites a number of other requirements for its data store. These are important design considerations that profoundly affect the implementation of the database.
In the past, these problems might have been addressed with a big server and some MySQL. Why hasn’t Amazon taken that route?
MySQL systems are consistent. Consistent systems will not be write available during failure conditions, focusing on correctness over serving requests. Ensuring write-availability is a critical aspect of the Dynamo design.
CAP theorem states that in the event of network partition, systems must choose consistency or availability – not both.
However, CAP is only one of many considerations in database design. We are in the early stages of databases that are designed to handle new types of operational environments and application needs / constraints.
As the Dynamo paper shows, many other factors play a role – including the expense of hardware and people to maintain the system, and the need for a scale-out model.
Indeed, even in database designs like Google’s Spanner (in contrast, a consistent distributed database), some of the primary concerns beyond CAP are the ability to scale to multiple datacenters and the operational cost of data growth.
Perhaps it’s time for a new theorem!
Amazon’s application requirements in creating Dynamo.
Amazon’s definition of developer productivity in relation to Dynamo.
Amazon’s operational requirements.
Turns out, Amazon’s needs were echoing trends going on in the larger world. The big shift is not all about CAP – it’s about the big shift to distributed systems. Distributed systems aren’t just about DNS and CDN anymore. The business needs that require a distributed system are relevant to more and more companies. As a result, you’re seeing more and more “old” ideas (databases, storage, monitoring) being re-architected and re-thought for a new, distributed environment.
How requirements for applications have changed.
How the way we define developer productivity has changed.
How the operational environment we are building in has changed.
These are the other hard problems in distributed systems that touch CAP but also speak beyond it – areas the Dynamo paper posits approaches to. Dynamo can really be seen as a collection of technologies and approaches that produce the desired properties of the database… all interconnected in the circle of life. Let’s take a look at how these areas have been handled in a relational world vs how Dynamo addresses them.
In a relational world, you might have thrown all your data (let’s say you’re building a database of bunny pictures and information) on one big machine.
As you grew, you might shard the data across multiple machines through some logical division of data.
Sharding, however, can lead to hot spots in the database. Also, writing and maintaining sharding logic increases the overhead of operating and developing on the database. Significant growth of data or traffic typically means significant, often manual resharding projects. Figuring out how to intelligently split the dataset without negative impacts on performance, operations and development presents a significant challenge.
Two of Dynamo’s goals were to reduce hot spots in the database, and reduce the operational burden of scale.
To accomplish this, Dynamo uses consistent hashing, a technique pioneered at Akamai. In consistent hashing, you start with a 160-bit integer space…
And divide it into equally sized partitions that form a range of possible values.
Keys are hashed onto the integer space. Keys are “owned” by the partition they are hashed onto, and partitions, or virtual nodes, are claimed by physical nodes in the cluster. The even distribution of data created by the hashing function, and the fact that physical nodes share responsibility for partitions, results in a cluster that shares the load.
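That partitioning scheme can be sketched in a few lines of Python. This is a simplified illustration, not Riak's actual implementation: the node names, partition count, and round-robin claim policy are assumptions for the example, though SHA-1 does produce the 160-bit values the scheme calls for.

```python
import hashlib

RING_SIZE = 2 ** 160           # keys hash onto a 160-bit integer space
NUM_PARTITIONS = 8             # equally sized partitions ("virtual nodes")
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS

def key_to_position(key: str) -> int:
    """SHA-1 maps any key to a position on the 160-bit ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def partition_for_key(key: str) -> int:
    """A key is owned by the fixed-size partition it lands in."""
    return key_to_position(key) // PARTITION_SIZE

# Physical nodes claim partitions (round-robin here), sharing the load.
nodes = ["node-a", "node-b", "node-c", "node-d"]
partition_owner = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def node_for_key(key: str) -> str:
    return partition_owner[partition_for_key(key)]

for name in ("Fluffy", "Mr. Bunny", "Peter Cottontail"):
    print(name, "->", node_for_key(name))
```

Because the hash function spreads keys evenly across the ring, no single node ends up holding all the Fluffys.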
In a relational world, scaling up might mean getting bigger machines…
Or re-sharding your sharding scheme.
At massive scale though, this might land you in sharding hell.
You might note that the word “shard” appears zero times in the Dynamo paper.
When new nodes are added to a Dynamo cluster, a joining node takes over partitions until responsibility is again equal. Existing nodes hand off data for the appropriate key spaces. Cluster state is shared through a “gossip protocol” and nodes update their view of the cluster as it changes.
This is one of the major innovations in the Dynamo paper. Decoupling the partitioning scheme from how partitions are assigned to physical nodes means that data load can be evenly shared by the cluster and adding nodes doesn’t require manually figuring out where to put data.
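A toy sketch of that rebalancing step: the ring's partitions stay fixed, and only their assignment to physical nodes changes when a node joins. The "take from whoever owns more than a fair share" policy and the node names are made up for illustration; real implementations also transfer the data itself and gossip the new claims.

```python
def rebalance(partition_owner: dict, new_node: str) -> dict:
    """Hand partitions to the joining node until ownership is roughly equal."""
    owners = sorted(set(partition_owner.values()) | {new_node})
    target = len(partition_owner) // len(owners)    # fair share per node
    counts = {node: 0 for node in owners}
    for owner in partition_owner.values():
        counts[owner] += 1
    assignment = dict(partition_owner)
    for partition, owner in sorted(assignment.items()):
        if counts[owner] > target and counts[new_node] < target:
            counts[owner] -= 1                      # existing node hands off
            counts[new_node] += 1
            assignment[partition] = new_node
    return assignment

ring = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}
print(rebalance(ring, "c"))   # each node now owns 2 of the 6 partitions
```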
Data in a relational database is generally replicated using a master/slave setup to ensure consistency.
Writes are applied through a master node, which ensures that the writes are applied consistently to both master and slave nodes.
Reads can occur at slave or master nodes, but writes must go through the master.
This can be problematic in the event of master failure.
It can take time for slave nodes to realize the master is no longer available and to elect a new master.
For that duration, the data will be unavailable for writes and updates.
The Dynamo paper states that there is no master that can cause write unavailability; all nodes are equal.
In a Dynamo-based system, all nodes responsible for a replica can serve read and write requests. The system uses eventual consistency – as opposed to all replicas seeing an update at the same time, all replicas will see updates eventually – in practice, usually a very small window until all replicas reflect the most recent value.
Dynamo systems also provide w and r values on requests to maintain availability despite failure, and to allow the developer to tune to some extent the “correctness” of reads and writes.
Lower w and r values produce high availability and lower latency.
With a w or r value equal to the number of replicas, higher correctness is possible.
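The interplay of the replica count N with r and w can be sketched with in-memory replicas. This is a deliberately simplified model (real systems pick replicas from a preference list, and writes can fail mid-flight); here the write simply stops after w acknowledgements, leaving the remaining replicas stale, which is exactly why r + w > N matters.

```python
# Hypothetical in-memory replicas: each stores a (timestamp, value) per key.
N = 3
replicas = [dict() for _ in range(N)]

def write(key, value, ts, w=2):
    """Return once `w` of the N replicas have acknowledged the write."""
    acks = 0
    for replica in replicas:
        replica[key] = (ts, value)      # in a real system some replicas fail
        acks += 1
        if acks >= w:
            return True                 # remaining replicas stay stale for now
    return acks >= w

def read(key, r=2):
    """Query `r` replicas and return the newest value seen. When r + w > N,
    at least one queried replica is guaranteed to hold the latest write."""
    responses = [rep[key] for rep in replicas[N - r:] if key in rep]
    return max(responses)[1] if responses else None

write("cart:42", ["carrots"], ts=1, w=2)  # reaches replicas 0 and 1 only
print(read("cart:42", r=2))               # overlaps replica 1 -> ['carrots']
print(read("cart:42", r=1))               # replica 2 missed the write -> None
```

With r=1 the read is faster and more available but can return stale (here, missing) data; raising r or w trades latency for correctness.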
In any system that uses an eventually consistent model and replicates data, you run the risk of divergent data or “entropy”. Two common scenarios for this are when not all writes have propagated to all nodes…
Or when different clients update the same datum concurrently.
The solution… looks like this.
A vector clock is a piece of metadata attached to each object.
It gives the developer a choice: to resolve a conflict at the data store level (“last write wins”), or let the client resolve it with business logic relevant to the use case.
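The descendant check behind that choice can be sketched as a comparison of counter maps. This is a minimal illustration: real vector clock implementations also prune old entries and track which node coordinated each update, and the node names below are made up.

```python
# A vector clock is per-object metadata: a map of {node_id: update_counter}.
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update that clock `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a descends from b"
    if descends(b, a):
        return "b descends from a"
    return "concurrent"   # siblings: resolve in the store or in the client

v1 = {"node-a": 2, "node-b": 1}
v2 = {"node-a": 1, "node-b": 1}                # an older version
v3 = {"node-a": 1, "node-b": 1, "node-c": 1}   # updated concurrently elsewhere
print(compare(v1, v2))   # a descends from b
print(compare(v1, v3))   # concurrent
```

Only the "concurrent" case forces a decision: last-write-wins at the data store level, or client-side business logic.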
Dynamo also provides some nice anti-entropy features, including read repair.
Dynamo also has a slick way to maintain write availability during node failure.
If a node becomes unavailable due to hardware failure or network partition, writes/updates for that node will go to a fallback.
This neighboring node will “hand off” data when the original node returns to the cluster.
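That hint-and-replay flow can be sketched with toy node objects. The single-fallback policy and all names here are illustrative; Dynamo actually walks a preference list of nodes to find a fallback.

```python
class Node:
    def __init__(self, name):
        self.name, self.up, self.data, self.hints = name, True, {}, []

def write(key, value, primary: Node, fallback: Node):
    if primary.up:
        primary.data[key] = value
    else:
        fallback.data[key] = value                    # stay write-available
        fallback.hints.append((primary, key, value))  # remember the real owner

def handoff(node: Node):
    """Called when a failed node returns: the neighbor replays hinted writes."""
    for primary, key, value in node.hints:
        if primary.up:
            primary.data[key] = value
    node.hints = [h for h in node.hints if not h[0].up]

a, b = Node("a"), Node("b")
a.up = False                  # node a fails
write("cart:7", ["hay"], primary=a, fallback=b)
a.up = True                   # node a rejoins the cluster
handoff(b)
print(a.data)                 # {'cart:7': ['hay']}
```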
Another significant point is how Dynamo changes application development and how we define developer productivity.
There is now more unstructured data in the world, more apps that don’t require a strong schema. Using a “schema-less” key/value data model, you can eliminate some of the need for extensive data model “pre-planning”, and change applications or develop new features without changing the underlying data model. It’s simpler and, for some applications that fit a key/value model, more productive.
A look at common apps and use cases for Dynamo-like systems and simple approaches to building data on them with a key/value scheme…
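As a sketch of those key designs, here is how the use cases might map onto a plain key/value interface. A Python dict stands in for the store, and all bucket and key names are made up for the example.

```python
store = {}

# Sessions: the session/user ID is a natural key, the value an opaque blob.
store[("sessions", "user-1234")] = {"cart": ["carrots"], "ttl": 3600}

# Logs and sensor data: a date or timestamp in the key makes "today's data"
# a single lookup instead of a query.
store[("logs", "2013-03-01")] = "..."
store[("sensor", "2013-03-01T10:00")] = {"temp": 21.5}

# Content: schema-less values mean JSON today, XML tomorrow, no migration.
store[("content", "welcome-post")] = '{"title": "Welcome", "body": "..."}'

print(store[("sessions", "user-1234")]["cart"])   # ['carrots']
```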
What does the future of NoSQL hold?
Many “NoSQL” systems have thrown out consistency in favor of availability…
But maybe we also threw out a lot of useful things in the process. What about higher-level data types, search, aggregation tasks, and the developer-friendly things we love in relational databases? We’ve done a lot so far to address some of the underlying architectural requirements of new apps and platforms, but there is more we can do to enable greater queriability and broader use cases.
There is growing research and practice around ways we can offer more data types on top of distributed systems, data types that are tolerant of an eventually consistent design.
At Basho, we are applying this research to offer more advanced data types like counters and sets in a future Riak release, and some implementations of this can already be seen in the wild.
And what about consistency? The author of CAP Theorem has stated that CAP doesn’t forbid CA during normal operations…
So can we create systems that can offer both availability and consistency or offer greater guarantees around consistency? Like offering conditional writes - failing a write if intervening requests occur, or if they don’t meet some other requirement? Or offering Paxos-like implementations to ensure strict ordering among replicas? What about consistent reads that always reflect the last written value?
The problem with offering other advanced features – like Search, MapReduce, storing terms and indexes - in distributed systems like Dynamo is that data is ALL AROUND THE CLUSTER… these tasks require finding data and performing tasks on many different nodes in the system.
Oftentimes, the response to date has been to run secondary clusters. However, this requires cluster replication and carries a higher operational burden.
What if we find ways to see data locality / even distribution as more of a spectrum, allowing us to perform advanced tasks without degrading performance or requiring secondary clusters?
Biology tells us that the rapid emergence of new species and variations, caused by significant events (like the Dynamo paper, the explosion of commodity hardware, cloud computing, the mobile/social revolution), can lead to new types of organisms (NoSQL, NewSQL, combinations of NoSQL and MySQL). But it’s our job to evolve them further, into more features and functionality than ever before!