Dynamo Systems - QCon SF 2012 Presentation
A look at Dynamo-based systems: the architectural principles, use cases and requirements; where they differ from relational databases; and where they are going.

  • Twitter.com/shanley, shanley@basho.com
  • Basho makes a highly available, open source, distributed database called Riak that is based on the principles of the Dynamo paper.
  • The Dynamo paper was published about 5 years ago and outlines a set of properties for a highly available key/value store that provides an “always-on” experience for end users. Dynamo is an internal service at Amazon that powers parts of AWS, including S3, and the Amazon.com shopping cart.
  • As the Dynamo paper discusses, a single page request to Amazon can result in requests to over 150 services, many of which have their own dependencies. High availability and low latency must be maintained across the entire service-oriented architecture for these properties to show up in the user-facing experience.
  • Principles from the Dynamo paper have spawned many open-source and commercial projects, including Cassandra (which came out of Facebook; commercial players include Datastax and Acunu) and Project Voldemort (an open-source project from LinkedIn). Riak, launched by Basho in 2009, is based on many of Dynamo’s principles for distributed systems and extends the core functionality with full-text search, multi-datacenter replication, MapReduce and other features.
  • The Amazon shopping cart is a canonical use case of an application requiring the high availability and low latency that Dynamo-based systems provide. Amazon itself serves as a classic example both of a new class of website/application with a different set of business requirements, and of a new breed of infrastructure provided by its Web Services platform. In both scenarios, unavailability has a direct impact on revenue, as well as on the user trust required for a healthy business.
  • Beyond availability and low latency, the Dynamo paper cites a number of other requirements for its data store. These are important design considerations that profoundly affect the implementation of the database.
  • In the past, these problems might have been addressed with a big server and some MySQL. Why hasn’t Amazon taken that route?
  • MySQL systems are consistent. Consistent systems will not be write-available during failure conditions, favoring correctness over serving requests. Ensuring write availability is a critical aspect of the Dynamo design.
  • CAP theorem states that in the event of network partition, systems must choose consistency or availability – not both.
  • However, CAP is only one of many considerations in database design. We are in the early stages of databases that are designed to handle new types of operational environments and application needs / constraints.
  • As the Dynamo paper shows, many other factors play a role – including the expense of hardware and people to maintain the system, and the need for a scale-out model.
  • Indeed, even in database designs like Google’s Spanner (in contrast, a consistent distributed database), some of the primary concerns beyond CAP are the ability to scale to multiple datacenters and the operational cost of data growth.
  • Perhaps it’s time for a new theorem!
  • Amazon’s application requirements in creating Dynamo.
  • Amazon’s definition of developer productivity in relation to Dynamo.
  • Amazon’s operational requirements.
  • It turns out Amazon’s needs echoed trends in the larger world. The big shift is not really about CAP – it’s the shift to distributed systems. Distributed systems aren’t just about DNS and CDNs anymore. The business needs that require a distributed system are relevant to more and more companies. As a result, more and more “old” ideas (databases, storage, monitoring) are being re-architected and re-thought for a new, distributed environment.
  • How requirements for applications have changed.
  • How the way we define developer productivity has changed.
  • How the operational environment we are building in has changed.
  • These are the other hard problems in distributed systems that touch CAP but also speak beyond it – areas the Dynamo paper posits approaches to. Dynamo can really be seen as a collection of technologies and approaches that produce the desired properties of the database… all interconnected in the circle of life. Let’s take a look at how these areas have been handled in a relational world vs how Dynamo addresses them.
  • In a relational world, you might have thrown all your data (let’s say you’re building a database of bunny pictures and information) on one big machine.
  • As you grew, you might shard the data across multiple machines through some logical division of data.
  • Sharding, however, can lead to hot spots in the database. Also, writing and maintaining sharding logic increases the overhead of operating and developing on the database. Significant growth of data or traffic typically means significant, often manual resharding projects. Figuring out how to intelligently split the dataset without negative impacts on performance, operations and development presents a significant challenge.
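    To make that overhead concrete, here is a minimal sketch (illustrative Python, not from the talk) of hand-written range-based shard routing, and how a skewed key distribution turns one shard into a hot spot:

        # Hypothetical range-based sharding by first letter of the key.
        SHARDS = {
            "shard-1": ("a", "g"),   # bunny names A-G
            "shard-2": ("h", "r"),   # bunny names H-R
            "shard-3": ("s", "z"),   # bunny names S-Z
        }

        def route(key: str) -> str:
            """Return the shard responsible for a key; this logic must be
            written, maintained, and reworked by hand as the data grows."""
            first = key[0].lower()
            for shard, (lo, hi) in SHARDS.items():
                if lo <= first <= hi:
                    return shard
            raise KeyError(key)

        # "Bunny", "Cuddles", and "Fluffy" all land on shard-1: a hot spot.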
  • Two of Dynamo’s goals were to reduce hot spots in the database, and reduce the operational burden of scale.
  • To accomplish this, Dynamo uses consistent hashing, a technique pioneered at Akamai. In consistent hashing, you start with a 160-bit integer space…
  • And divide it into equally sized partitions that form a range of possible values.
  • Keys are hashed onto the integer space. Keys are “owned” by the partition they are hashed onto, and partitions, or virtual nodes, are claimed by physical nodes in the cluster. The even distribution of data created by the hashing function, and the fact that physical nodes share responsibility for partitions, results in a cluster that shares the load.
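    A minimal sketch of the idea (illustrative Python; the partition count and the round-robin claim are simplifications of the paper’s scheme):

        import hashlib

        RING_SIZE = 2 ** 160              # SHA-1 yields a 160-bit integer space
        NUM_PARTITIONS = 64               # equally sized partitions (virtual nodes)
        PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS

        def key_to_partition(key: bytes) -> int:
            """Hash the key onto the ring; the partition it falls in owns it."""
            h = int.from_bytes(hashlib.sha1(key).digest(), "big")
            return h // PARTITION_SIZE

        def partition_owner(partition: int, nodes: list) -> str:
            """Physical nodes claim partitions; here, naively round-robin."""
            return nodes[partition % len(nodes)]

        nodes = ["node-a", "node-b", "node-c"]
        p = key_to_partition(b"fluffy")
        print(p, partition_owner(p, nodes))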
  • In a relational world, scaling up might mean getting bigger machines…
  • Or re-sharding your sharding scheme.
  • At massive scale though, this might land you in sharding hell.
  • You might note that the word “shard” appears zero times in the Dynamo paper.
  • When new nodes are added to a Dynamo cluster, a joining node takes over partitions until responsibility is again equal. Existing nodes hand off data for the appropriate key spaces. Cluster state is shared through a gossip protocol, and nodes update their view of the cluster as it changes.
  • This is one of the major innovations in the Dynamo paper. Decoupling the partitioning scheme from how partitions are assigned to physical nodes means that data load can be evenly shared by the cluster, and adding nodes doesn’t require manually figuring out where to put data.
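    A sketch of that decoupling (hypothetical; real systems gossip claims rather than recomputing a global table, and move fewer partitions than this naive modulo scheme would):

        def assign(partitions: int, nodes: list) -> dict:
            """Compute partition ownership from cluster membership alone;
            no human decides where data lives."""
            return {p: nodes[p % len(nodes)] for p in range(partitions)}

        before = assign(64, ["node-a", "node-b", "node-c"])
        after = assign(64, ["node-a", "node-b", "node-c", "node-d"])

        # Partitions whose owner changed are handed off to the joining node.
        moves = [p for p in before if before[p] != after[p]]
        print(len(moves), "partitions hand off data after the join")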
  • Data in a relational database is generally replicated using a master/slave setup to ensure consistency.
  • Writes are applied through a master node, which ensures that the writes are applied consistently to both master and slave nodes.
  • Reads can occur at slave or master nodes, but writes must go through the master.
  • This can be problematic in the event of master failure.
  • It can take time for slave nodes to realize the master is no longer available and to elect a new master.
  • In the interim, that data will be unavailable for writes and updates.
  • The Dynamo paper states that there should be no master whose failure can cause write unavailability: all nodes are equal.
  • In a Dynamo-based system, all nodes responsible for a replica can serve read and write requests. The system uses eventual consistency – as opposed to all replicas seeing an update at the same time, all replicas will see updates eventually – in practice, usually a very small window until all replicas reflect the most recent value.
  • Dynamo systems also provide w and r values on requests to maintain availability despite failure, and to allow the developer to tune to some extent the “correctness” of reads and writes.
  • Lower w and r values produce higher availability and lower latency.
  • With a w or r value equal to the number of replicas, higher correctness is possible.
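    A sketch of a tunable write quorum (illustrative only; `replicas` stands in for the N nodes responsible for the key, and reads with an r value work the same way):

        class QuorumError(Exception):
            pass

        def quorum_write(replicas, key, value, w):
            """Succeed once w of the replicas acknowledge the write."""
            acks = 0
            for replica in replicas:
                try:
                    replica.put(key, value)
                    acks += 1
                except ConnectionError:
                    continue              # tolerate failed or partitioned replicas
                if acks >= w:
                    return True           # w=1: fast and available; w=N: stricter
            raise QuorumError("only %d of %d required acks" % (acks, w))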
  • In any system that uses an eventually consistent model and replicates data, you run the risk of divergent data or “entropy”. Two common scenarios for this are when not all writes have propagated to all nodes…
  • Or when different clients update the same datum concurrently.
  • The solution… looks like this.
  • A vector clock is a piece of metadata attached to each object.
  • It gives the developer a choice: to resolve a conflict at the data store level (“last write wins”), or let the client resolve it with business logic relevant to the use case.
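    A sketch of that comparison (vector clocks as dicts mapping actor to update count; a hypothetical helper, not Riak’s actual API):

        def descends(a, b):
            """True if clock a has seen every update that clock b has."""
            return all(a.get(actor, 0) >= n for actor, n in b.items())

        def resolve(v1, clock1, v2, clock2):
            if descends(clock1, clock2):
                return v1                  # v1 is a descendant; last write wins
            if descends(clock2, clock1):
                return v2
            return (v1, v2)                # concurrent siblings: client decides

        print(resolve("cart-a", {"x": 2, "y": 1}, "cart-b", {"x": 1, "y": 2}))
        # -> ('cart-a', 'cart-b'): neither clock descends the other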
  • Dynamo also provides some nice anti-entropy features, including read repair.
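    Read repair, sketched (assuming each replica returns a (value, version) pair, with a scalar version standing in for the vector-clock comparison above):

        def read_with_repair(replicas, key):
            """Read all replicas; push the freshest value back to stale ones."""
            results = {replica: replica.get(key) for replica in replicas}
            freshest = max(results.values(), key=lambda pair: pair[1])
            for replica, (value, version) in results.items():
                if version < freshest[1]:
                    replica.put(key, freshest[0])   # repair the stale replica
            return freshest[0]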
  • Dynamo also has a slick way to maintain write availability during node failure.
  • If a node becomes unavailable due to hardware failure or network partition, writes/updates for that node will go to a fallback.
  • This neighboring node will “hand off” data when the original node returns to the cluster.
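    A sketch of hinted handoff (hypothetical node objects; the fallback stashes each write with a “hint” naming the intended owner and replays them when the owner rejoins):

        hints = {}   # intended owner -> [(key, value), ...]

        def write(key, value, primary, fallback):
            try:
                primary.put(key, value)
            except ConnectionError:
                fallback.put(key, value)                    # stay write-available
                hints.setdefault(primary.name, []).append((key, value))

        def handoff(primary):
            """When the failed node returns, replay the hinted writes to it."""
            for key, value in hints.pop(primary.name, []):
                primary.put(key, value)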
  • Another significant point is how Dynamo changes application development and how we define developer productivity.
  • There is now more unstructured data in the world, and more apps that don’t require a strong schema. Using a “schema-less” key/value data model, you can eliminate some of the need for extensive data model “pre-planning”, and change applications or develop new features without changing the underlying data model. It’s simpler and, for applications that fit a key/value model, more productive.
  • A look at common apps and use cases for Dynamo-like systems and simple approaches to building data on them with a key/value scheme…
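    For instance, a session store under a key/value scheme (hypothetical keys and payload; any Dynamo-style put/get interface would do):

        import json

        # Key: user/session ID. Value: opaque session data. Adding a field
        # later requires no schema migration.
        session_key = "session/3f2a9c"
        session_value = json.dumps({"user": "fluffy", "cart": ["carrots"]})

        # store.put(session_key, session_value)
        # data = json.loads(store.get(session_key))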
  • What does the future of NoSQL hold?
  • Many “NoSQL” systems have thrown out consistency in favor of availability…
  • But maybe we also threw out a lot of useful things in the process. What about higher-level data types, search, aggregation tasks, and the developer-friendly things we love in relational databases? We’ve done a lot so far to address some of the underlying architectural requirements of new apps and platforms, but there is more we can do to enable greater queryability and broader use cases.
  • There is growing research and practice around ways we can offer more data types on top of distributed systems, data types that are tolerant of an eventually consistent design.
  • At Basho, we are applying this research to offer more advanced data types like counters and sets in a future release of Riak, and some implementations of this can already be seen in the wild.
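    One such type, sketched: a grow-only counter where each replica increments only its own slot and merges take per-slot maxima, so replicas converge regardless of merge order (illustrative, not Riak’s implementation):

        class GCounter:
            """Grow-only counter: update on any replica, merge anywhere."""
            def __init__(self, replica_id):
                self.id = replica_id
                self.counts = {}

            def incr(self, n=1):
                self.counts[self.id] = self.counts.get(self.id, 0) + n

            def merge(self, other):
                for rid, n in other.counts.items():
                    self.counts[rid] = max(self.counts.get(rid, 0), n)

            def value(self):
                return sum(self.counts.values())

        a, b = GCounter("a"), GCounter("b")
        a.incr(); b.incr(2)
        a.merge(b)
        print(a.value())   # 3, regardless of merge order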
  • And what about consistency? The author of the CAP theorem has stated that CAP doesn’t forbid CA during normal operations…
  • So can we create systems that can offer both availability and consistency or offer greater guarantees around consistency? Like offering conditional writes - failing a write if intervening requests occur, or if they don’t meet some other requirement? Or offering Paxos-like implementations to ensure strict ordering among replicas? What about consistent reads that always reflect the last written value?
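    A conditional write, sketched as compare-and-set (hypothetical store interface; a real system would check and apply atomically on the server):

        def conditional_put(store, key, expected_version, new_value):
            """Fail the write if an intervening request changed the object."""
            value, version = store.get(key)
            if version != expected_version:
                raise ValueError("conflict: object changed since it was read")
            store.put(key, new_value, version + 1)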
  • The problem with offering other advanced features – like Search, MapReduce, storing terms and indexes - in distributed systems like Dynamo is that data is ALL AROUND THE CLUSTER… these tasks require finding data and performing tasks on many different nodes in the system.
  • Oftentimes, the response to date has been to run secondary clusters. However, this requires cluster replication and carries a higher operational burden.
  • What if we find ways to see data locality / even distribution as more of a spectrum, allowing us to perform advanced tasks without degrading performance or requiring secondary clusters?
  • Biology tells us that…
  • rapid emergence of new species and variations…
  • caused by significant events (like the Dynamo paper, the explosion of commodity hardware, cloud computing, the mobile/social revolution)…
  • can lead to new types of organisms (NoSQL, NewSQL, combinations of NoSQL and MySQL)…
  • But it’s our job to evolve them further, into more features and functionality than ever before!
  • Because it’s hard….

Dynamo Systems - QCon SF 2012 Presentation Transcript

  • Dynamo: Theme and Variations
  • @shanley
  • Riak
  • 150 Services
  • Global access; multiple machines; multiple datacenters; scale to peak loads easily; tolerance of continuous failure
  • “Traditionally production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled personnel for its operation, making it a very inefficient solution. In addition, the available replication technologies are limited and typically choose consistency over availability. Although many advances have been made in the recent years, it is still not easy to scale-out databases or use smart partitioning schemes for load balancing.” (Dynamo: Amazon’s Highly Available Key-value Store)
  • CAP Theorem
  • People tend to focus on consistency/availability as the sole driver of emerging database models because it provides a simple and academic explanation for more complex evolutionary factors. In fact, the CAP theorem, according to its original author, “prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare… there is little reason to forfeit C or A when the system is not partitioned.” In reality, a much larger range of considerations and tradeoffs have informed the “NoSQL” movement…
  • “Traditionally production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled personnel for its operation, making it a very inefficient solution. In addition, the available replication technologies are limited and typically choose consistency over availability. Although many advances have been made in the recent years, it is still not easy to scale-out databases or use smart partitioning schemes for load balancing.” (Dynamo: Amazon’s Highly Available Key-value Store)
  • “Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database… It is the first system to distribute data at global scale and support externally-consistent distributed transactions... Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows… Spanner’s main focus is managing cross-datacenter replicated data… Spanner started… as part of a rewrite of Google’s advertising backend called F1 [35]. This backend was originally based on a MySQL database… Resharding this revenue-critical database as it grew in the number of customers and their data was extremely costly. The last resharding took over two years of intense effort…” (Spanner: Google’s Globally-Distributed Database)
  • Shanley’s Theorem
  • Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
  • The requirements of the app: stringent latency requirements measured at the 99.9th percentile; highly available; always writeable; modeled as keys/values.
  • The profile of developer productivity: choice to manage conflict resolution themselves or on the data store level; simple, primary-key-only interface; no need for a relational data model.
  • The limitations of the operational scenario: functions on commodity hardware; each object must be replicated across multiple DCs; can scale out one node at a time with minimal impact on system and operators.
  • Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the…
  • The requirements of the app have changed: 1995: less than 40 million internet users, now 2.4 billion; latency perceived as unavailability; new types of applications.
  • The profile of developer productivity has changed: much more data; unstructured data; new kinds of business requirements; app scales gracefully without high development overhead.
  • The limitations of the operational scenario have changed: scale-out design on less expensive hardware; ability to easily meet peak loads; run efficiently across multiple sites; low operational burden.
  • Aspects of the database: how to distribute data around the cluster; adding new nodes; replicating data; resolving data conflicts; dealing with failure scenarios; the data model
  • Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the… how to distribute data around the cluster
  • Bunny Names A-G; Bunny Names H-R; Bunny Names R-Z (how to distribute data around the cluster)
  • Bunny names disproportionately trend towards Bunny, Cuddles, Fluffy, Mr. Bunny, Peter Rabbit, Velveteen, Peter Cottontail, and Mitten (how to distribute data around the cluster)
  • how to distribute data around the cluster
  • Reduce risk of hot spots in the database; data is automatically assigned to nodes (how to distribute data around the cluster)
  • adding new nodes
  • Bunny Names A-G (adding new nodes)
  • Bunny Names A-G, Bunny Names A-G, Bunny Names A-G… (adding new nodes)
  • “decoupling of partitioning and partition placement” (adding new nodes)
  • replicating data
  • master, slave, slave (replicating data)
  • writes: master, slave, slave (replicating data)
  • writes: master; reads: slave, slave (replicating data)
  • master, slave, slave (replicating data)
  • Availability Clock: 1. time to figure out the master is gone; 2. master election (replicating data)
  • Availability Clock: Consistency > Availability; unavailable to writes until data can be confirmed correct (replicating data)
  • replicating data
  • “Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities.” (replicating data)
  • Clients can read/write to any node; all updates reach all replicas eventually (replicating data)
  • w and r values (replicating data)
  • the number of replicas that need to participate in a read/write for a success response (replicating data)
  • put( ) with w=1: only one node needs to be available to complete the write request (replicating data)
  • put( ) with w=3 (replicating data)
  • reads when not all writes have propagated (laggy or down node) (resolving conflicts)
  • different clients update at the exact same time (resolving conflicts)
  • a85hYGBgzmDKBVIsTFUPPmcwJTLmsTIcmsJ1nA8qzK7HcQwqfB0hzNacxCYWcA1ZIgsA
  • vector clocks show relationships between objects: whether one object is a direct descendant of the other; whether the objects are direct descendants of a common parent; whether the objects are unrelated in recent heritage (resolving conflicts)
  • the vector clock is updated when objects are updated; last-write-wins, or conflicts can be resolved on the client side (resolving conflicts)
  • if stale responses are returned as part of the read, those replicas are updated (resolving conflicts)
  • failure conditions
  • n = 3 (failure conditions)
  • writes & updates n = 3failure conditions
  • hinted handoff (failure conditions)
  • developing apps
  • “Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS.” (developing apps)
  • “schema-less”: more flexibility, agility (developing apps)
  • developing apps, key/value examples:
    Session: User/Session ID → Session Data
    Advertising: Campaign ID → Ad Content
    Logs: Date → Log File
    Sensor: Date, Date/Time → Updates
    User Data: Login, Email, UUID → User Attributes
    Content: Title, Integer, etc. → Text, JSON, XML
  • future
  • more data types
  • counters; sets; server-side structure and conflict resolution policy (more data types)
  • “there is little reason to forfeit C or A when the system is not partitioned” (strong consistency)
  • conditional writes; consistent reads (strong consistency)
  • other advanced features
  • metadata; aggregation tasks; search (other advanced features)
  • other advanced features
  • In summary…
  • rapid evolutionary change
  • significant events
  • explosion of new systems
  • evolving into higher-order systems
  • We’re hiring. shanley@basho.com