Dynamo Systems - QCon SF 2012 Presentation
A look at Dynamo-based systems: the architectural principles, use cases and requirements; where they differ from relational databases; and where they are going.

  • Twitter.com/shanley, shanley@basho.com
  • Basho makes a highly available, open source, distributed database called Riak that is based on the principles of the Dynamo paper.
  • The Dynamo paper was published about 5 years ago and outlines a set of properties for a highly available key/value store that provides an “always-on” experience for end users. Dynamo is an internal service at Amazon that powers parts of AWS, including S3, and the Amazon.com shopping cart.
  • As the Dynamo paper discusses, a single page request to Amazon can result in requests to over 150 services, many of which have their own dependencies. High availability and low latency must be maintained across the entire service-oriented architecture for these properties to show up in the user-facing experience.
  • Principles from the Dynamo paper have spawned many open-source and commercial projects, including Cassandra (which came out of Facebook; commercial players include DataStax and Acunu) and Project Voldemort (an open-source project from LinkedIn). Basho’s Riak, launched in 2009, is based on many of Dynamo’s principles for distributed systems and extends the core functionality with full-text search, multi-datacenter replication, MapReduce, and other features.
  • The Amazon shopping cart is a canonical use case of an application requiring the high availability and low-latency that Dynamo-based systems provide. Amazon itself serves as a classic example of both a new class of website/application with a different set of business requirements, as well as a new breed of infrastructure provided by its Web Services platform. For both scenarios, unavailability has a direct impact on revenue, as well as on the user trust required for a healthy business.
  • Beyond availability and low latency, the Dynamo paper cites a number of other requirements for its data store. These are important design considerations that profoundly affect the implementation of the database.
  • In the past, these problems might have been addressed with a big server and some MySQL. Why hasn’t Amazon taken that route?
  • MySQL systems are consistent. Consistent systems will not be write-available during failure conditions, prioritizing correctness over serving requests. Ensuring write availability is a critical aspect of the Dynamo design.
  • CAP theorem states that in the event of network partition, systems must choose consistency or availability – not both.
  • However, CAP is only one of many considerations in database design. We are in the early stages of databases that are designed to handle new types of operational environments and application needs / constraints.
  • As the Dynamo paper shows, many other factors play a role – including the expense of hardware and people to maintain the system, and the need for a scale-out model.
  • Indeed, even in database designs like Google’s Spanner (a consistent distributed database, by contrast), some of the primary concerns beyond CAP are the ability to scale to multiple datacenters and the operational cost of data growth.
  • Perhaps it’s time for a new theorem!
  • Amazon’s application requirements in creating Dynamo.
  • Amazon’s definition of developer productivity in relation to Dynamo.
  • Amazon’s operational requirements.
  • Turns out, Amazon’s needs were echoing trends in the larger world. The big shift is not about CAP – it’s the shift to distributed systems. Distributed systems aren’t just about DNS and CDNs anymore; the business needs that require a distributed system are relevant to more and more companies. As a result, “old” ideas (databases, storage, monitoring) are being re-architected and re-thought for a new, distributed environment.
  • How requirements for applications have changed.
  • How the way we define developer productivity has changed.
  • How the operational environment we are building in has changed.
  • These are the other hard problems in distributed systems that touch CAP but also speak beyond it – areas the Dynamo paper posits approaches to. Dynamo can really be seen as a collection of technologies and approaches that produce the desired properties of the database… all interconnected in the circle of life. Let’s take a look at how these areas have been handled in a relational world vs how Dynamo addresses them.
  • In a relational world, you might have thrown all your data (let’s say you’re building a database of bunny pictures and information) on one big machine.
  • As you grew, you might shard the data across multiple machines through some logical division of data.
  • Sharding, however, can lead to hot spots in the database. Also, writing and maintaining sharding logic increases the overhead of operating and developing on the database. Significant growth of data or traffic typically means significant, often manual resharding projects. Figuring out how to intelligently split the dataset without negative impacts on performance, operations and development presents a significant challenge.
  • Two of Dynamo’s goals were to reduce hot spots in the database, and reduce the operational burden of scale.
  • To accomplish this, Dynamo uses consistent hashing, a technique first developed at Akamai. In consistent hashing, you start with a 160-bit integer space…
  • And divide it into equally sized partitions that form a range of possible values.
  • Keys are hashed onto the integer space. Keys are “owned” by the partition they hash onto, and partitions, or virtual nodes, are claimed by physical nodes in the cluster. The even distribution of data created by the hashing function, and the fact that physical nodes share responsibility for partitions, result in a cluster that shares the load (see the sketch below).
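As a rough illustration of that note, here is a minimal consistent-hashing sketch in Python. It is a toy model, not Dynamo’s or Riak’s implementation: the names (`ring_position`, `claim_partitions`) and the simple round-robin claim policy are invented for illustration, though the 160-bit SHA-1 space and fixed, equal-size partitions follow the paper.

```python
import hashlib

RING_SIZE = 2 ** 160                           # SHA-1 gives a 160-bit integer space
NUM_PARTITIONS = 64                            # divided into equally sized partitions
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS

def ring_position(key: str) -> int:
    """Hash a key onto the 160-bit integer space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def partition_for(key: str) -> int:
    """A key is 'owned' by the partition whose range it hashes into."""
    return ring_position(key) // PARTITION_SIZE

def claim_partitions(nodes: list) -> dict:
    """Physical nodes claim partitions (virtual nodes); a simple
    round-robin claim spreads the load evenly across the cluster."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

claims = claim_partitions(["node-a", "node-b", "node-c"])
key = "bunny/fluffy"
print(partition_for(key), "->", claims[partition_for(key)])
```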
  • In a relational world, scaling up might mean getting bigger machines…
  • Or re-sharding your sharding scheme.
  • At massive scale though, this might land you in sharding hell.
  • You might note that the word “shard” appears zero times in the Dynamo paper.
  • When new nodes are added to a Dynamo cluster, a joining node takes over partitions until responsibility is again equal. Existing nodes hand off data for the appropriate key spaces. Cluster state is shared through a “gossip protocol”, and nodes update their view of the cluster as it changes.
  • This is one of the major innovations in the Dynamo paper. Decoupling the partitioning scheme from how partitions are assigned to physical nodes means that data load can be shared evenly by the cluster, and adding nodes doesn’t require manually figuring out where to put data (sketched below).
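Continuing the toy model above, a hedged sketch of why that decoupling helps: because the partitions are fixed, a joining node only claims partitions until ownership is roughly equal, and only those partitions’ data moves; no key ever has to be rehashed. (Real systems spread the resulting claim via gossip; that part is omitted here.)

```python
def rebalance(claims: dict, new_node: str) -> dict:
    """Hand partitions to a joining node until ownership is roughly equal."""
    nodes = sorted(set(claims.values()) | {new_node})
    fair_share = len(claims) // len(nodes)
    counts = {n: 0 for n in nodes}
    for owner in claims.values():
        counts[owner] += 1
    new_claims = dict(claims)
    for partition, owner in claims.items():
        if counts[new_node] >= fair_share:
            break                              # the new node has its fair share
        if counts[owner] > fair_share:         # overloaded node hands this one off
            new_claims[partition] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
    return new_claims

claims = rebalance(claims, "node-d")           # only ~1/4 of the partitions move
```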
  • Data in a relational database is generally replicated using a master/slave setup to ensure consistency.
  • Writes are applied through a master node, which ensures that the writes are applied consistently to both master and slave nodes.
  • Reads can occur at slave or master nodes, but writes must go through the master.
  • This can be problematic in the event of master failure.
  • It can take time for slave nodes to realize the master is no longer available and to elect a new master.
  • In the interim, that data will be unavailable to writes and updates.
  • Dynamo paper states that there is no master that can cause write unavailability, and that all nodes are equal.
  • In a Dynamo-based system, all nodes responsible for a replica can serve read and write requests. The system uses eventual consistency: rather than all replicas seeing an update at the same time, all replicas see updates eventually; in practice, the window before all replicas reflect the most recent value is usually very small.
  • Dynamo systems also provide w and r values on requests to maintain availability despite failure, and to allow the developer to tune to some extent the “correctness” of reads and writes.
  • Lower w and r values produce high availability and lower latency.
  • With a w or r value equal to the number of replicas, higher correctness is possible.
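In pseudocode-like Python, the tunable quorum behavior might look like the sketch below. The `replica.put`/`replica.get` interface is hypothetical; the point is only that a request succeeds as soon as w (or r) of the n replicas respond.

```python
def quorum_write(replicas, key, value, w):
    """Succeed as soon as w of the n replicas acknowledge the write."""
    acks = 0
    for replica in replicas:                   # the n replicas for this key
        if replica.put(key, value):            # False if the replica is down
            acks += 1
            if acks >= w:
                return True                    # w=1: any single live node suffices
    return False                               # fewer than w replicas available

def quorum_read(replicas, key, r):
    """Return once r replicas respond; values may still be diverging."""
    responses = []
    for replica in replicas:
        value = replica.get(key)
        if value is not None:
            responses.append(value)
            if len(responses) >= r:
                return responses
    return None                                # fewer than r replicas available
```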
  • In any system that uses an eventually consistent model and replicates data, you run the risk of divergent data or “entropy”. Two common scenarios for this are when not all writes have propagated to all nodes…
  • Or when different clients update the same datum concurrently.
  • The solution… looks like this.
  • A vector clock is a piece of metadata attached to each object.
  • It gives the developer a choice: to resolve a conflict at the data store level (“last write wins”), or let the client resolve it with business logic relevant to the use case.
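A minimal vector-clock sketch, assuming a clock is just a map from actor to update count (real clocks, like the opaque encoded string on slide 54, are also pruned and timestamped): comparing two clocks tells you whether one version descends from the other, or whether they are concurrent siblings the client must resolve.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def resolve(value_a, clock_a, value_b, clock_b):
    if descends(clock_a, clock_b):
        return value_a                  # a is a descendant of b: it wins outright
    if descends(clock_b, clock_a):
        return value_b
    return (value_a, value_b)           # concurrent siblings: client decides

# two clients updated concurrently from the same parent {"c1": 1}
print(resolve("cart-v2", {"c1": 2}, "cart-v2b", {"c1": 1, "c2": 1}))
```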
  • Dynamo also provides some nice anti-entropy features, including read repair.
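Read repair, sketched with the same assumed shapes (reusing `descends` from the sketch above): if a read gathers replies from several replicas and some are stale, the coordinator writes the winning value back to them in the background.

```python
def read_repair(key, responses):
    """responses: list of (replica, value, clock) gathered by a quorum read."""
    # find a response whose clock descends from every other clock
    winner = next((r for r in responses
                   if all(descends(r[2], o[2]) for o in responses)), None)
    if winner is None:
        return [r[1] for r in responses]   # concurrent siblings: client resolves
    for replica, value, clock in responses:
        if clock != winner[2]:
            replica.put(key, winner[1])    # bring stale replicas up to date
    return winner[1]
```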
  • Dynamo also has a slick way to maintain write availability during node failure.
  • If a node becomes unavailable due to hardware failure or network partition, writes/updates for that node will go to a fallback.
  • This neighboring node will “hand off” data when the original node returns to the cluster.
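A toy sketch of hinted handoff (the `Node` class and its fields are invented for illustration): a fallback node accepts writes on behalf of a down neighbor, tagging them with a hint naming the intended owner, then hands them off when the owner rejoins.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}                   # key -> value this node owns
        self.hints = {}                  # intended owner -> {key: value}

def write_with_handoff(key, value, owner, fallback):
    if owner.up:
        owner.data[key] = value
    else:
        # the fallback takes the write, remembering who it belongs to
        fallback.hints.setdefault(owner.name, {})[key] = value

def hand_off(fallback, owner):
    """When the owner rejoins the cluster, hinted data is handed back."""
    for key, value in fallback.hints.pop(owner.name, {}).items():
        owner.data[key] = value

a, b = Node("node-a"), Node("node-b")
a.up = False
write_with_handoff("cart/123", ["carrots"], owner=a, fallback=b)
a.up = True
hand_off(b, a)                           # node-a now holds cart/123 again
```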
  • Another significant point is how Dynamo changes application development and how we define developer productivity.
  • There is now more unstructured data in the world and more apps that don’t require a strong schema. With a “schema-less” key/value data model, you can eliminate some of the need for extensive data-model pre-planning, and change applications or develop new features without changing the underlying data model. It’s simpler and, for applications that fit a key/value model, more productive.
  • A look at common apps and use cases for Dynamo-like systems and simple approaches to building data on them with a key/value scheme…
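For example, a session store reduces to put/get by primary key. The sketch below is written against the Riak Python client; treat the exact calls as approximate (the client API has changed across versions), and note that any other key/value client would look much the same.

```python
import riak  # Riak's Python client; exact API is version-dependent

client = riak.RiakClient()                    # assumes a local Riak node
sessions = client.bucket("session")

# key = user/session ID, value = schema-less session data
obj = sessions.new("session-12345",
                   data={"user": "fluffy", "cart": ["bunny-slippers"]})
obj.store()

# retrieve by primary key: the only access pattern these apps need
print(sessions.get("session-12345").data)
```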
  • What does the future of NoSQL hold?
  • Many “NoSQL” systems have thrown out consistency in favor of availability…
  • But maybe we also threw out a lot of useful things in the process. What about higher-level data types, search, aggregation tasks, and the developer-friendly things we love in relational databases? We’ve done a lot so far to address the underlying architectural requirements of new apps and platforms, but there is more we can do to enable greater queryability and broader use cases.
  • There is growing research and practice around ways we can offer more data types on top of distributed systems, data types that are tolerant of an eventually consistent design.
  • In Riak, we are applying this research to offer more advanced data types like counters and sets in a future release, and some implementations of this can already be seen in the wild.
  • And what about consistency? The author of CAP Theorem has stated that CAP doesn’t forbid CA during normal operations…
  • So can we create systems that can offer both availability and consistency or offer greater guarantees around consistency? Like offering conditional writes - failing a write if intervening requests occur, or if they don’t meet some other requirement? Or offering Paxos-like implementations to ensure strict ordering among replicas? What about consistent reads that always reflect the last written value?
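One common building block for such guarantees is a conditional (compare-and-set) write: reject the write if an intervening update has changed the object since the client last read it. A generic sketch over a hypothetical store, not a specific Riak API of the time:

```python
def conditional_put(store, key, expected_version, new_value):
    """Apply the write only if no intervening update has occurred."""
    value, version = store.get(key)       # version could be a vector clock
    if version != expected_version:
        return False                      # intervening write: caller re-reads, retries
    store.put(key, new_value)             # safe to apply
    return True
```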
  • The problem with offering other advanced features – like search, MapReduce, and storing terms and indexes – in distributed systems like Dynamo is that data is ALL AROUND THE CLUSTER… these tasks require locating data and performing work on many different nodes in the system.
  • Oftentimes, the response to date has been to run secondary clusters. However, this requires cluster replication and carries a higher operational burden.
  • What if we find ways to see data locality / even distribution as more of a spectrum, allowing us to perform advanced tasks without degrading performance or requiring secondary clusters?
  • Biology tells us that…
  • …rapid emergence of new species and variations…
  • Caused by significant events (like the Dynamo paper, the explosion of commodity hardware, cloud computing, mobile/social revolution)…
  • Can lead to new types of organism (NoSQL, newSQL, combinations of NoSQL and MySQL)…
  • But it’s our job to evolve them further, into systems with more features and functionality than ever before!
  • Because it’s hard….

    1. Dynamo: Theme and Variations
    2. @shanley
    3. Riak
    4. 150 Services
    5. • Global access • Multiple machines • Multiple datacenters • Scale to peak loads easily • Tolerance of continuous failure
    6. “Traditionally production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled personnel for its operation, making it a very inefficient solution. In addition, the available replication technologies are limited and typically choose consistency over availability. Although many advances have been made in the recent years, it is still not easy to scale-out databases or use smart partitioning schemes for load balancing.” (Dynamo: Amazon’s Highly Available Key-value Store)
    7. CAP Theorem
    8. People tend to focus on consistency/availability as the sole driver of emerging database models because it provides a simple and academic explanation for more complex evolutionary factors. In fact, CAP Theorem, according to its original author, “prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare… there is little reason to forfeit C or A when the system is not partitioned.” In reality, a much larger range of considerations and tradeoffs have informed the “NoSQL” movement…
    9. (The Dynamo quote from slide 6, repeated.)
    10. “Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database… It is the first system to distribute data at global scale and support externally-consistent distributed transactions… Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows… Spanner’s main focus is managing cross-datacenter replicated data… Spanner started… as part of a rewrite of Google’s advertising backend called F1 [35]. This backend was originally based on a MySQL database… Resharding this revenue-critical database as it grew in the number of customers and their data was extremely costly. The last resharding took over two years of intense effort…” (Spanner: Google’s Globally-Distributed Database)
    11. Shanley’s Theorem
    12. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the limitations of the operational scenario.
    13. The requirements of the app (Amazon’s): • Stringent latency requirements measured at the 99.9th percentile • Highly available • Always writeable • Modeled as keys/values
    14. Developer productivity (Amazon’s): • Choice to manage conflict resolution themselves or on the data store level • Simple, primary-key-only interface • No need for a relational data model
    15. The operational scenario (Amazon’s): • Functions on commodity hardware • Each object must be replicated across multiple DCs • Can scale out one node at a time with minimal impact on system and operators
    16. Database design is driven by a virtuous tension between the requirements of the app, the profile of developer productivity, and the…
    17. How app requirements have changed: • 1995: less than 40 million internet users; now: 2.4 billion • Latency perceived as unavailability • New types of applications
    18. How developer productivity has changed: • Much more data • Unstructured data • New kinds of business requirements • App scales gracefully without high development overhead
    19. How the operational environment has changed: • Scale-out design on less expensive hardware • Ability to easily meet peak loads • Run efficiently across multiple sites • Low operational burden
    20. Aspects of the database: • How to distribute data around the cluster • Adding new nodes • Replicating data • Resolving data conflicts • Dealing with failure scenarios • Data model
    21. how to distribute data around the cluster
    22. how to distribute data around the cluster
    23. Bunny Names A-G | Bunny Names H-R | Bunny Names R-Z – how to distribute data around the cluster
    24. Bunny Names Disproportionately Trend Towards Bunny, Cuddles, Fluffy, Mr. Bunny, Peter Rabbit, Velveteen, Peter Cottontail, and Mitten – how to distribute data around the cluster
    25. how to distribute data around the cluster
    26. • Reduce risk of hot spots in the database • Data is automatically assigned to nodes – how to distribute data around the cluster
    27. how to distribute data around the cluster
    28. how to distribute data around the cluster
    29. how to distribute data around the cluster
    30. adding new nodes
    31. adding new nodes
    32. Bunny Names A-G – adding new nodes
    33. A-G | Bunny Names A-G | Bunny Names A-G | Bunny Na… – adding new nodes
    34. adding new nodes
    35. adding new nodes
    36. “decoupling of partitioning and partition placement” – adding new nodes
    37. replicating data
    38. [Diagram: one master, two slaves] – replicating data
    39. [Diagram: writes go to the master, which replicates to the slaves] – replicating data
    40. [Diagram: writes go to the master; reads can go to the slaves] – replicating data
    41. [Diagram: one master, two slaves] – replicating data
    42. Availability Clock: 1. Time to figure out the master is gone; 2. Master election – replicating data
    43. Availability Clock: Consistency > Availability; unavailable to writes until data can be confirmed correct – replicating data
    44. replicating data
    45. “Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities.” – replicating data
    46. [Diagram: reads and writes arriving at every node in the cluster] – replicating data
    47. • Clients can read / write to any node • All updates reach all replicas eventually – replicating data
    48. w and r values – replicating data
    49. Number of replicas that need to participate in a read/write for a success response – replicating data
    50. put(…) with w=1: only one node needs to be available to complete the write request – replicating data
    51. put(…) with w=3 – replicating data
    52. Reads when not all writes have propagated (laggy or down node) – resolving conflicts
    53. Different clients update at the exact same time – resolving conflicts
    54. a85hYGBgzmDKBVIsTFUPPmcwJTLmsTIcmsJ1nA8qzK7HcQwqfB0hzNacxCYWcA1ZIgsA
    55. Vector clocks show relationships between objects: whether one object is a direct descendant of the other, whether the objects are direct descendants of a common parent, or whether the objects are unrelated in recent heritage – resolving conflicts
    56. The vector clock is updated when objects are updated; last-write-wins, or conflicts can be resolved on the client side – resolving conflicts
    57. If stale responses are returned as part of the read, those replicas are updated – resolving conflicts
    58. failure conditions
    59. n = 3 – failure conditions
    60. writes & updates, n = 3 – failure conditions
    61. hinted handoff – failure conditions
    62. developing apps
    63. “Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS.” – developing apps
    64. “schema-less”: more flexibility, agility – developing apps
    65. Use case | Key | Value: Session | User/Session ID | Session Data – developing apps
    66. …adds Advertising | Campaign ID | Ad Content – developing apps
    67. …adds Logs | Date | Log File – developing apps
    68. …adds Sensor | Date, Date/Time | Updates – developing apps
    69. …adds User Data | Login, Email, UUID | User Attributes – developing apps
    70. …adds Content | Title, Integer, Etc. | Text, JSON, XML – developing apps
    71. future
    72. future
    73. future
    74. more data types
    75. • counters • sets: server-side structure and conflict resolution policy – more data types
    76. “there is little reason to forfeit C or A when the system is not partitioned” – strong consistency
    77. • conditional writes • consistent reads – strong consistency
    78. other advanced features
    79. other advanced features
    80. • metadata • aggregation tasks • search – other advanced features
    81. • metadata • aggregation tasks • search – other advanced features
    82. other advanced features
    83. In summary…
    84. rapid evolutionary change
    85. significant events
    86. explosion of new systems
    87. evolving into higher-order systems
    88. We’re hiring. shanley@basho.com