
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That Power Our Platform | AWS re:Invent 2014



AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, a reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.
In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives (DHTs, transactional logging, materialized views, and various other deep distributed systems concepts), how their design evolved over time, and how we continue to scale them to AWS.

Published in: Technology


  1. November 13, 2014 | Las Vegas, NV | Al Vermeulen and Swami Sivasubramanian
  2. Trend #1: The race between computing power and expectations. Computing systems keep getting more capable, but expectations are going up even faster.
  3. Trend #2: Every application is a distributed app. The number of computers is going up fast. Many applications are distributed. Distributed systems is not a niche field anymore.
  4. Hardware and software trends: specialized (expensive) hardware with built-in redundancy and simple software, versus commodity hardware with smarter software.
  5. Trend #3: Commodity hardware and smarter software. With commodity hardware and scale, server failures are inevitable! But by using smarter software, we can build more robust systems.
  6. [Diagram; labels: Online (x5)]
  7. [Diagram; labels: Online (x5)]
  8. [Diagram; labels: Online (x5)]
  9. Cloud – Elasticity is the new normal. This results in fleets being dynamic.
  10. Our World
  11. Challenges
  12. Addressing distributed computing challenges: primitives
  13. Core Distributed Systems Primitives: Group Membership, Discovery, Metadata Store, Failure Detection, Workflows
  14. Group Membership
  15. Group Membership – Amazon RDS Multi-AZ; Amazon ElastiCache group (list of caches in a Memcached group)
  16. An example: Replicas A through F, with writes from client A and reads and writes from client B. A new member joins the group. Should I continue to serve reads? Should I start a new quorum? This is the classic split-brain issue in replicated systems, leading to lost writes!
  17. Group Membership Fundamentals: adding a new member to the group; removing a member from the group; discovering when the group membership changes; discovering roles within the group
  18. Discovery
  19. Discovery
  20. Discovery – Configuration File
  21. Discovery – DNS
  22. Discovery – DNS (cons)
  23. Discovery – Gossip Protocol
  24. Discovery – Gossip Protocol (cons)
  25. Discovery – Metadata store/consensus: Amazon DynamoDB
  26. Metadata Store
  27. Metadata Store
  28. Metadata Store – What are good characteristics? Simplicity, availability, scalability. Amazon DynamoDB: top choice for metadata storage at Amazon
  29. Metadata Store – Lessons Learned
  30. Failure Detection
  31. Failure Detection – Challenges
  32. Failure Detection – Techniques
  33. Failure Detection – Lessons Learned
  34. Workflows
  35. Workflow – What is it? To execute a series of actions asynchronously
  36. What is a workflow?
  37. What is not a workflow? synchronous, asynchronous
  38. Workflow – A simple script
  39. Workflow – Recommended Approach: Activity 1, Activity 2, Activity 3, Activity 4
  40. Workflow – Lessons Learned: Idempotent metadata; Amazon Simple Workflow Service
  41. What is the underlying problem? Group Membership, Discovery, Metadata Store, Failure Detection, Workflows
  42. Consensus
  43. Paxos and consensus: single point of failure; Paxos at the bottom; broken
  44. Consensus – Lessons Learned
  45. consensus..
  46. Paxos at Amazon
  47. Group Membership, Discovery, Metadata Store, Failure Detection, Workflows, Lock Management, Amazon Kinesis, Amazon DynamoDB Streams, ???
  48. Summary
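
The split-brain example on slide 16 asks when a replica should keep serving and when a new quorum may form. One common guard, sketched here in Python, is to accept writes only while a strict majority of the current membership view is reachable. This is an illustrative sketch and not the mechanism the speakers describe: `MembershipView`, `has_quorum`, `may_accept_writes`, the epoch field, and the partition layouts are all hypothetical names and assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MembershipView:
    """Versioned snapshot of the replica group (hypothetical structure)."""
    epoch: int
    members: frozenset


def has_quorum(view, reachable):
    """True when the reachable members of the view form a strict majority."""
    alive = len(view.members & set(reachable))
    return alive > len(view.members) // 2


def may_accept_writes(view, reachable, latest_epoch):
    """Accept writes only on the newest known view, and only with a majority."""
    return view.epoch == latest_epoch and has_quorum(view, reachable)


# Six replicas, named after the slide; the partition layouts are assumptions.
view = MembershipView(epoch=1, members=frozenset("ABCDEF"))

even_split = (set("ABC"), set("DEF"))
uneven_split = (set("ABCD"), set("EF"))

for left, right in (even_split, uneven_split):
    print(sorted(left), may_accept_writes(view, left, latest_epoch=1),
          sorted(right), may_accept_writes(view, right, latest_epoch=1))
```

Because two disjoint sets of replicas cannot both hold a strict majority of the same view, at most one side of a partition keeps accepting writes. The epoch check is there so that a replica still holding an outdated view stops writing once it learns that membership has changed.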