The Dynamo paper started a revolution in distributed systems. The contributions from this paper are still impacting the design and practices of some of the world's largest distributed systems, including those at Amazon.com and beyond. Building distributed systems is hard, but our goal in this session is to simplify the complexity of this topic to empower the hacker in you! Have you been bitten by the eventual consistency bug lately? We show you how to tame eventual consistency and make it a great scaling asset. As you scale up, you must be ready to deal with node, rack, and data center failure. We share insights on how to limit the blast radius of the individual components of your system, battle-tested techniques for simulating failures (network partitions, data center failure), and how we used core distributed systems fundamentals to build highly scalable, performant, durable, and resilient systems. Come watch us uncover the secret sauce behind Amazon DynamoDB, Amazon SQS, Amazon SNS, and the fundamental tenets that define them as Internet-scale services. To turn this session into a hacker's dream, we go over design and implementation practices you can follow to build an application with virtually limitless scalability on AWS within an hour. We even share insights and secret tips on how to make the most out of one of the services released during the morning keynote.
NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013
1. SPOT 401 - Leading the NoSQL Revolution: under the covers of Distributed Systems @ scale
@swami_79
@ksshams
2. what are we covering?
The evolution of large scale distributed systems @ Amazon from the 90’s to today
The lessons we learned and insights you can employ in your own distributed systems
3. let’s start with a story about a little company called amazon.com
21. amazon dynamo
predecessor to dynamoDB
replicated DHT with consistent hashing
optimistic replication
“sloppy quorum”
anti-entropy mechanism
object versioning
specialist tool:
• limited querying capabilities
• simpler consistency
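The "replicated DHT with consistent hashing" bullet can be illustrated with a minimal sketch. This is not Dynamo's implementation: the class name `HashRing`, the use of MD5, the 64 virtual nodes per machine, and the node names are all assumptions made for the example. It shows the two properties the slide relies on: each key maps to an ordered list of distinct replicas, and adding a node only moves the keys that the new node takes over.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, hypothetical API)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def preference_list(self, key: str, n: int = 3) -> list:
        """Walk clockwise from the key and collect the first n distinct nodes."""
        if not self._points:
            return []
        start = bisect.bisect_right(self._points, _hash(key))
        result = []
        for step in range(len(self._points)):
            point = self._points[(start + step) % len(self._points)]
            node = self._owner[point]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.preference_list("cart-42", n=3))
```

Virtual nodes are used here so that a newly added machine takes a small slice from every existing machine instead of splitting a single neighbour's range.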
22. dynamo had many benefits
• higher availability
• we traded it off for eventual consistency
• incremental scalability
• no more repartitioning
• no need to architect apps for peak
• just add boxes
• simpler querying model ==>> predictable performance
23. but dynamo was not perfect...
lacked strong consistency
24. but dynamo was not perfect...
scaling was easier, but...
25. but dynamo was not perfect...
steep learning curve
26. but dynamo was not perfect...
dynamo was a product ... ==>> not a service...
28. DynamoDB
• NoSQL database
• fast & predictable performance
• seamless scalability
• easy administration
“Even though we have years of experience with large, complex NoSQL architectures, we are happy to be finally out of the business of managing it ourselves.” - Don MacAskill, CEO
34. DynamoDB Goals and Philosophies
never compromise on durability
scale is our problem
easy to use
consistent and low latencies
scale in rps
35. how to build these large scale services?
40. Fault tolerant design is key..
• Everything fails all the time
• Planning for failures is not easy
• How do you ensure your recovery strategies work correctly?
45. Not so easy..
[Diagram: Replicas A through F, with a new member joining the group. Writes arrive from client A; reads and writes arrive from client B. A replica asks: "Should I continue to serve reads? Should I start a new quorum?"]
Classic Split Brain Issue in Replicated systems leading to lost writes!
46. Building correct distributed systems is not straightforward..
• How do you handle replica failures?
• How do you ensure there is not a parallel quorum?
• How do you handle partial failures of replicas?
• How do you handle concurrent failures?
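The "parallel quorum" question can be made concrete with a little arithmetic. The sketch below is one common guard, not necessarily what the speakers built: it assumes a fixed, agreed membership of `n` replicas (which is exactly what a joining member puts in question), and all function names are hypothetical. With the six replicas of the split-brain diagram, a write quorum of 3 lets both halves of a 3/3 partition accept writes, while a majority quorum of 4 does not.

```python
def read_write_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def parallel_write_quorums_possible(n: int, w: int) -> bool:
    """True when two disjoint groups of w replicas fit inside n replicas."""
    return 2 * w <= n


def majority(n: int) -> int:
    """Smallest group size that cannot exist twice in the same membership."""
    return n // 2 + 1


def side_may_accept_writes(reachable: int, n: int) -> bool:
    """A partition side may keep writing only if it can still see a majority."""
    return reachable >= majority(n)


if __name__ == "__main__":
    n = 6  # Replica A..F from the diagram
    print(parallel_write_quorums_possible(n, w=3))            # split brain is possible
    print(parallel_write_quorums_possible(n, w=majority(n)))  # majority rules it out
    print(side_may_accept_writes(reachable=3, n=n))           # neither 3/3 side may write
```

The cost of the majority rule is availability: in an even split, neither side accepts writes until the partition heals.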
57. simulate failures at unit test level
fault injection testing
scale testing
embrace failure and don’t be surprised
datacenter testing
network brown out testing
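Simulating failures at unit test level can be as small as a test double that fails on demand. The sketch below is an assumption about what such a test might look like, not the speakers' tooling: `FaultInjectingStore`, `InjectedFault`, and `put_with_retry` are hypothetical names. The store fails a scripted number of calls so the test is deterministic, and the code under test is a simple retry loop.

```python
class InjectedFault(Exception):
    """Raised by the test double to simulate a failed dependency call."""


class FaultInjectingStore:
    """In-memory store that fails the first `failures_before_success` puts."""

    def __init__(self, failures_before_success: int = 0):
        self._data = {}
        self._remaining_failures = failures_before_success
        self.calls = 0

    def put(self, key, value):
        self.calls += 1
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise InjectedFault(f"injected failure on put({key!r})")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def put_with_retry(store, key, value, attempts: int = 3) -> int:
    """Try the write up to `attempts` times; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        try:
            store.put(key, value)
            return attempt
        except InjectedFault:
            if attempt == attempts:
                raise  # recovery budget exhausted: surface the failure


if __name__ == "__main__":
    store = FaultInjectingStore(failures_before_success=2)
    print(put_with_retry(store, "order-1", "paid"), store.get("order-1"))
```

Scripting the failures, rather than drawing them at random, keeps the recovery path reproducible when a test fails.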
70. such a service is so much more useful than just leader election..
it became a distributed state store
71. such a service is so much more useful than just leader election..
or a distributed state store
wait wait.. you’re telling me if I poll, I can detect node failure?
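A minimal sketch of failure detection by polling, under the assumption that each node periodically records a heartbeat in the shared state store and that a poller treats a node as failed once its last heartbeat is older than a timeout. The slides do not describe the actual mechanism; `HeartbeatRegistry` and its methods are hypothetical. Time is passed in explicitly so the behaviour is easy to test.

```python
class HeartbeatRegistry:
    """Tracks the last heartbeat per node; stands in for the shared state store."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self._last_seen = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was alive at time `now`."""
        self._last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is older than the timeout, as seen by a poller."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    registry = HeartbeatRegistry(timeout_seconds=5)
    registry.heartbeat("node-a", now=0)
    registry.heartbeat("node-b", now=0)
    registry.heartbeat("node-a", now=4)  # node-b stops reporting
    print(registry.failed_nodes(now=6))
```

Every poller adds load on the state store, which is worth keeping in mind as the number of polling nodes grows.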
72. we acted quickly - and scaled up our entire fleet with more nodes
doh!!!!
we slowed consensus...
82. Real-time tweet analytics using DynamoDB
• Stream from Kinesis to DynamoDB
• What data do you want in real-time?
• (per-second, top words)
• How does DynamoDB help?
• Atomic counters (per-word counts in that second)
• Indexed queries (top N word-counts in that second)
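The atomic-counter bullet can be sketched as the parameters of a single DynamoDB UpdateItem call that adds to a number in place, so concurrent writers never need a read-modify-write cycle. This is a sketch, not the demo's code: the helper name `increment_request` is hypothetical, the attribute names `Time`, `Word`, and `Count` are taken from the WordCount table slide, and the example uses the expression syntax of the current API. `Count` is a DynamoDB reserved word, so it is referenced through a placeholder.

```python
def increment_request(time_bucket: str, word: str, amount: int = 1) -> dict:
    """Build UpdateItem parameters that atomically add `amount` to a word's count."""
    return {
        "Key": {"Time": time_bucket, "Word": word},
        # ADD creates the attribute if it is missing, then increments it server-side.
        "UpdateExpression": "ADD #count :inc",
        "ExpressionAttributeNames": {"#count": "Count"},
        "ExpressionAttributeValues": {":inc": amount},
        "ReturnValues": "UPDATED_NEW",
    }


if __name__ == "__main__":
    # With boto3 and AWS credentials configured, this could be sent as:
    #   boto3.resource("dynamodb").Table("WordCount").update_item(**params)
    params = increment_request("2013-10-13T12:00", "Earth")
    print(params["UpdateExpression"], params["ExpressionAttributeValues"])
```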
83. WordCount Table
Time              Word   Count
2013-10-13T12:00  Earth  9
2013-10-13T12:00  Mars   10
2013-10-13T12:00  Pluto  5
2013-10-13T12:03  Earth  8
Local Secondary Index
Time              Count  Word
2013-10-13T12:00  5      Pluto
2013-10-13T12:00  9      Earth
2013-10-13T12:00  10     Mars
2013-10-13T12:03  8      Earth
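To see why the index makes "top N words" cheap, the sketch below models the two layouts in memory using the rows from the slide. It is an illustration, not DynamoDB code: `build_index` and `top_words` are hypothetical helpers. Because the index keeps each time bucket sorted by count, the top N entries are simply the last N of that bucket. In DynamoDB itself this corresponds to a Query on the index in descending order with a limit.

```python
# Rows from the WordCount slide: (Time, Word, Count).
ROWS = [
    ("2013-10-13T12:00", "Earth", 9),
    ("2013-10-13T12:00", "Mars", 10),
    ("2013-10-13T12:00", "Pluto", 5),
    ("2013-10-13T12:03", "Earth", 8),
]


def build_index(rows):
    """Re-sort the rows by (Time, Count), the way the index slide orders them."""
    return sorted(((t, c, w) for t, w, c in rows), key=lambda r: (r[0], r[1]))


def top_words(index, time_bucket, n):
    """Return the n highest-count (word, count) pairs for one time bucket."""
    bucket = [(w, c) for t, c, w in index if t == time_bucket]
    return bucket[::-1][:n]  # ascending by count, so read the bucket backwards


if __name__ == "__main__":
    index = build_index(ROWS)
    print(top_words(index, "2013-10-13T12:00", 2))  # [('Mars', 10), ('Earth', 9)]
```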
86. Aggregate queries using Redshift
• Simple Redshift connector (buffer files, store in s3, call copy command)
• Manifest copy connector
• 2 streams
• transaction table for deduplication
• manifest copy
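A manifest copy can be sketched as two small pieces: a JSON manifest listing the buffered S3 files, and a Redshift COPY statement that points at the manifest instead of at a key prefix. This is a rough sketch under assumptions, not the connector's code: the helper names, table name, bucket, and IAM role ARN are all placeholders, and the deduplicating transaction table from the slide is not modelled.

```python
import json


def build_manifest(s3_urls) -> str:
    """Render a Redshift COPY manifest that lists exactly the files to load."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def copy_statement(table: str, manifest_url: str, iam_role: str) -> str:
    """Build the COPY command for a manifest (inputs are trusted placeholders here)."""
    return f"COPY {table} FROM '{manifest_url}' IAM_ROLE '{iam_role}' MANIFEST;"


if __name__ == "__main__":
    files = [
        "s3://example-bucket/buffer/part-0001",  # hypothetical buffered files
        "s3://example-bucket/buffer/part-0002",
    ]
    print(build_manifest(files))
    print(copy_statement(
        "word_counts",                                         # hypothetical table
        "s3://example-bucket/manifests/batch-0001.manifest",   # hypothetical path
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder ARN
    ))
```

Marking each entry as mandatory makes the load fail if a listed file is missing, rather than silently loading a partial batch.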
87. Right tool for right job…
• Canal -> DynamoDB -> Redshift -> Glacier…
88. You are not done yet..
• Listen to customer feedback
• Iterate..
89. Example: DynamoDB
• Start with immediate needs of reliable, super scalable, low latency datastore
• Iterate
• Developers wanted flexible query: Local Secondary Indexes
• Developers wanted parallel loads: Parallel Scans
• Mobile developers wanted direct access to their datastore: Fine-grained Access Control
• Mobile developers wanted geo-awareness: Geospatial library
• Developers wanted DynamoDB on their laptop: DynamoDB Local
• Developers wanted richer query: Global Secondary Indexes
• We will continue to innovate..
90. Sacred Tenets in Distributed Systems
don’t compromise durability for performance
plan for success – plan for scalability
plan for failures - fault tolerance is key
consistent performance is important
release - think of blast radius
insist on correctness