Riak and Dynamo, Five Years Later

Video and slides synchronized, mp3 and slide download available at http://bit.ly/15zYFnU.

Andy Gross reflects on five years of involvement with Riak and distributed databases and discusses what went right, what went wrong, and what the next five years may hold for Riak. Filmed at qconlondon.com.

Andy Gross is a distributed systems nerd, co-creator of Riak and Webmachine, and Chief Architect at Basho Technologies. Before Basho, Andy hacked on various distributed systems at Apple, Akamai, and Mochi Media. Twitter: @argv0 Blog: blog.basho.com

Presentation Transcript

• Dynamo, Five Years Later
  Andy Gross, Chief Architect, Basho Technologies
  QCon London 2013
• InfoQ.com: News & Community Site
  • 750,000 unique visitors/month
  • Published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese)
  • Post content from our QCon conferences
  • News 15-20 / week; Articles 3-4 / week; Presentations (videos) 12-15 / week; Interviews 2-3 / week; Books 1 / month
  Watch the video with slide synchronization on InfoQ.com!
  http://www.infoq.com/presentations/riak-dynamo
• Presented at QCon London (www.qconlondon.com)
  • Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
  • Strategy: a practitioner-driven conference designed for YOU, the influencers of change and innovation in your teams; speakers and topics driving evolution and innovation; connecting and catalyzing the influencers and innovators
  • Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide
• Dynamo
  • Published October 2007 @ SOSP
  • Describes a collection of distributed systems techniques applied to low-latency key-value storage
  • Spawned (along with BigTable) many imitators and an industry (LinkedIn -> Voldemort, Facebook -> Cassandra)
  • Authors nearly got fired from Amazon for publishing
• Riak - A Dynamo Clone
  • First lines of the first prototype written in Fall 2007, on a plane on the way to my Basho interview
  • "Technical debt" is another term we use at Basho for this code
  • Mostly Erlang with some C/C++
  • Apache 2 licensed
  • First release in 2009; 1.3 released 2/21/13
• Basho
  • Founded late 2007 by ex-Akamai people
  • Currently ~120 employees, distributed, with offices in Cambridge, San Francisco, London, and Tokyo
  • We sponsor open source Riak
  • We sell Riak Enterprise (Riak + multi-datacenter replication)
  • We sell Riak CS (an S3 clone backed by Riak Enterprise)
• Principles
  • Always-writable
  • Incrementally scalable
  • Symmetrical
  • Decentralized
  • Heterogeneous
  • Focus on SLAs and tail latency
• Techniques
  • Consistent Hashing
  • Vector Clocks
  • Read Repair
  • Anti-Entropy
  • Hinted Handoff
  • Gossip Protocol
• Consistent Hashing
  • Invented by Danny Lewin and others @ MIT/Akamai
  • Minimizes remapping of keys when the number of hash slots changes
  • Originally applied to CDNs; used in Dynamo for replica placement
  • Enables incremental scalability and an even spread
  • Minimizes hot spots
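
To make the remapping property concrete, here is a minimal Erlang sketch (Riak itself is mostly Erlang). The module name, the fixed 64-partition ring, and round-robin ownership are illustrative assumptions for this sketch, not Riak's riak_core internals:

```erlang
%% Minimal consistent-hashing sketch (illustrative, not riak_core):
%% 64 equal partitions of the 160-bit SHA-1 space, assigned
%% round-robin to nodes.
-module(chash_sketch).
-export([ring/1, owner/2]).

-define(NUM_PARTITIONS, 64).

%% Build [{PartitionIndex, Node}] with indices spaced evenly
%% around the 2^160 key space.
ring(Nodes) ->
    Size = (1 bsl 160) div ?NUM_PARTITIONS,
    [{I * Size, lists:nth((I rem length(Nodes)) + 1, Nodes)}
     || I <- lists:seq(0, ?NUM_PARTITIONS - 1)].

%% A key is owned by the first partition whose index is greater
%% than the key's hash, wrapping to the front of the ring.
owner(Key, Ring) ->
    <<HashInt:160/integer>> = crypto:hash(sha, Key),
    case [N || {Idx, N} <- Ring, Idx > HashInt] of
        [Node | _] -> Node;
        []         -> element(2, hd(Ring))
    end.
```

Because partition boundaries are fixed, adding or removing a node only changes the owners of the partitions it gains or loses; keys in untouched partitions never move. That is the incremental scalability and even spread the slide refers to.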
• Vector Clocks
  • Introduced by Mattern et al. in 1988
  • Extends Lamport's timestamps (1978)
  • Each value in Dynamo is tagged with a vector clock
  • Allows detection of stale values and logical siblings
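
A minimal sketch of the mechanics, representing a clock as an orddict of {Actor, Counter} pairs; the module name is an assumption of this sketch (Riak's real implementation lives in riak_core's vclock module):

```erlang
%% Vector clock sketch: a clock is an orddict of {Actor, Counter}.
-module(vclock_sketch).
-export([increment/2, descends/2, concurrent/2]).

increment(Actor, VC) ->
    orddict:update_counter(Actor, 1, VC).

%% VC1 descends VC2 if VC1 has seen everything VC2 has: every
%% counter in VC2 is matched or exceeded in VC1.
descends(VC1, VC2) ->
    lists:all(fun({Actor, Ctr}) ->
                      case orddict:find(Actor, VC1) of
                          {ok, C} -> C >= Ctr;
                          error   -> false
                      end
              end, VC2).

%% Clocks that descend in neither direction are concurrent: the
%% values they tag are logical siblings.
concurrent(VC1, VC2) ->
    not descends(VC1, VC2) andalso not descends(VC2, VC1).
```

concurrent/2 returning true is exactly the "logical siblings" case on the slide: neither write saw the other, so neither value can be safely discarded.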
• Read Repair
  • Updates stale versions opportunistically on reads (instead of writes)
  • Pushes the system toward consistency after returning a value to the client
  • Reflects a focus on a cheap, always-available write path
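
A sketch of the coordinator's decision, with a plain integer version standing in for the vector clock comparison (a simplifying assumption; the module and reply shape are likewise illustrative):

```erlang
%% Read-repair decision, simplified: replicas reply {Node, Version,
%% Value}, with an integer Version standing in for a vector clock.
-module(read_repair_sketch).
-export([repair/1]).

%% Pick the newest value and list the replicas that returned
%% something older; the coordinator pushes the winner to them
%% *after* the client has already been answered.
repair(Replies) ->
    {Winner, Value} = lists:max([{V, Val} || {_N, V, Val} <- Replies]),
    Stale = [N || {N, V, _} <- Replies, V < Winner],
    {Value, Stale}.
```

For example, repair([{n1,1,<<"old">>}, {n2,2,<<"new">>}, {n3,2,<<"new">>}]) returns {<<"new">>, [n1]}: the client has already received <<"new">>, and n1 is repaired in the background.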
• Hinted Handoff
  • Any node can accept writes for other nodes if they're down
  • All messages include a destination
  • Data accepted by a node other than the destination is handed off when that node recovers
  • As long as a single node is alive, the cluster can accept a write
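
A sketch of how fallbacks might be chosen. The module and function shapes are assumptions of this sketch (Riak computes preference lists in riak_core_apl), and it assumes enough live nodes remain to fill the list:

```erlang
%% Fallback selection sketch: walk the ring from the key's
%% position; down primaries are replaced by the next live nodes,
%% each tagged with a hint naming the intended owner.
-module(handoff_sketch).
-export([preflist/3]).

%% RingNodes: nodes in ring order starting at the key's partition.
%% UpFun(Node) -> boolean() reports liveness.
preflist(RingNodes, N, UpFun) ->
    {Primaries, Rest} = lists:split(N, RingNodes),
    Fallbacks = [Node || Node <- Rest, UpFun(Node)],
    fill(Primaries, Fallbacks, UpFun).

fill([], _Fallbacks, _UpFun) ->
    [];
fill([P | Ps], Fallbacks, UpFun) ->
    case UpFun(P) of
        true ->
            [{P, primary} | fill(Ps, Fallbacks, UpFun)];
        false ->
            %% Data written here carries the hint {fallback_for, P}
            %% and is handed back to P when it recovers.
            [F | Fs] = Fallbacks,
            [{F, {fallback_for, P}} | fill(Ps, Fs, UpFun)]
    end.
```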
• Anti-Entropy
  • Replicas maintain a Merkle tree of keys and their versions/hashes
  • Trees are periodically exchanged with peer vnodes
  • The Merkle tree enables cheap comparison: only values with different hashes are exchanged
  • Pushes the system toward consistency
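
A two-level stand-in for the Merkle exchange: bucket the keyspace and hash each bucket, so identical bucket hashes can be skipped wholesale. Real trees are deeper, and everything here (module name, 16 buckets) is an assumption of the sketch:

```erlang
%% Two-level anti-entropy sketch: hash 16 buckets of the keyspace;
%% only buckets whose hashes differ need key-by-key comparison.
-module(aae_sketch).
-export([tree/1, diff/2]).

-define(BUCKETS, 16).

%% KVs: [{Key, ValueHash}]. Returns #{Bucket => {BucketHash, SortedKVs}}.
tree(KVs) ->
    Grouped = lists:foldl(
                fun({K, _} = KV, Acc) ->
                        B = erlang:phash2(K, ?BUCKETS),
                        maps:update_with(B, fun(L) -> [KV | L] end, [KV], Acc)
                end, #{}, KVs),
    maps:map(fun(_B, L) ->
                     Sorted = lists:sort(L),
                     {crypto:hash(sha, term_to_binary(Sorted)), Sorted}
             end, Grouped).

%% Skip buckets with identical hashes; for differing buckets,
%% return the keys on either side that must be compared individually.
diff(T1, T2) ->
    lists:usort(
      lists:append(
        [bucket_keys(T1, B) ++ bucket_keys(T2, B)
         || B <- lists:seq(0, ?BUCKETS - 1),
            bucket_hash(T1, B) =/= bucket_hash(T2, B)])).

bucket_hash(Tree, B) ->
    case maps:find(B, Tree) of
        {ok, {Hash, _}} -> Hash;
        error           -> empty
    end.

bucket_keys(Tree, B) ->
    case maps:find(B, Tree) of
        {ok, {_, KVs}} -> [K || {K, _} <- KVs];
        error          -> []
    end.
```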
• Gossip Protocol
  • Decentralized approach to managing global state
  • Trades off atomicity of state changes for decentralization
  • Volume of gossip can overwhelm networks without care
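
A toy gossip loop as an Erlang process, assuming peers are pids and ring state carries a version number (both assumptions of this sketch):

```erlang
%% Toy gossip loop: once a second, push our versioned ring state
%% to one random peer; adopt any newer state we receive.
-module(gossip_sketch).
-export([loop/2]).

%% State = {Version, Ring}; the higher Version wins.
loop(Peers, {Vsn, _Ring} = State) ->
    receive
        {gossip, {PeerVsn, _} = PeerState} when PeerVsn > Vsn ->
            loop(Peers, PeerState);
        {gossip, _OlderOrEqual} ->
            loop(Peers, State)
    after 1000 ->
            Peer = lists:nth(rand:uniform(length(Peers)), Peers),
            Peer ! {gossip, State},
            loop(Peers, State)
    end.
```

Convergence is eventual rather than atomic, and every node transmits on every tick, so total gossip volume grows with cluster and ring size: the "can overwhelm networks" caveat above.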
• Hinted Handoff, step by step
  • Node fails
  • Requests for hash("blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA") go to a fallback node
  • Node comes back
  • "Handoff": data returns to the recovered node
  • Normal operations resume
• Anatomy of a Request
  • Client calls get("blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA")
  • A Get Handler (FSM) is spawned on the coordinating node
  • hash("blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA") == partitions 10, 11, 12 on the ring
  • The coordinator forwards the get to the vnodes for those partitions
  • With R=2, the handler waits for two replies (here v1 and v2), resolves them to the newer v2, and returns v2 to the client
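
A sketch of the get handler's core loop: fan the request out to the preference list and reply once R responses have arrived. Message shapes and the 5-second timeout are assumptions of this sketch; the real riak_kv_get_fsm also handles per-state timeouts, failure modes, and feeds late replies into read repair:

```erlang
%% Quorum-read coordinator sketch: send the get to every replica
%% process, answer after R replies.
-module(get_fsm_sketch).
-export([get/3]).

get(PreflistPids, Key, R) ->
    Ref = make_ref(),
    Self = self(),
    [Pid ! {get, Key, Self, Ref} || Pid <- PreflistPids],
    wait_for_r(Ref, R, []).

%% Collect replies until R have arrived; later replies are ignored
%% here, but would feed read repair in the real system.
wait_for_r(_Ref, 0, Replies) ->
    {ok, Replies};
wait_for_r(Ref, R, Replies) ->
    receive
        {Ref, Reply} ->
            wait_for_r(Ref, R - 1, [Reply | Replies])
    after 5000 ->
            {error, timeout, Replies}
    end.
```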
• Read Repair, step by step
  • Same request with R=2: the coordinator returns v2 to the client while a third replica still holds the stale v1
  • After replying, the coordinator pushes v2 to the stale replica; all three replicas now hold v2
• Riak Architecture (layered)
  • Erlang/OTP Runtime
  • Riak Core: consistent hashing, membership, handoff, node-liveness, gossip, vnode master, vnodes
  • Riak KV: client APIs (HTTP, Protocol Buffers, Erlang local client), request coordination (get, put, delete, map-reduce), buckets, storage backend, JS runtime
• Problems with Dynamo
  • Eventual consistency is a harsh mistress
  • Pushes conflict resolution to clients
  • Key/value data types are limited in use
  • Random replica placement destroys locality
  • Gossip protocol can limit cluster size
  • R+W > N is NOT more consistent
  • TCP incast
• Key-Value Conflict Resolution
  • Forcing clients to resolve consistency issues on read is a pain for developers
  • Most end up choosing the server-enforced last-write-wins policy
  • With many language clients, the logic must be implemented many times
  • One solution: https://github.com/bumptech/montage
  • Another: make everything immutable
  • Another: CRDTs
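
For illustration, one resolver a client might ship: set-union for cart-like values. The module is hypothetical; the slide's pain point is that every client library needs something like this, and handling deletes correctly forces tombstones on top of it:

```erlang
%% Client-side sibling resolution sketch: for set-like values
%% (e.g. a shopping cart), merge siblings by union.
-module(resolver_sketch).
-export([merge_siblings/1]).

merge_siblings(Siblings) ->
    ordsets:to_list(
      ordsets:union([ordsets:from_list(S) || S <- Siblings])).
```

For example, merge_siblings([["milk","eggs"], ["milk","beer"]]) yields ["beer","eggs","milk"]: both concurrent writes survive, but a removed item would silently reappear without tombstones.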
• Optimize for Immutability
  • "Accountants don't use erasers" - Pat Helland
  • Eventual consistency is *great* for immutable data
  • Conflicts become a non-issue if data never changes: no need for full quorums or vector clocks, and backend optimizations become possible
  • The problem space shifts to distributed GC... which is very hard, but not the user's problem anymore
• CRDTs
  • Conflict-free (or Commutative) Replicated Data Types
  • A server-side structure and conflict-resolution policy for richer datatypes, like counters and sets, amenable to eventual consistency
  • Letia et al. (2009), "CRDTs: Consistency without concurrency control": http://hal.inria.fr/inria-00397981/en
  • Prototype here: http://github.com/basho/riak_dt
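
The canonical starter CRDT is a grow-only counter. This sketch (not the riak_dt code) shows why it converges: increments are per-actor, and merge takes the per-actor maximum, which is commutative, associative, and idempotent:

```erlang
%% G-counter sketch: one slot per actor; merge takes the per-actor
%% max, so replicas converge regardless of delivery order or
%% duplicated messages.
-module(gcounter_sketch).
-export([new/0, increment/2, value/1, merge/2]).

new() -> orddict:new().

increment(Actor, Counter) ->
    orddict:update_counter(Actor, 1, Counter).

value(Counter) ->
    lists:sum([V || {_Actor, V} <- Counter]).

%% Keys present on only one side are kept as-is; shared keys take
%% the maximum of the two counts.
merge(C1, C2) ->
    orddict:merge(fun(_Actor, V1, V2) -> max(V1, V2) end, C1, C2).
```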
• Random Placement and Locality
  • By default, keys are randomly placed on different replicas
  • But we have buckets! Containers imply cheap iteration/enumeration, but with random placement that becomes an expensive full scan
  • Partial solution: a hash function defined per-bucket can increase locality
  • Lots of work done to minimize the impact of bucket listings
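
A sketch of the per-bucket trick: hash only the bucket component of the key, so every object in the bucket lands on one preference list. The function and clause structure are assumptions of this sketch:

```erlang
%% Per-bucket placement sketch: hashing only the Bucket part of
%% {Bucket, Key} colocates all of a bucket's keys on one preflist,
%% making bucket listing cheap.
-module(locality_sketch).
-export([hash/1]).

hash({Bucket, _Key}) ->
    crypto:hash(sha, Bucket);        %% all keys in Bucket colocate
hash(Key) when is_binary(Key) ->
    crypto:hash(sha, Key).           %% default: even spread
```

The tradeoff is the even spread from the consistent hashing slide: a popular bucket now concentrates its load on a single preference list.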
• (R+W > N) != Consistency
  • R and W are described in the Dynamo paper as "consistency knobs"
  • Some Basho/Riak docs still say this too! :(
  • Even if R=W=N, sloppy quorums and partial writes make reading old values possible
  • "Read-your-own-writes-if-your-writes-succeed-but-otherwise-you-have-no-idea-what-you're-going-to-read consistency (RYOWIWSBOYHNIWYGTRC)" - Joe Blomstedt
  • Solution: actual "strong" consistency
• Strong Consistency in Riak
  • CAP says you must choose C vs. A, but only during failures
  • There's no reason we can't implement both models, with different tradeoffs
  • Enable strong consistency on a per-bucket basis
  • See Joe Blomstedt's talk at RICON 2012 (http://ricon2012.com) and earlier work at http://github.com/jtuple/riak_zab
• An Aside: Probabilistically Bounded Staleness
  • Bailis et al.: http://pbs.cs.berkeley.edu
  • Example: R=W=1, 0.1ms latency at all hops
• TCP Incast
  • "You can't pour two buckets of manure into one bucket" - Scott Fritchie's grandfather
  • "Microbursts" of traffic sent to one cluster member
  • The coordinator sends a request to three replicas; all respond with a large-ish result at roughly the same time
  • The switch has to either buffer or drop packets
  • Cassandra tries to mitigate this: one replica sends data, the others send hashes. We should do this in Riak.
• What Riak Did Differently (or Wrong)
  • Screwed up the vector clock implementation
  • Actor IDs in vector clocks were client IDs, and therefore potentially unbounded
  • Size explosion resulted in huge objects and caused OOM crashes
  • Vector clock pruning resulted in false siblings
  • Fixed by forwarding to a node in the preflist, circa 1.0
• What Riak Did Differently
  • No active anti-entropy until v1.3; early versions had slow, unstable AAE
  • Node loss required reading all objects and repopulating replicas via read repair
  • OK for objects that are read often, but rarely-read objects' effective N value decreases over time
• What Riak Did Differently
  • Initial versions had an unavailability window during topology changes: nodes would claim partitions immediately, before data had been handed off
  • New versions don't change a request's preflist until all data has been handed off
  • Implemented as a 2PC-ish commit over gossip
• Riak, Beyond Dynamo
  • MapReduce
  • Search
  • Secondary indexes
  • Pre/post-commit hooks
  • Multi-DC replication
  • Riak Pipe distributed computation
  • Riak CS
• Riak CS
  • An Amazon S3 clone implemented as a proxy in front of Riak
  • Handles eventual consistency issues, object chunking, multitenancy, and the API for a much narrower use case
  • Forced us to eat our own dogfood and get serious about fixing long-standing warts
  • Drives feature development
• Riak the Product vs. Dynamo the Service
  • Dynamo had the luxury of being a service, while Riak is a product
  • Screwing things up with Riak cannot be fixed with an emergency deploy
  • Multiple platforms and packaging are challenges
  • Testing distributed systems is another talk entirely (QuickCheck FTW): http://www.erlang-factory.com/upload/presentations/514/TestFirstConstructionDistributedSystems.pdf
• Riak Core
  • Some of our best work!
  • Dynamo, abstracted: implements all the Dynamo techniques without prescribing a use case
  • Examples of Riak Core apps: Riak KV!, Riak Search, Riak Pipe
• Riak Core
  • Production deployments:
    • OpenX: several 100+-node clusters of custom Riak Core systems
    • StackMob: a proxy for mobile services implemented with Riak Core
  • Needs to be much easier to use and better documented
• Multi-Datacenter Replication
  • Intra-cluster replication in Riak is optimized for consistently low latency and high throughput
  • WAN replication needs to deal with lossy links, long fat networks, and TCP oddities
  • MDC replication has 2 phases: full-sync (per-partition Merkle comparisons) and real-time (asynchronous, driven by a post-commit hook)
  • Separate policies settable per-bucket
• Erlang
  • Still the best language for this stuff, but:
  • We mix data and control messages over Erlang message passing; switch to TCP (or uTP/UDT) for data
  • NIFs are problematic
  • VM tuning can be a dark art
  • ~90 public repos of mostly-Erlang, mostly-awesome open source: https://github.com/basho
• Other Future Directions
  • Security was not a factor in Dynamo's or Riak's design; isolating Riak increases operational complexity and cost
  • The statically sized ring is a pain
  • Explore possibilities with smarter clients
  • Support larger clusters
  • Multitenancy and tenant isolation
  • More vertical products like Riak CS
• Questions?
  @argv0
  http://www.basho.com
  http://github.com/basho
  http://docs.basho.com