Ricon2013 preso upload

Slides from my talk at RICON West (#riconwest) on 2013-Oct-30. I compare Cassandra and Riak and present how, hypothetically, Riak could be used at Netflix.

Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Dynamic Dynamos: Riak and Cassandra Jason Brown, @jasobrown Senior Software Engineer, Netflix #riconwest 2013-Oct-30
    • CHOICE
    • Choice The whole "NoSQL movement" is really about choice. At scale there will never be a single solution that is best for everyone. @jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11 http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html
    • whoami ● @Netflix, > 5 years ● Apache Cassandra committer ● wannabe distributed systems geek
    • Netflix and Cassandra ● long-time Oracle shop ● Aug 2008 ● needed new db for cloud migration ● 2010 - selected cassandra ○ dynamo-style, masterless system ○ multi-datacenter support ○ written in Java
    • Netflix’s C* Prod Deployment
      ○ Production clusters: > 65
      ○ Production nodes: > 2,300
      ○ Multi-region clusters: > 40
      ○ Most regions used: 4 (three clusters)
      ○ Total data: ~300 TB
      ○ Largest cluster: 288 nodes (actually, 576 nodes)
      ○ Max reads/writes: 300k rps / 1.3m wps
    • Jason’s whiteboard, summer 2013 <image of white board here>
    • Why Riak? ● Another dynamo-style system ● vector clocks ● not java ● not jvm
    • Talk Structure 1. Comparisons a. Write/Read path b. Conflict resolution c. Anti-entropy d. Multiple Datacenter support 2. Riak @ Netflix
    • Data modeling Riak ● key -> value Cassandra ● columnar layout ● row key with one to many columns
    • Virtual Nodes ● Split hash ring into smaller chunks ● Physical node responsible for 1..n tokens Cassandra ● purely for routing Riak ● burrowed deep into the code base
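The vnode idea above can be sketched as a consistent-hash ring in Python. Everything here (the node names, MD5 as the ring hash, 8 vnodes per node) is illustrative, not either database's actual implementation:

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Map a key onto the hash ring (128-bit MD5 space here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes_per_node):
    """Give each physical node several tokens (vnodes) on the ring."""
    return sorted(
        (token(f"{node}:{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )

def owner(ring, key):
    """Walk clockwise from the key's token to the next vnode."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_right(tokens, token(key)) % len(ring)
    return ring[idx][1]

ring = build_ring(["node-a", "node-b", "node-c"], vnodes_per_node=8)
print(owner(ring, "user:42"))  # one of the three nodes
```

Splitting each node into many small vnodes is what lets both systems rebalance in small chunks when nodes join or leave.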
    • Write Path - Cassandra ● Coordinator gets request ● Determine replica nodes, in all DCs ● Send to all in local DC ● Send to one replica in each remote DC ● All respond back to coordinator ○ block for consistency_level nodes ● Execute triggers
    • Tunable Consistency Coordinator blocks for specified replica count to respond Consistency Levels: ● ALL ● EACH_QUORUM ● LOCAL_QUORUM / LOCAL_ONE ● ONE / TWO / THREE ● ANY
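How many replica acknowledgements the coordinator blocks for can be sketched as a lookup over the replication factors. This is a simplified model (the replication-factor dict and function name are hypothetical), not Cassandra's actual code:

```python
def blocked_for(consistency_level, replication_per_dc):
    """How many replica acks the coordinator waits for (sketch)."""
    local_rf = replication_per_dc["local"]
    total_rf = sum(replication_per_dc.values())
    quorum = lambda n: n // 2 + 1
    return {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "LOCAL_ONE": 1,
        "LOCAL_QUORUM": quorum(local_rf),
        "EACH_QUORUM": sum(quorum(n) for n in replication_per_dc.values()),
        "ALL": total_rf,
        "ANY": 1,  # in real Cassandra, even a stored hint satisfies ANY
    }[consistency_level]

rf = {"local": 3, "us-west": 3}
print(blocked_for("LOCAL_QUORUM", rf))  # 2
print(blocked_for("EACH_QUORUM", rf))   # 4
```

LOCAL_QUORUM is popular in multi-region deployments because it keeps the blocking replicas in the local datacenter while writes still flow everywhere.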
    • Put Path - Riak ● Node gets request ● determine vnodes (preflist) ● if node not in preflist, forward ● run precommit hooks ● perform coordinating put ○ looks for previous riak object ○ increment vclock ● calls other preflist vnodes to put ● prepare return value (multiple vclocks) ● coordinator vnode runs postcommit hooks
    • riak “consistency levels” ● n (n_val) - vnode replication count ● r - read count ● w - write count ○ {all | one | quorum | <int>} ● pr / pw - primary read/write count ● dw - durable write ● bucket defaults can be overridden on request
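The classic rule of thumb for these knobs is that overlapping quorums (r + w > n) guarantee a read sees at least one replica that took the latest write. A minimal check, ignoring sloppy-quorum fallbacks:

```python
def is_read_your_writes(n, r, w):
    """Overlapping read/write quorums force the read set to
    intersect the write set (ignores sloppy-quorum fallbacks)."""
    return r + w > n

assert is_read_your_writes(n=3, r=2, w=2)      # quorum/quorum overlaps
assert not is_read_your_writes(n=3, r=1, w=1)  # fast, but may read stale
```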
    • That was the happy path ... what about partitions?
    • Hinted Handoff - Riak Sloppy quorum ● preflist fallbacks to secondary vnodes ○ skip unavailable primary vnode ○ use next available vnode in ring ● Send data to vnode when available Put data is written, and available for reads
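The sloppy-quorum fallback can be sketched as a walk around the ring that skips unavailable primaries. The function and vnode names are illustrative only:

```python
def preflist(ring, start, n, down):
    """Walk the ring from `start`, skipping unavailable primary
    vnodes and falling back to the next reachable ones."""
    assert len([v for v in ring if v not in down]) >= n, "not enough vnodes up"
    chosen = []
    i = start
    while len(chosen) < n:
        vnode = ring[i % len(ring)]
        if vnode not in down:
            chosen.append(vnode)
        i += 1
    return chosen

ring = ["v0", "v1", "v2", "v3", "v4"]
print(preflist(ring, start=1, n=3, down={"v2"}))  # ['v1', 'v3', 'v4']
```

Because the fallback vnode accepts the write immediately, the data is readable right away; it is handed back to the primary once that vnode recovers.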
    • Hinted Handoff - Cassandra Coordinator stores hints ○ for unavailable nodes ○ if replica fails to respond Replay hints to node when available Mutation is stored, but data not read available CL.ANY - stores a hint if no replicas available
    • Read path - Riak ● coordinator gets request ● determine preflist ● send request to all vnodes in preflist ● when read count of vnodes return ○ merge values ○ possibly read repair
    • Read Repair - Riak ● compare vclocks for object ● if resolvable differences, ship newest object to out of date vnodes ● return object with latest vclock to client
    • Read Path - Cassandra ● Determine replicas to invoke ○ based on consistency level ● First replica responds with full data set, others send digests ● Coordinator waits for consistency_level nodes to respond
    • Consistent Read - Cassandra ● compare digest of columns from replicas ● If any mismatches: ○ re-request full data set from same replicas ○ compare full data sets, send updates ○ block until out of date replicas respond ● Return merged data set to client
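The digest trick is that only one replica ships the full column set; the rest ship a hash of theirs, which is cheap to compare. A toy version (the column tuples and hashing scheme are assumptions for illustration):

```python
import hashlib

def digest(columns):
    """Hash a replica's column set so it can be compared cheaply."""
    h = hashlib.sha256()
    for name, value, ts in sorted(columns):
        h.update(f"{name}={value}@{ts}".encode())
    return h.hexdigest()

full = [("name", "Jason", 100), ("city", "LA", 90)]
stale = [("name", "Jason", 100), ("city", "SF", 80)]

# First replica returns full data, others return digests only.
mismatch = digest(full) != digest(stale)
print(mismatch)  # True -> re-request full data and reconcile
```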
    • Read Repair - Cassandra Converge requested data across all replicas Piggy-backs on normal reads, but waits for all replicas to respond (async) Follows the same algorithm as consistent reads
    • Conflict resolution - Riak Vector clocks ● logical clock per object ● array of {incrementer, version, timestamp} ● maintains causal relationships ● safe in face of ‘concurrent’ writes ● performance penalty ● resolution burden pushed to caller
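The causal comparison boils down to one question: does one clock dominate the other? If neither does, the writes were concurrent and both siblings go back to the caller. A minimal sketch (dict-based clocks and the `resolve` helper are illustrative, not Riak's types):

```python
def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def resolve(a, b):
    """Return the newer object, or both siblings on a true conflict."""
    if descends(a["vclock"], b["vclock"]):
        return [a]
    if descends(b["vclock"], a["vclock"]):
        return [b]
    return [a, b]  # concurrent writes: caller must resolve

v1 = {"value": "x", "vclock": {"n1": 1}}
v2 = {"value": "y", "vclock": {"n1": 1, "n2": 1}}
v3 = {"value": "z", "vclock": {"n1": 2}}

print(len(resolve(v1, v2)))  # 1 -> v2 descends from v1
print(len(resolve(v2, v3)))  # 2 -> concurrent siblings
```

The "resolution burden pushed to caller" bullet is exactly the third branch: the application, not the database, decides how to merge siblings.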
    • Conflict Resolution - Cassandra Last Writer Wins ● every column has timestamp value ● “whatever timestamp caller passed in” ● “What time is it?” ○ http://aphyr.com/posts/299-the-trouble-with-timestamps ● faster ● system resolves conflicts
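Last-writer-wins resolution is per column, not per row, which is why it is both fast and trivially automatic. A toy merge, assuming columns stored as (value, timestamp) pairs:

```python
def lww_merge(*versions):
    """Keep, per column, the value with the highest timestamp."""
    merged = {}
    for version in versions:
        for column, (value, ts) in version.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged

replica_a = {"email": ("old@x.com", 100)}
replica_b = {"email": ("new@x.com", 200)}
print(lww_merge(replica_a, replica_b))  # {'email': ('new@x.com', 200)}
```

The aphyr post linked above is about the catch: since the caller supplies the timestamp, clock skew between clients can silently drop writes.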
    • Anti-entropy Converge cold data Merkle Tree exchange Stream inconsistencies IO/CPU intensive
    • Anti-entropy - Cassandra Node repair - converges ranges owned by a node with all replicas ● Initiator identifies peers ● Each participant reads range from disk, generates Merkle Tree, return MT ● Initiator compares all MTs ● Range exchange
    • Anti-Entropy - Riak AAE - conceptually similar to Cassandra Merkle Tree updated on every write Leaf nodes contain keys, not hash value Tree is rebuilt periodically Each execution only between two vnodes
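Both systems lean on the same Merkle-tree property: if two roots match, the whole range matches; if not, you descend only into the differing subtrees instead of streaming everything. A minimal sketch of the root computation (the key/value byte strings are made up):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash pairs of nodes level by level until one root remains."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd tail node
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

local = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
remote = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]

print(merkle_root(local) == merkle_root(remote))  # False -> repair k2
```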
    • Multi Datacenter support Cassandra ● in the box ● node interconnects are plain TCP sockets ○ two connections per node pair ● queries not restricted to local DC ○ read repair ○ node repair
    • Riak ● included in RiakEE (MDC) ● local nodes use disterl ● remote nodes use TCP ● queries do not span multiple regions ● repl types: ○ realtime ○ fullsync ● AAE
    • Riak @ Netflix (hypothetically)
    • Subscriber data c* = wide row implementation ● row key = custId (long) ● column per distinct attribute ○ subscriberId ○ name ○ subscription details ○ holds riak = fit reasonably well with JSON/text blob
    • Movie Ratings c* implementation: ● new ratings stored in individual columns ● recurring job to aggregate into JSON blob ● reads grab JSON + incremental updates Riak = JSON blob already, append new ratings
    • Viewing History Time-series of ‘viewable’ events ● one column per event ● playback/bookmark serialized JSON blob ● 7-8 months worth of playback data Riak - time-series data doesn’t feel like a natural fit
    • “Large blob” storage ● Team wanted to store images in c* ● key -> one column ● blob size Right in the wheelhouse for Riak/RiakCS
    • Operations Priam (https://github.com/Netflix/priam) ● backup/restore ● Cassandra bootstrap / token assignment ● configuration management ● supports multi-region deployments Every node decentralized from peers
    • Priam for Riak (perceived) challenges in supporting Riak ● some degree of centralization ○ cluster launch ○ backups for eLevelDb ● prod -> test refresh (riak reip) ● MDC
    • BI Integration Aegisthus - pipeline for importing into BI ● grab nightly backup snapshot for cluster ● convert to JSON ● merge, dedupe, find new data ● import into Hive, Teradata, etc Downside is (semi-) stale data into BI
    • Alternative BI Integration Live, secondary cluster ● C* - just another datacenter in cluster ● Riak - MultiDataCenter (MDC) solution All mutations sent to secondary cluster ● what happens when things get slow? ● now part of c* repairs & riak full-sync
    • Wrap up ● Choice ● Cassandra and Riak are great databases ○ resilient to failure ○ flexible data modeling ○ strong communities ● Running databases in the cloud ain’t easy
    • Thank you, Basho! Mark Phillips, Jordan West, Joe Blomstedt, Andrew Thompson, @evanmcc, Basho Tech Support
    • Q & A time @jasobrown