Ricon2013 preso upload

Slides from my talk at riconwest on 2013-Oct-30. I compare Cassandra and Riak and present how, hypothetically, Riak could be used at Netflix.

Transcript

  • 1. Dynamic Dynamos: Riak and Cassandra Jason Brown, @jasobrown Senior Software Engineer, Netflix #riconwest 2013-Oct-30
  • 2. CHOICE
  • 3. Choice The whole "NoSQL movement" is really about choice. At scale there will never be a single solution that is best for everyone. @jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11 http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html
  • 4. whoami ● @Netflix, > 5 years ● Apache Cassandra committer ● wannabe distributed systems geek
  • 5. Netflix and Cassandra ● long-time Oracle shop ● Aug 2008 ● needed new db for cloud migration ● 2010 - selected Cassandra ○ dynamo-style, masterless system ○ multi-datacenter support ○ written in Java
  • 6. Netflix’s C* Prod Deployment ● Production clusters: > 65 ● Production nodes: > 2300 ● Multi-region clusters: > 40 ● Most regions used: 4 (three clusters) ● Total data: ~300 TB ● Largest cluster: 288 nodes (actually, 576 nodes) ● Max reads/writes: 300k rps / 1.3m wps
  • 7. Jason’s whiteboard, summer 2013 <image of white board here>
  • 8. Why Riak? ● Another dynamo-style system ● vector clocks ● not java ● not jvm
  • 9. Talk Structure 1. Comparisons a. Write/Read path b. Conflict resolution c. Anti-entropy d. Multiple Datacenter support 2. Riak @ Netflix
  • 10. Data modeling Riak ● key -> value Cassandra ● columnar layout ● row key with one to many columns
  • 11. Virtual Nodes ● Split hash ring into smaller chunks ● Physical node responsible for 1..n tokens Cassandra ● purely for routing Riak ● burrowed deep into the code base
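The vnode idea on this slide can be sketched in a few lines (an illustrative Python sketch, not code from either database; the hash choice and the `build_ring`/`owner` names are invented here):

```python
import hashlib
from bisect import bisect_right

def token(value: str) -> int:
    # Hash a string onto a large integer ring (MD5 chosen only for brevity).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes_per_node=8):
    # Each physical node claims several tokens; sorting them forms the ring.
    return sorted(
        (token(f"{node}:{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )

def owner(ring, key):
    # Walk clockwise from the key's token to the first vnode at or past it.
    tokens = [t for t, _ in ring]
    idx = bisect_right(tokens, token(key)) % len(ring)
    return ring[idx][1]
```

With more vnodes per physical node, adding or removing a node moves many small ranges instead of one large one, which is the operational win both systems are after.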
  • 12. Write Path - Cassandra ● Coordinator gets request ● Determine replica nodes, in all DCs ● Send to all in local DC ● Send to one replica in each remote DC ● All respond back to coordinator ○ block for consistency_level nodes ● Execute triggers
  • 13. Tunable Consistency Coordinator blocks for specified replica count to respond Consistency Levels: ● ALL ● EACH_QUORUM ● LOCAL_QUORUM / LOCAL_ONE ● ONE / TWO / THREE ● ANY
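The replica counts the coordinator blocks for can be worked out from the per-DC replication factors; a sketch covering only the levels listed above (function and argument names are illustrative, not Cassandra's internals):

```python
def quorum(rf: int) -> int:
    # A quorum is a strict majority of the replicas.
    return rf // 2 + 1

def replicas_to_block_for(level: str, rf_by_dc: dict, local_dc: str) -> int:
    # How many replica acks the coordinator waits for at each level.
    total_rf = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "QUORUM":
        return quorum(total_rf)
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError(f"unhandled level: {level}")
```

For two DCs at RF 3 each, LOCAL_QUORUM waits for 2 acks while QUORUM across the whole cluster waits for 4, which is why LOCAL_QUORUM is the usual multi-region choice.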
  • 14. Put Path - Riak ● Node gets request ● determine vnodes (preflist) ● if node not in preflist, forward ● run precommit hooks ● perform coordinating put ○ looks for previous riak object ○ increment vclock ● calls other preflist vnodes to put ● prepare return value (multiple vclocks) ● coordinator vnode runs postcommit hooks
  • 15. riak “consistency levels” ● n (n_val) - vnode replication count ● r - read count ● w - write count ○ {all | one | quorum | <int>} ● pr / pw - primary read/write count ● dw - durable write bucket defaults can be overridden on request
  • 16. That was the happy path ... what about partitions?
  • 17. Hinted Handoff - Riak Sloppy quorum ● preflist falls back to secondary vnodes ○ skip unavailable primary vnode ○ use next available vnode in ring ● Send data to vnode when it becomes available Put data is written, and available for reads
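The sloppy-quorum fallback described here amounts to a ring walk that skips down primaries (an illustrative sketch; real preflists come from Riak's ring claim in Erlang, and this assumes at least n_val nodes are up):

```python
def preflist(ring, key_idx, n_val, down):
    # Walk clockwise from the key's position; unavailable primary vnodes
    # are skipped and the next available vnodes stand in as fallbacks.
    chosen, i = [], key_idx
    while len(chosen) < n_val:
        vnode = ring[i % len(ring)]
        if vnode not in down:
            chosen.append(vnode)
        i += 1
    return chosen
```

The write still lands on n_val vnodes, so it is immediately readable; the fallbacks later hand the data back to the primaries when they return.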
  • 18. Hinted Handoff - Cassandra Coordinator stores hints ○ for unavailable nodes ○ if replica fails to respond Replay hints to node when available Mutation is stored, but data is not available for reads CL.ANY - stores a hint if no replicas are available
  • 19. Read path - Riak ● coordinator gets request ● determine preflist ● send request to all vnodes in preflist ● when read count of vnodes return ○ merge values ○ possibly read repair
  • 20. Read Repair - Riak ● compare vclocks for object ● if resolvable differences, ship newest object to out of date vnodes ● return object with latest vclock to client
  • 21. Read Path - Cassandra ● Determine replicas to invoke ○ based on consistency level ● First replica responds with full data set, others send digests ● Coordinator waits for consistency_level nodes to respond
  • 22. Consistent Read - Cassandra ● compare digest of columns from replicas ● If any mismatches: ○ re-request full data set from same replicas ○ compare full data sets, send updates ○ block until out of date replicas respond ● Return merged data set to client
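The digest-vs-full-data comparison on this slide can be illustrated like so (a sketch only; Cassandra computes digests over its internal column format, and `digest` here is an invented stand-in):

```python
import hashlib

def digest(columns) -> bytes:
    # Hash a replica's column set so replicas can be compared cheaply;
    # columns is a dict of name -> (timestamp, value).
    m = hashlib.sha256()
    for name in sorted(columns):
        ts, value = columns[name]
        m.update(f"{name}:{ts}:{value}".encode())
    return m.digest()

def digests_match(full_data, digest_responses) -> bool:
    # The read completes in one round trip only when every replica's
    # digest matches the hash of the full data set.
    return all(d == digest(full_data) for d in digest_responses)
```

Shipping one full response plus small digests saves bandwidth on the common path; the second, full-data round trip only happens when replicas disagree.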
  • 23. Read Repair - Cassandra Converge requested data across all replicas Piggy-backs on normal reads, but waits for all replicas to respond (async) Follows same algorithm as consistent reads
  • 24. Conflict resolution - Riak Vector clocks ● logical clock per object ● array of {incrementer, version, timestamp} ● maintains causal relationships ● safe in face of ‘concurrent’ writes ● performance penalty ● resolution burden pushed to caller
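A minimal vector-clock sketch (actor -> counter maps; Riak's real vclocks also carry timestamps for pruning, omitted here):

```python
def increment(vclock: dict, actor: str) -> dict:
    # Bump this actor's counter, leaving the input clock untouched.
    vc = dict(vclock)
    vc[actor] = vc.get(actor, 0) + 1
    return vc

def descends(a: dict, b: dict) -> bool:
    # True if clock a has seen every event that clock b has seen.
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither clock descends the other: a genuine conflict (siblings),
    # which is the resolution burden pushed to the caller.
    return not descends(a, b) and not descends(b, a)
```

Two clients that both start from the same object and write independently produce concurrent clocks; Riak keeps both siblings rather than silently dropping one.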
  • 25. Conflict Resolution - Cassandra Last Writer Wins ● every column has timestamp value ● “whatever timestamp caller passed in” ● “What time is it?” ○ http://aphyr.com/posts/299-the-trouble-with-timestamps ● faster ● system resolves conflicts
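Per-column last-writer-wins can be shown in a few lines (illustrative; as the slide warns, the timestamps are whatever the caller passed in, so clock skew decides ties):

```python
def lww_merge(columns_a, columns_b):
    # Per-column last-writer-wins: keep the value with the higher timestamp.
    # Columns are dicts of name -> (timestamp, value).
    merged = dict(columns_a)
    for name, (ts, value) in columns_b.items():
        if name not in merged or ts > merged[name][0]:
            merged[name] = (ts, value)
    return merged
```

This is cheap and needs no caller involvement, but a client with a fast clock can shadow newer writes, which is the aphyr critique linked above.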
  • 26. Anti-entropy Converge cold data Merkle Tree exchange Stream inconsistencies IO/CPU intensive
  • 27. Anti-entropy - Cassandra Node repair - converges ranges owned by a node with all replicas ● Initiator identifies peers ● Each participant reads range from disk, generates Merkle Tree, return MT ● Initiator compares all MTs ● Range exchange
  • 28. Anti-Entropy - Riak AAE - conceptually similar to Cassandra Merkle Tree updated on every write Leaf nodes contain keys, not hash value Tree is rebuilt periodically Each execution only between two vnodes
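The Merkle-tree exchange behind both repair mechanisms can be sketched as follows (illustrative; both systems hash key ranges rather than raw keys as here):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Hash the leaves, then hash pairs upward until one root remains.
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diverged_ranges(leaves_a, leaves_b):
    # Compare leaf hashes; only mismatched ranges need streaming.
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b))
            if h(a) != h(b)]
```

Matching roots mean two replicas agree on an entire range after exchanging one hash, which is why the trees make anti-entropy tractable despite the IO/CPU cost of building them.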
  • 29. Multi Datacenter support Cassandra ● in the box ● node interconnects are plain TCP sockets ○ two connections per node pair ● queries not restricted to local DC ○ read repair ○ node repair
  • 30. Riak ● included in RiakEE (MDC) ● local nodes use disterl ● remote nodes use TCP ● queries do not span multiple regions ● repl types: ○ realtime ○ fullsync ● AAE
  • 31. Riak @ Netflix (hypothetically)
  • 32. Subscriber data c* = wide row implementation ● row key = custId (long) ● column per distinct attribute ○ subscriberId ○ name ○ subscription details ○ holds Riak = fits reasonably well with JSON/text blob
  • 33. Movie Ratings c* implementation: ● new ratings stored in individual columns ● recurring job to aggregate into JSON blob ● reads grab JSON + incremental updates Riak = JSON blob already, append new ratings
  • 34. Viewing History Time-series of ‘viewable’ events ● one column per event ● playback/bookmark serialized JSON blob ● 7-8 months worth of playback data Riak - time-series data doesn’t feel like a natural fit
  • 35. “Large blob” storage ● Team wanted to store images in c* ● key -> one column ● blob size Right in the wheelhouse for Riak/RiakCS
  • 36. Operations Priam (https://github.com/Netflix/priam) ● backup/restore ● Cassandra bootstrap / token assignment ● configuration management ● supports multi-region deployments Every node decentralized from peers
  • 37. Priam for Riak (perceived) challenges in supporting Riak ● some degree of centralization ○ cluster launch ○ backups for eLevelDb ● prod -> test refresh (riak reip) ● MDC
  • 38. BI Integration Aegisthus - pipeline for importing into BI ● grab nightly backup snapshot for cluster ● convert to JSON ● merge, dedupe, find new data ● import into Hive, Teradata, etc Downside is (semi-) stale data into BI
  • 39. Alternative BI Integration Live, secondary cluster ● C* - just another datacenter in cluster ● Riak - MultiDataCenter (MDC) solution All mutations sent to secondary cluster ● what happens when things get slow? ● now part of c* repairs & riak full-sync
  • 40. Wrap up ● Choice ● Cassandra and Riak are great databases ○ resilient to failure ○ flexible data modeling ○ strong communities ● Running databases in the cloud ain’t easy
  • 41. Thank you, Basho! Mark Phillips, Jordan West, Joe Blomstedt, Andrew Thompson, @evanmcc, Basho Tech Support
  • 42. Q & A time @jasobrown