Dynamic Dynamos:
Riak and Cassandra
Jason Brown, @jasobrown
Senior Software Engineer, Netflix
#riconwest 2013-Oct-30
CHOICE
Choice
The whole "NoSQL movement" is really about
choice. At scale there will never be a single
solution that is best for everyone.
@jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html
whoami
● @Netflix, > 5 years
● Apache Cassandra committer
● wannabe distributed systems geek
Netflix and Cassandra
● long-time Oracle shop
● Aug 2008
● needed new db for cloud migration
● 2010 - selected cassandra
○ dynamo-style, masterless system
○ multi-datacenter support
○ written in Java
Netflix’s C* Prod Deployment
Production clusters: > 65
Production nodes: > 2300
Multi-region clusters: > 40
Most regions used: 4 (three clusters)
Total data: ~300 TB
Largest cluster: 288 nodes (actually, 576 nodes)
Max reads/writes: 300k rps / 1.3m wps
Jason’s whiteboard, summer 2013
<image of white board here>
Why Riak?
● Another dynamo-style system
● vector clocks
● not java
● not jvm
Talk Structure
1. Comparisons
a. Write/Read path
b. Conflict resolution
c. Anti-entropy
d. Multiple Datacenter support

2. Riak @ Netflix
Data modeling
Riak
● key -> value
Cassandra
● columnar layout
● row key with one to many columns
Virtual Nodes
● Split hash ring into smaller chunks
● Physical node responsible for 1..n tokens
Cassandra
● purely for routing
Riak
● burrowed deep into the code base
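The vnode split above can be sketched in toy Python: each physical node claims several tokens on the hash ring, so ownership moves in small chunks. Node names, hash function, and vnode count here are illustrative, not taken from either codebase.

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2**32

def token(value: str) -> int:
    # deterministic position on the ring
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

def build_ring(nodes, vnodes_per_node=8):
    # each physical node contributes vnodes_per_node tokens
    return sorted((token(f"{n}:{i}"), n)
                  for n in nodes for i in range(vnodes_per_node))

def owner(ring, key: str) -> str:
    # first vnode clockwise from the key's token owns the key
    tokens = [t for t, _ in ring]
    idx = bisect_right(tokens, token(key)) % len(ring)
    return ring[idx][1]

ring = build_ring(["node-a", "node-b", "node-c"])
print(owner(ring, "cust:12345"))
```

With many small tokens per node, adding or removing a physical node rebalances many small ranges instead of one giant one.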
Write Path - Cassandra
● Coordinator gets request
● Determine replica nodes, in all DCs
● Send to all in local DC
● Send to one replica in each remote DC
● All respond back to coordinator
○ block for consistency_level nodes
● Execute triggers
Tunable Consistency
Coordinator blocks for specified replica count to
respond
Consistency Levels:
● ALL
● EACH_QUORUM
● LOCAL_QUORUM / LOCAL_ONE
● ONE / TWO / THREE
● ANY
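As a rough sketch of how many replica acks those levels translate to, assuming the standard quorum formula (rf/2 + 1) and a made-up per-DC replication map; this is the arithmetic, not Cassandra source:

```python
def quorum(rf: int) -> int:
    # majority of replicas
    return rf // 2 + 1

def blocked_for(level: str, rf_per_dc: dict) -> int:
    # how many replica responses the coordinator waits for
    total_rf = sum(rf_per_dc.values())
    local_rf = rf_per_dc["local"]
    return {
        "ALL": total_rf,
        "EACH_QUORUM": sum(quorum(rf) for rf in rf_per_dc.values()),
        "LOCAL_QUORUM": quorum(local_rf),
        "LOCAL_ONE": 1,
        "ONE": 1, "TWO": 2, "THREE": 3,
        "ANY": 1,  # a stored hint may count when no replica is up
    }[level]

rf = {"local": 3, "us-west": 3}
print(blocked_for("LOCAL_QUORUM", rf))  # 2
print(blocked_for("EACH_QUORUM", rf))   # 4
```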
Put Path - Riak
● Node gets request
● determine vnodes (preflist)
● if node not in preflist, forward
● run precommit hooks
● perform coordinating put
○ looks for previous riak object
○ increment vclock
● calls other preflist vnodes to put
● prepare return value (mult vclocks)
● coordinator vnode runs postcommit hooks
riak “consistency levels”
● n (n_val) - vnode replication count
● r - read count
● w - write count
○ {all | one | quorum | <int>}
● pr / pw - primary read/write count
● dw - durable write
bucket defaults can be overridden on request
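A hedged sketch of how the symbolic values might resolve against n_val; the mapping below is an illustration of the semantics on the slide, not Riak's actual resolver:

```python
def resolve(val, n_val: int) -> int:
    # {all | one | quorum | <int>} -> concrete vnode count
    if val == "all":
        return n_val
    if val == "one":
        return 1
    if val == "quorum":
        return n_val // 2 + 1
    return int(val)

n = 3
r = resolve("quorum", n)
w = resolve("quorum", n)
# r + w > n guarantees the read and write sets overlap on the preflist
print(r + w > n)  # True
```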
That was the happy path
...
what about partitions?
Hinted Handoff - Riak
Sloppy quorum
● preflist falls back to secondary vnodes
○ skip unavailable primary vnode
○ use next available vnode in ring
● Send data to vnode when available
Put data is written, and available for reads
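A minimal sketch of that sloppy-quorum fallback, assuming an invented ring layout: walk the ring past unavailable primaries and substitute the next available vnodes.

```python
def preflist(ring, start: int, n_val: int, up: set):
    # walk clockwise from start, skipping down vnodes,
    # until n_val available vnodes are chosen
    chosen, skipped = [], []
    size = len(ring)
    i = start
    while len(chosen) < n_val and len(chosen) + len(skipped) < size:
        vnode = ring[i % size]
        if vnode in up:
            chosen.append(vnode)
        else:
            skipped.append(vnode)  # primary skipped; handoff targets it later
        i += 1
    return chosen

ring = ["v0", "v1", "v2", "v3", "v4"]
up = {"v0", "v2", "v3", "v4"}          # v1 is partitioned away
print(preflist(ring, 0, 3, up))        # ['v0', 'v2', 'v3']
```

The put still succeeds at full quorum; when v1 comes back, its fallback hands the data off.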
Hinted Handoff - Cassandra
Coordinator stores hints
○ for unavailable nodes
○ if a replica fails to respond
Replay hints to node when available
Mutation is stored, but not available for reads
CL.ANY - stores a hint if no replicas available
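The store-and-replay flow above can be sketched with toy data structures; nothing here mirrors Cassandra's actual hint storage.

```python
from collections import defaultdict

class Coordinator:
    def __init__(self):
        self.hints = defaultdict(list)   # down node -> pending mutations

    def write(self, replicas, up, mutation):
        # send to up replicas; queue a hint for each down one
        acked = 0
        for node in replicas:
            if node in up:
                acked += 1               # pretend the replica applied it
            else:
                self.hints[node].append(mutation)
        return acked

    def replay(self, node, up):
        # deliver queued hints once the node is reachable again
        if node in up:
            return self.hints.pop(node, [])
        return []

c = Coordinator()
c.write(["n1", "n2", "n3"], up={"n1", "n2"}, mutation=("k", "v"))
print(c.replay("n3", up={"n1", "n2", "n3"}))  # [('k', 'v')]
```

Note the contrast with Riak's sloppy quorum: here the hinted mutation sits on the coordinator and is not readable until replayed.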
Read path - Riak
● coordinator gets request
● determine preflist
● send request to all vnodes in preflist
● when read count of vnodes return
○ merge values
○ possibly read repair
Read Repair - Riak
● compare vclocks for object
● if resolvable differences, ship newest object
to out of date vnodes
● return object with latest vclock to client
Read Path - Cassandra
● Determine replicas to invoke
○ based on consistency level

● First replica responds with full data set, others send digests
● Coordinator waits for consistency_level nodes to respond
Consistent Read - Cassandra
● compare digest of columns from replicas
● If any mismatches:
○ re-request full data set from same replicas
○ compare full data sets, send updates
○ block until out of date replicas respond
● Return merged data set to client
Read Repair - Cassandra
Converge requested data across all replicas
Piggy-backs on normal reads, but waits for all
replicas to respond (async)
Follows same algorithm as consistent reads
Conflict resolution - Riak
Vector clocks
● logical clock per object
● array of {incrementer, version, timestamp}
● maintains causal relationships
● safe in face of ‘concurrent’ writes
● performance penalty
● resolution burden pushed to caller
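The causal comparison can be sketched as follows, simplifying the slide's {incrementer, version, timestamp} entries to an incrementer -> version map (an assumption for brevity): one clock "descends" another if every counter is at least as large; otherwise the writes are concurrent and the siblings go back to the caller.

```python
def descends(a: dict, b: dict) -> bool:
    # a causally dominates b if it has seen every event b has
    return all(a.get(node, 0) >= ver for node, ver in b.items())

def reconcile(clock_a: dict, clock_b: dict) -> str:
    if descends(clock_a, clock_b):
        return "a wins"
    if descends(clock_b, clock_a):
        return "b wins"
    return "concurrent: return siblings to caller"

print(reconcile({"x": 2, "y": 1}, {"x": 1, "y": 1}))  # a wins
print(reconcile({"x": 2}, {"y": 1}))  # concurrent: return siblings to caller
```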
Conflict Resolution - Cassandra
Last Writer Wins
● every column has timestamp value
● “whatever timestamp caller passed in”
● “What time is it?”
○ http://aphyr.com/posts/299-the-trouble-with-timestamps
● faster
● system resolves conflicts
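Last-writer-wins is essentially one comparison per column: keep the (value, timestamp) cell with the larger caller-supplied timestamp. The tie-breaking below is a simplification, not Cassandra's exact rules.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    # a, b are (value, timestamp) cells for the same column;
    # larger timestamp wins, ties broken on value
    return max(a, b, key=lambda cell: (cell[1], cell[0]))

print(lww_merge(("old", 100), ("new", 200)))  # ('new', 200)
```

Simple and fast, but correctness depends entirely on the timestamps callers supply, which is the clock-skew trap the aphyr link above describes.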
Anti-entropy
Converge cold data
Merkle Tree exchange
Stream inconsistencies
IO/CPU intensive
Anti-entropy - Cassandra
Node repair - converges ranges owned by a
node with all replicas
● Initiator identifies peers
● Each participant reads range from disk,
generates Merkle Tree, returns MT
● Initiator compares all MTs
● Range exchange
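A toy Merkle-style comparison, with invented bucket counts and hashing (not Cassandra's tree parameters): hash key ranges into buckets, compare the roots, and flag only the buckets whose hashes differ, so two replicas stream only the mismatched ranges.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def leaves(rows: dict, buckets: int = 4):
    # partition keys into buckets and hash each bucket's contents
    out = [b""] * buckets
    for key, value in sorted(rows.items()):
        out[hash(key) % buckets] += f"{key}={value};".encode()
    return [h(chunk) for chunk in out]

def mismatched_buckets(rows_a: dict, rows_b: dict, buckets: int = 4):
    la, lb = leaves(rows_a, buckets), leaves(rows_b, buckets)
    if h(b"".join(la)) == h(b"".join(lb)):   # root hashes agree: in sync
        return []
    return [i for i in range(buckets) if la[i] != lb[i]]

a = {"k1": "v1", "k2": "v2"}
b = {"k1": "v1", "k2": "STALE"}
print(mismatched_buckets(a, b))   # only the bucket holding k2 differs
```

The payoff is the root comparison: replicas that agree exchange one hash instead of re-reading every key, which is why the expensive part is building the trees, not comparing them.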
Anti-Entropy - Riak
AAE - conceptually similar to Cassandra
Merkle Tree updated on every write
Leaf nodes contain keys, not hash values
Tree is rebuilt periodically
Each execution only between two vnodes
Multi Datacenter support
Cassandra
● in the box
● node interconnects are plain TCP sockets
○ two connections per node pair
● queries not restricted to local DC
○ read repair
○ node repair
Riak
● included in RiakEE (MDC)
● local nodes use disterl
● remote nodes use TCP
● queries do not span multiple regions
● repl types:
○ realtime
○ fullsync
● AAE
Riak @ Netflix
(hypothetically)
Subscriber data
c* = wide row implementation
● row key = custId (long)
● column per distinct attribute
○ subscriberId
○ name
○ subscription details
○ holds
riak = fits reasonably well with JSON/text blob
Movie Ratings
c* implementation:
● new ratings stored in individual columns
● recurring job to aggregate into JSON blob
● reads grab JSON + incremental updates
Riak = JSON blob already, append new ratings
Viewing History
Time-series of ‘viewable’ events
● one column per event
● playback/bookmark serialized JSON blob
● 7-8 months worth of playback data
Riak - time-series data doesn’t feel like a natural fit
“Large blob” storage
● Team wanted to store images in c*
● key -> one column
● blob size
Right in the wheelhouse for Riak/RiakCS
Operations
Priam (https://github.com/Netflix/priam)
● backup/restore
● Cassandra bootstrap / token assignment
● configuration management
● supports multi-region deployments
Every node decentralized from peers
Priam for Riak
(perceived) challenges in supporting Riak
● some degree of centralization
○ cluster launch
○ backups for eLevelDB
● prod -> test refresh (riak reip)
● MDC
BI Integration
Aegisthus - pipeline for importing into BI
● grab nightly backup snapshot for cluster
● convert to JSON
● merge, dedupe, find new data
● import into Hive, Teradata, etc
Downside is (semi-)stale data into BI
Alternative BI Integration
Live, secondary cluster
● C* - just another datacenter in cluster
● Riak - MultiDataCenter (MDC) solution
All mutations sent to secondary cluster
● what happens when things get slow?
● now part of c* repairs & riak full-sync
Wrap up
● Choice
● Cassandra and Riak are great databases
○ resilient to failure
○ flexible data modeling
○ strong communities
● Running databases in the cloud ain’t easy
Thank you, Basho!
Mark Phillips, Jordan West, Joe Blomstedt,
Andrew Thompson, @evanmcc,
Basho Tech Support
Q & A time

@jasobrown
Published description: Slides from my talk at riconwest on 2013-Oct-30. I compare Cassandra and Riak and present how, hypothetically, Riak could be used at Netflix.