Virtual Nodes: Rethinking Topology in Cassandra


  1. Rethinking Topology in Cassandra. ApacheCon Europe, November 7, 2012. Eric Evans, eevans@acunu.com, @jericevans
  2. DHT 101
  3. DHT 101: partitioning. The keyspace is a namespace encompassing all possible keys.
  4. DHT 101: partitioning. The namespace is divided into N partitions (where N is the number of nodes); partitions are mapped to nodes and placed evenly throughout the namespace.
  5. DHT 101: partitioning. A record, stored by key, is positioned on the next node (working clockwise) from where its key sorts in the namespace.
  6. DHT 101: replica placement. Additional copies (replicas) are stored on other nodes, commonly the next N-1 nodes, but anything deterministic will work (sketched below).
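      A minimal Python sketch of the two rules above (clockwise placement plus the next N-1 replicas), assuming a toy three-node ring and an MD5-based token function rather than Cassandra's actual partitioner:

          import bisect
          import hashlib

          def token(key: str) -> int:
              # Illustrative hash; Cassandra's partitioners work differently.
              return int(hashlib.md5(key.encode()).hexdigest(), 16)

          # Toy ring: sorted list of (token, node) pairs for nodes A, B, C.
          ring = sorted((token(name), name) for name in ("A", "B", "C"))
          positions = [t for t, _ in ring]

          def replicas(key: str, n: int = 3) -> list[str]:
              """First node clockwise from the key's token, plus the next n-1 nodes."""
              start = bisect.bisect_right(positions, token(key)) % len(ring)
              return [ring[(start + i) % len(ring)][1] for i in range(n)]

          print(replicas("Aaa"))  # three distinct nodes; order depends on the toy hash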
  7. DHT 101: consistency. With multiple copies comes a set of trade-offs, commonly articulated using the CAP theorem: at any given point we can only guarantee two of Consistency, Availability, and Partition tolerance.
  8. DHT 101 scenario: consistency level ONE. Writing at consistency level ONE provides very high availability; only one of the three member nodes needs to be up for the write to succeed.
  9. DHT 101 scenario: consistency level ALL. If strong consistency is required, reads at consistency ALL can be used with writes performed at ONE. The trade-off is availability: all three member nodes must be up, or the read fails.
  10. DHT 101 scenario: quorum write (R+W > N). Using QUORUM consistency, we only require floor(N/2)+1 nodes.
  11. DHT 101 scenario: quorum read (R+W > N). Using QUORUM consistency, we only require floor(N/2)+1 nodes (see the sketch below).
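      The R+W > N condition behind the two quorum slides, as a tiny sketch (a replication factor of 3 is assumed, as in the diagrams):

          def quorum(n: int) -> int:
              """Replicas required for a QUORUM read or write: floor(n/2) + 1."""
              return n // 2 + 1

          N = 3                # replication factor in the scenario
          W = R = quorum(N)    # 2 of 3 replicas
          assert R + W > N     # read and write sets overlap, so reads see the latest write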
  12. Awesome, yes?
  13. Well...
  14. Problem: poor load distribution
  15. Distributing load. B and C hold replicas of A.
  16. Distributing load. A and B hold replicas of Z.
  17. Distributing load. Z and A hold replicas of Y.
  18. Distributing load. Disaster strikes!
  19. Distributing load. The replica sets [Y,Z,A], [Z,A,B], and [A,B,C] all suffer the loss of A, which results in extra load on the neighboring nodes.
  20-21. Distributing load. Solution: replace/repair the down node (A1 in the diagram).
  22. Distributing load. Neighboring nodes are needed to stream the missing data to the replacement, which results in even more load on those neighbors.
  23. Problem: poor data distribution
  24. Distributing data. Ideal distribution of the keyspace across four nodes (A, B, C, D).
  25. Distributing data. Bootstrapping a new node (E) bisects one partition; the distribution is no longer ideal.
  26-27. Distributing data. Moving existing nodes means moving the corresponding data; not ideal.
  28-29. Distributing data. A frequently cited alternative: double the size of your cluster, bisecting all ranges.
  30. Virtual Nodes
  31. In a nutshell... "Nodes" on the ring are virtual, and many of them are mapped to each "real" node (host); see the sketch below.
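      A sketch of how lookup works once many virtual nodes map to each host: the ring is still a sorted list of tokens, but each token resolves to a physical host, and replica selection skips tokens that land on an already-chosen host. The token values and host names here are made up, and real Cassandra placement also considers racks and datacenters:

          import bisect

          # token -> physical host; several tokens map to the same host
          ring = {10: "host1", 25: "host2", 40: "host1", 55: "host3",
                  70: "host2", 85: "host3", 95: "host1"}
          tokens = sorted(ring)

          def replica_hosts(key_token: int, rf: int = 3) -> list[str]:
              """Walk clockwise from the key's token, collecting distinct physical hosts."""
              hosts: list[str] = []
              i = bisect.bisect_right(tokens, key_token)
              while len(hosts) < rf:
                  host = ring[tokens[i % len(tokens)]]
                  if host not in hosts:   # consecutive tokens may belong to the same host
                      hosts.append(host)
                  i += 1
              return hosts

          print(replica_hosts(42))  # ['host3', 'host2', 'host1']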
  32. Benefits • Operationally simpler (no token management) • Better distribution of load • Concurrent streaming involving all hosts • Smaller partitions mean greater reliability • Supports heterogeneous hardware
  33. Strategies • Automatic sharding • Fixed partition assignment • Random token assignment
  34. Strategy: automatic sharding • Partitions are split when data exceeds a threshold • Newly created partitions are relocated to a host with a lower data load • Similar to the sharding performed by Bigtable, or MongoDB auto-sharding
  35. Strategy: fixed partition assignment • Namespace divided into Q evenly-sized partitions • Q/N partitions assigned per host (where N is the number of hosts) • Joining hosts "steal" partitions evenly from existing hosts (sketched below) • Used by Dynamo and Voldemort (described in the Dynamo paper as "strategy 3")
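      A rough sketch of fixed partition assignment under illustrative assumptions (the value of Q, the host names, and the round-robin/stealing order are all made up for the example):

          Q = 12                                   # fixed number of evenly-sized partitions
          hosts = ["h1", "h2", "h3"]

          # Initial assignment: Q/N partitions per host, round-robin.
          owner = {p: hosts[p % len(hosts)] for p in range(Q)}

          def join(new_host: str) -> None:
              """A joining host steals partitions, one at a time, from the most-loaded hosts."""
              hosts.append(new_host)
              target = Q // len(hosts)             # partitions the new host should end up with
              counts = {h: sum(1 for o in owner.values() if o == h) for h in hosts}
              for p in sorted(owner):
                  if counts[new_host] >= target:
                      break
                  donor = max(counts, key=counts.get)
                  if owner[p] == donor:
                      owner[p] = new_host
                      counts[donor] -= 1
                      counts[new_host] += 1

          join("h4")
          print(owner)   # each host now owns roughly Q/4 partitions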
  36. Strategy: random token assignment • Each host is assigned T random tokens • T random tokens are generated for joining hosts; the new tokens divide existing ranges • Similar to libketama; identical to classic Cassandra when T=1 (sketched below)
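      Random token assignment is even simpler to sketch: a joining host just draws T random tokens, and each one lands inside, and therefore splits, some existing range. The value of T and the token-space size below are assumptions for readability; the configuration slides use num_tokens: 256:

          import random

          T = 4                        # tokens per host (num_tokens); 256 in the slides
          RING_SIZE = 2**64            # illustrative token space

          def join(ring: dict[int, str], host: str) -> None:
              """Assign T random tokens to a joining host; each token splits an existing range."""
              for _ in range(T):
                  ring[random.randrange(RING_SIZE)] = host

          ring: dict[int, str] = {}
          for h in ("host1", "host2", "host3"):
              join(ring, h)
          join(ring, "host4")          # the new host takes a slice out of many existing ranges

          # With T = 1 this degenerates to classic, one-token-per-node Cassandra.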
  37. Considerations: 1. Number of partitions 2. Partition size 3. How (1) changes with more nodes and data 4. How (2) changes with more nodes and data
  38. Evaluating:
      Strategy       | No. partitions | Partition size
      ---------------|----------------|---------------
      Random         | O(N)           | O(B/N)
      Fixed          | O(1)           | O(B)
      Auto-sharding  | O(B)           | O(1)
      (B ~ total data size, N ~ number of hosts)
  39. Evaluating: automatic sharding • Partition size is constant (great) • Number of partitions scales linearly with data size (bad)
  40. Evaluating: fixed partition assignment • Number of partitions is constant (good) • Partition size scales linearly with data size (bad) • Higher operational complexity (bad)
  41. Evaluating: random token assignment • Number of partitions scales linearly with the number of hosts (ok) • Partition size increases with more data and decreases with more hosts (good)
  42. Evaluating • Automatic sharding • Fixed partition assignment • Random token assignment
  43. Cassandra
  44. Configuration: conf/cassandra.yaml
      # Comma-separated list of tokens (new installs only).
      initial_token: <token>,<token>,<token>
      or
      # Number of tokens to generate.
      num_tokens: 256
      Two parameters control how tokens are assigned: initial_token now optionally accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens.
  45. Configuration: nodetool info
      Token            : (invoke with -T/--tokens to see all 256 tokens)
      ID               : 64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active    : true
      Thrift active    : true
      Load             : 92.69 KB
      Generation No    : 1351030018
      Uptime (seconds) : 45
      Heap Memory (MB) : 95.16 / 1956.00
      Data Center      : datacenter1
      Rack             : rack1
      Exceptions       : 0
      Key Cache        : size 240 (bytes), capacity 101711872 (bytes) ...
      Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, ...
      To keep the output readable, nodetool info no longer displays tokens (if there are more than one) unless the -T/--tokens argument is passed.
  46. Configuration: nodetool ring
      Datacenter: datacenter1
      ==========
      Replicas: 2
      Address    Rack   Status  State   Load      Owns    Token
                                                          9022770486425350384
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -9182469192098976078
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -9054823614314102214
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8970752544645156769
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8927190060345427739
      ...
      nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.
  47. Configuration: nodetool status
      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address   Load     Tokens  Owns   Host ID                               Rack
      UN  10.0.0.1  97.2 KB  256     66.0%  64090651-6034-41d5-bfc6-ddd24957f164  rack1
      UN  10.0.0.2  92.7 KB  256     66.2%  b3c3b03c-9202-4e7b-811a-9de89656ec4c  rack1
      UN  10.0.0.3  92.6 KB  256     67.7%  e4eef159-cb77-4627-84c4-14efbc868082  rack1
      The new go-to command is nodetool status.
  48. Configuration: nodetool status (same output as above). Of note: since it is no longer practical to name a host by its token (because it can have many), each host now has a unique host ID.
  49. Configuration: nodetool status (same output as above). Note the per-host token count: 256 tokens each.
  50. Migration. Starting point: an existing cluster of nodes A, B, and C.
  51. Migration: edit conf/cassandra.yaml and restart
      # Number of tokens to generate.
      num_tokens: 256
      Step 1: set num_tokens in cassandra.yaml and restart the node.
  52. Migration: convert to T contiguous tokens in the existing ranges. Each existing range is split into T contiguous tokens, which results in no change to placement (see the sketch below).
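      Why this step is safe, as a sketch: splitting a node's range into T contiguous tokens that all still belong to that node changes the ring's granularity but not which host owns any key. The token values and T below are illustrative, and wrap-around at the end of the ring is ignored:

          def split_range(prev_token: int, my_token: int, t: int) -> list[int]:
              """Replace a single token with t contiguous tokens covering the same range."""
              width = my_token - prev_token
              return [prev_token + (width * (i + 1)) // t for i in range(t)]

          # Node A previously owned the range (100, 200] via the single token 200.
          print(split_range(100, 200, 4))   # [125, 150, 175, 200] -- all still owned by A,
                                            # so no data moves until the shuffle step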
  53. Migration: shuffle. Step 2: initiate a shuffle operation; nodes randomly exchange ranges.
  54. Shuffle • Range transfers are queued on each host • Hosts initiate transfers of ranges to themselves • Pay attention to the logs!
  55. Shuffle: bin/shuffle
      Usage: shuffle [options] <sub-command>
      Sub-commands:
          create       Initialize a new shuffle operation
          ls           List pending relocations
          clear        Clear pending relocations
          en[able]     Enable shuffling
          dis[able]    Disable shuffling
      Options:
          -dc, --only-dc        Apply only to named DC (create only)
          -tp, --thrift-port    Thrift port number (Default: 9160)
          -p,  --port           JMX port number (Default: 7199)
          -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
          -en, --and-enable     Immediately enable shuffling (create only)
          -H,  --help           Print help information
          -h,  --host           JMX hostname or IP address (Default: localhost)
          -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
  56. Performance
  57. Performance: removenode. [Chart: Acunu Reflex / Cassandra 1.2 vs. Cassandra 1.1.] Measured on a 17-node cluster of EC2 m1.large instances with 460M rows.
  58. Performance: bootstrap. [Chart: Acunu Reflex / Cassandra 1.2 vs. Cassandra 1.1.] Measured on a 17-node cluster of EC2 m1.large instances with 460M rows.
  59. The End. References:
      • DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. "Dynamo: Amazon's Highly Available Key-value Store." Web.
      • Low, Richard. "Improving Cassandra's uptime with virtual nodes." Web.
      • Overton, Sam. "Virtual Nodes Strategies." Web.
      • Overton, Sam. "Virtual Nodes: Performance Results." Web.
      • Jones, Richard. "libketama - a consistent hashing algo for memcache clients." Web.
