Virtual Nodes: Rethinking Topology in Cassandra

A presentation on the recent work to transition Cassandra from its naive 1-partition-per-node distribution to a proper virtual nodes implementation.


  1. Rethinking Topology in Cassandra. Cassandra Summit, June 11, 2013. Eric Evans, eevans@opennms.com, @jericevans. #Cassandra13
  2. DHT 101
  3. DHT 101: partitioning (ring diagram, nodes A and Z)
  4. DHT 101: partitioning (ring diagram, nodes A, B, C, Y, Z)
  5. DHT 101: partitioning (ring diagram, Key = Aaa placed on the ring of nodes A, B, C, Y, Z)
  6. DHT 101: replica placement (ring diagram, Key = Aaa and the nodes holding its replicas)
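The partitioning and replica-placement slides boil down to a consistent-hash lookup: hash the key onto the ring, then walk clockwise collecting nodes. A minimal sketch of that idea (the toy hash, hand-picked tokens, and node names are illustrative assumptions, not Cassandra's actual partitioner):

```python
import bisect
import hashlib

def token_for(key: str) -> int:
    """Map a key to a ring position (first 8 bytes of MD5, for the toy)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes: dict):
        # nodes: name -> token; tokens kept sorted for successor lookup
        self.tokens = sorted(nodes.values())
        self.owner = {tok: name for name, tok in nodes.items()}

    def replicas(self, key: str, rf: int = 3) -> list:
        """The rf nodes owning `key`: walk clockwise from the key's position."""
        i = bisect.bisect_right(self.tokens, token_for(key))
        return [self.owner[self.tokens[(i + k) % len(self.tokens)]]
                for k in range(rf)]

# Five nodes with hand-picked tokens, like the A/B/C/Y/Z ring in the slides.
ring = Ring({"A": 0, "B": 2**61, "C": 2**62, "Y": 2**63, "Z": 2**63 + 2**62})
print(ring.replicas("Aaa"))  # the three nodes clockwise of hash("Aaa")
```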
  7. DHT 101: consistency (Consistency, Availability, Partition tolerance)
  8. DHT 101, scenario: consistency level = ONE (write W; replicas A, ?, ?)
  9. DHT 101, scenario: consistency level = ALL (read R; replicas A, ?, ?)
  10. DHT 101, scenario: quorum write (W; replicas A, B, ?)
  11. DHT 101, scenario: quorum read (R; replicas A, B, ?)
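The four scenarios above are instances of one rule: a read and a write are guaranteed to intersect in at least one replica whenever their replica counts satisfy W + R > RF, and QUORUM is the smallest symmetric choice that achieves this. A worked check (the RF values are assumptions for illustration):

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1

for rf in (3, 5):
    w = r = quorum(rf)
    assert w + r > rf  # write and read sets must share at least one replica
    print(f"RF={rf}: quorum={w}, guaranteed overlap={w + r - rf} replica(s)")
```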
  12. Awesome, yes?
  13. Well...
  14. Problem: poor request/stream distribution
  15.-22. Distribution (sequence of ring diagrams: nodes A, B, C, Y, Z and M, illustrating how requests and streams concentrate on a few nodes; a new node A1 then joins beside A, relieving only its immediate neighbor)
  23. Problem: poor data distribution
  24.-29. Distribution (sequence of ring diagrams: a balanced four-node ring A, B, C, D; a fifth node E joins and shares become uneven; even distribution returns only after doubling to eight nodes A-H)
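As a rough illustration of why balance historically required doubling the cluster, here is a toy simulation (the node count and random tokens are assumptions) of a one-token-per-node ring. A single joining node splits exactly one existing range, so every other node's share is untouched and the skew persists:

```python
import random

def ownership(tokens, space=2**64):
    """Fraction of the ring each token's range covers (token back to predecessor)."""
    ts = sorted(tokens)
    return [((t - ts[i - 1]) % space) / space for i, t in enumerate(ts)]

random.seed(1)
ring = [random.randrange(2**64) for _ in range(4)]  # four nodes, one token each
before = ownership(ring)
ring.append(random.randrange(2**64))                # a fifth node joins...
after = ownership(ring)
# ...but it splits exactly one range; all other shares are unchanged.
print(f"before: min={min(before):.0%} max={max(before):.0%}")
print(f"after:  min={min(after):.0%} max={max(after):.0%}")
```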
  30. Virtual Nodes
  31. In a nutshell...
  32. Benefits:
      ● Operationally simpler (no token management)
      ● Better distribution of load
      ● Concurrent streaming (all hosts)
      ● Smaller partitions mean greater reliability
      ● Better supports heterogeneous hardware
  33. Strategies:
      ● Automatic sharding
      ● Fixed partition assignment
      ● Random token assignment
  34. Strategy: automatic sharding
      ● Partitions are split when data exceeds a threshold
      ● Newly created partitions are relocated to a host with less data
      ● Similar to Bigtable, or Mongo auto-sharding
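A toy sketch of the split-and-relocate idea just described (the threshold, the midpoint split, and the least-loaded placement rule are illustrative assumptions, not Bigtable's or Mongo's exact policy):

```python
THRESHOLD = 100  # toy size limit per partition

def maybe_split(partitions, hosts):
    """partitions: [{'lo','hi','size','host'}]; hosts: {name: total load}."""
    for p in list(partitions):
        if p["size"] > THRESHOLD:
            mid = (p["lo"] + p["hi"]) // 2
            target = min(hosts, key=hosts.get)  # relocate to least-loaded host
            half = p["size"] // 2
            partitions.append({"lo": mid, "hi": p["hi"],
                               "size": half, "host": target})
            p["hi"], p["size"] = mid, p["size"] - half
            hosts[p["host"]] -= half
            hosts[target] += half

hosts = {"A": 150, "B": 40}
parts = [{"lo": 0, "hi": 1000, "size": 150, "host": "A"}]
maybe_split(parts, hosts)
print(parts)   # the new upper half now lives on the lighter host, B
print(hosts)   # loads move closer together: A=75, B=115
```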
  35. Strategy: fixed partition assignment
      ● Namespace divided into Q evenly-sized partitions
      ● Q/N partitions assigned per host (where N is the number of hosts)
      ● Joining hosts “steal” partitions evenly from existing hosts
      ● Used by Dynamo/Voldemort (“strategy 3” in the Dynamo paper)
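A toy sketch of the fixed-partition scheme (the value of Q and the steal-from-the-fullest policy are illustrative assumptions):

```python
Q = 12  # fixed number of evenly-sized partitions (toy value)

def join(assignment, newcomer):
    """Give the newcomer ~Q/N partitions, stolen from the fullest hosts."""
    assignment[newcomer] = []
    target = Q // len(assignment)
    while len(assignment[newcomer]) < target:
        donor = max((h for h in assignment if h != newcomer),
                    key=lambda h: len(assignment[h]))
        assignment[newcomer].append(assignment[donor].pop())

cluster = {"A": list(range(6)), "B": list(range(6, 12))}  # Q/N = 6 each
join(cluster, "C")
print({h: sorted(ps) for h, ps in cluster.items()})       # 4 / 4 / 4 partitions
```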
  36. Strategy: random token assignment
      ● Each host is assigned T random tokens
      ● T random tokens are generated for joining hosts; new tokens divide existing ranges
      ● Similar to libketama; identical to Classic Cassandra when T=1
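And a sketch of random token assignment, the strategy Cassandra's virtual nodes implement: each host takes T random tokens, so a joining host subdivides many existing ranges at once, and a larger T flattens the per-host load (the host count and seed here are arbitrary assumptions):

```python
import random
from collections import defaultdict

def host_shares(num_hosts, t, space=2**64):
    """Per-host share of the ring when each host holds t random tokens."""
    token_host = {random.randrange(space): h
                  for h in range(num_hosts) for _ in range(t)}
    ts = sorted(token_host)
    load = defaultdict(float)
    for i, tok in enumerate(ts):
        load[token_host[tok]] += ((tok - ts[i - 1]) % space) / space
    return sorted(load.values())

random.seed(7)
for t in (1, 256):
    shares = host_shares(6, t)
    print(f"T={t:3}: min={shares[0]:.1%} max={shares[-1]:.1%}")
# with T=256 the min/max spread is far narrower than with T=1
```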
  37. Considerations:
      1. Number of partitions
      2. Partition size
      3. How (1) changes with more nodes and data
      4. How (2) changes with more nodes and data
  38. Evaluating:
      Strategy        No. partitions   Partition size
      Random          O(N)             O(B/N)
      Fixed           O(1)             O(B)
      Auto-sharding   O(B)             O(1)
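Here N is the number of hosts and B the total data size, as the next three slides spell out. As a worked example with assumed numbers: B = 1 TB on N = 10 hosts gives, under random assignment with T = 256 tokens, roughly N·T = 2,560 ranges of about 400 MB each (O(N) partitions of size O(B/N)); under fixed assignment with Q = 1,024, exactly 1,024 partitions growing toward 1 GB apiece as B grows (size O(B)); and under auto-sharding with a 100 MB split threshold, about 10,000 partitions of bounded size (O(B) partitions).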
  39. Evaluating: automatic sharding
      ● Partition size is constant (great)
      ● Number of partitions scales linearly with data size (bad)
  40. Evaluating: fixed partition assignment
      ● Number of partitions is constant (good)
      ● Partition size scales linearly with data size (bad)
      ● Greater operational complexity (bad)
  41. Evaluating: random token assignment
      ● Number of partitions scales linearly with the number of hosts (OK)
      ● Partition size increases with more data; decreases with more hosts (good)
  42. Evaluating:
      ● Automatic sharding
      ● Fixed partition assignment
      ● Random token assignment
  43. Cassandra
  44. Configuration (conf/cassandra.yaml):
      # Comma separated list of tokens (new installs only).
      initial_token: <token>,<token>,<token>

      or

      # Number of tokens to generate.
      num_tokens: 256
  45. Configuration (nodetool info):
      Token            : (invoke with -T/--tokens to see all 256 tokens)
      ID               : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5
      Gossip active    : true
      Thrift active    : true
      Load             : 42.92 MB
      Generation No    : 1370016307
      Uptime (seconds) : 221
      Heap Memory (MB) : 998.72 / 1886.00
      Data Center      : datacenter1
      Rack             : rack1
      Exceptions       : 0
      Key Cache        : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...
      Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...
  46. Configuration (nodetool ring):
      Datacenter: datacenter1
      ==========
      Replicas: 0
      Address    Rack   Status  State   Load      Owns    Token
                                                          3074457345618258602
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  -9223372036854775808
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3098476543630901247
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3122495741643543892
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3146514939656186537
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3170534137668829183
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3194553335681471828
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3218572533694114473
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3242591731706757118
      ...
  47.-49. Configuration (nodetool status):
      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address    Load      Tokens  Owns   Host ID                               Rack
      UN  127.0.0.1  42.92 MB  256     33.3%  6a8dc22c-1f37-473f-8f7e-47742f4b83a5  rack1
      UN  127.0.0.2  60.17 MB  256     33.3%  26263a2b-768e-4a79-8d41-3624a14b13a8  rack1
      UN  127.0.0.3  56.85 MB  256     33.3%  5b3e208f-6d36-4c7b-b2bb-b7c476a1af66  rack1
  50. Migration (ring diagram, nodes A, B, D)
  51. Migration: edit conf/cassandra.yaml and restart
      # Number of tokens to generate.
      num_tokens: 256
  52. Migration: convert to T contiguous tokens in existing ranges (ring diagram: each node's range subdivided into contiguous tokens it still owns)
  53. Migration: shuffle (same ring diagram: the contiguous ranges are then relocated across hosts)
  54. Shuffle:
      ● Range transfers are queued on each host
      ● Hosts initiate transfers to themselves
      ● Pay attention to the logs!
  55. Shuffle:
      Usage: shuffle [options] <sub-command>

      Sub-commands:
        create       Initialize a new shuffle operation
        ls           List pending relocations
        clear        Clear pending relocations
        en[able]     Enable shuffling
        dis[able]    Disable shuffling

      Options:
        -dc, --only-dc        Apply only to named DC (create only)
        -u,  --username       JMX username
        -tp, --thrift-port    Thrift port number (Default: 9160)
        -p,  --port           JMX port number (Default: 7199)
        -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
        -en, --and-enable     Immediately enable shuffling (create only)
        -pw, --password       JMX password
        -H,  --help           Print help information
        -h,  --host           JMX hostname or IP address (Default: localhost)
        -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
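Taken together with the sub-commands above, a plausible migration sequence (inferred from the usage text, not prescribed by the deck) is: shuffle create to queue the relocations, or shuffle create -en to create and enable in one step; shuffle ls to watch pending relocations drain; and shuffle dis to pause transfers if the cluster needs a breather, all while watching the logs as slide 54 advises.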
  56. Performance
  57. removenode (bar chart: Cassandra 1.2 vs. Cassandra 1.1, y-axis 0-450)
  58. bootstrap (bar chart: Cassandra 1.2 vs. Cassandra 1.1, y-axis 0-600)
  59. The End
      ● DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon's Highly Available Key-value Store.” Web.
      ● Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
      ● Overton, Sam. “Virtual Nodes Strategies.” Web.
      ● Overton, Sam. “Virtual Nodes: Performance Results.” Web.
      ● Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.
