
Virtual Nodes: Rethinking Topology in Cassandra

A presentation on the recent work to transition Cassandra from its naive one-partition-per-node distribution to a proper virtual nodes implementation.



  1. Rethinking Topology in Cassandra. Cassandra Summit, June 11, 2013. Eric Evans, eevans@opennms.com, @jericevans. #Cassandra13
  2. DHT 101
  3. DHT 101, partitioning (ring diagram: nodes A, Z)
  4. DHT 101, partitioning (ring diagram: nodes A, Z, B, Y, C)
  5. DHT 101, partitioning (ring diagram: nodes A, Z, B, Y, C; Key = Aaa)
  6. DHT 101, replica placement (ring diagram: Key = Aaa)
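The partitioning and replica placement of slides 3-6 can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual code: the node names, token values, 7-bit token space, MD5 stand-in partitioner, and RF=3 are all illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

# Illustrative ring: one token per node, as in classic Cassandra.
# Node names and token values are made up for this sketch.
ring = [(25, "A"), (50, "B"), (75, "C"), (100, "Y"), (125, "Z")]
tokens = [t for t, _ in ring]

def token_for(key: str, modulus: int = 128) -> int:
    # Stand-in partitioner: hash the key onto the toy token space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % modulus

def replicas(key: str, rf: int = 3) -> list[str]:
    # The first node at or past the key's token owns it (slides 3-5);
    # the next rf-1 nodes clockwise hold the replicas (slide 6).
    start = bisect_right(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas("Aaa"))  # three distinct nodes, clockwise from the owner
```

Walking clockwise from the owning token is what makes each node responsible for the range between its predecessor's token and its own.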
  7. DHT 101, consistency (Consistency, Availability, Partition tolerance)
  8. DHT 101, scenario: consistency level = ONE (diagram: write W; replicas A, ?, ?)
  9. DHT 101, scenario: consistency level = ALL (diagram: read R; replicas A, ?, ?)
  10. DHT 101, scenario: quorum write (diagram: W; replicas A, B, ?)
  11. DHT 101, scenario: quorum read (diagram: R; replicas A, B, ?)
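The quorum scenarios in slides 10-11 rest on simple arithmetic: with RF=3 a quorum is 2, and any two acknowledging writers must overlap any two contacted readers. A small sketch, with replica names assumed for illustration:

```python
from itertools import combinations

# Assumed setup: replication factor 3, replicas named A, B, C.
replicas = ("A", "B", "C")
rf = len(replicas)
quorum = rf // 2 + 1  # floor(3/2) + 1 = 2

# Any write acknowledged by a quorum overlaps any subsequent quorum read,
# so the read always touches at least one replica holding the latest write.
for write_set in combinations(replicas, quorum):
    for read_set in combinations(replicas, quorum):
        assert set(write_set) & set(read_set), "quorums always intersect"

print(f"RF={rf}: W={quorum} + R={quorum} > {rf}, so reads see the latest write")
```

The general condition is W + R > RF; quorum reads and writes are just the symmetric case.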
  12. Awesome, yes?
  13. Well...
  14. Problem: poor request/stream distribution
  15. Distribution (slides 15-22: ring diagrams of nodes A, Z, B, Y, C, showing a joining node M and a replacement node A1)
  23. Problem: poor data distribution
  24. Distribution (slides 24-29: ring diagrams of nodes A, B, C, D, joined by E, then by F, G, and H)
  30. Virtual Nodes
  31. In a nutshell...
  32. Benefits: operationally simpler (no token management); better distribution of load; concurrent streaming (all hosts); smaller partitions mean greater reliability; better support for heterogeneous hardware
  33. Strategies: automatic sharding; fixed partition assignment; random token assignment
  34. Strategy, automatic sharding: partitions are split when data exceeds a threshold; newly created partitions are relocated to a host with less data; similar to Bigtable, or Mongo auto-sharding
  35. Strategy, fixed partition assignment: namespace divided into Q evenly-sized partitions; Q/N partitions assigned per host (where N is the number of hosts); joining hosts "steal" partitions evenly from existing hosts; used by Dynamo/Voldemort ("strategy 3" in the Dynamo paper)
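A toy sketch of the fixed-partition strategy described above, with illustrative values (Q=12, three hosts). The round-robin assignment and the "steal" loop are simplifying assumptions for the sketch, not Dynamo's or Voldemort's actual algorithm:

```python
# Q equal partitions, each host owning ~Q/N of them (slide 35).
Q = 12                                  # illustrative; real Q is much larger
hosts = {h: [] for h in ("A", "B", "C")}
for p in range(Q):                      # round-robin the Q partitions
    hosts[list(hosts)[p % len(hosts)]].append(p)

def join(new_host: str) -> None:
    # A joining host "steals" partitions evenly from existing hosts.
    hosts[new_host] = []
    target = Q // len(hosts)            # each host should end up near Q/N
    for h in list(hosts):
        while h != new_host and len(hosts[h]) > target:
            hosts[new_host].append(hosts[h].pop())

join("D")
print({h: len(ps) for h, ps in hosts.items()})  # each host now holds Q/N = 3
```

Because Q is fixed, only partition ownership moves on membership change; the partition boundaries themselves never do.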
  36. Strategy, random token assignment: each host is assigned T random tokens; T random tokens are generated for joining hosts, and the new tokens divide existing ranges; similar to libketama; identical to classic Cassandra when T=1
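The balance benefit of random tokens can be sketched numerically: with T=1 (classic Cassandra) per-host ownership is very uneven, while many tokens per host average the unevenness out. A simulation under assumed values (6 hosts, a 2^64 ring, a fixed seed), not a measurement of real Cassandra:

```python
import random

# Each host gets T random tokens; a host owns the span between each of
# its tokens and that token's predecessor on the ring (slide 36).
def ownership(num_hosts: int, T: int, ring_size: int = 2**64) -> dict[int, float]:
    rng = random.Random(42)
    ring = sorted((rng.randrange(ring_size), h)
                  for h in range(num_hosts) for _ in range(T))
    owned = {h: 0 for h in range(num_hosts)}
    prev = ring[-1][0] - ring_size      # wrap: last token precedes the first
    for token, host in ring:
        owned[host] += token - prev
        prev = token
    return {h: owned[h] / ring_size for h in owned}

# Compare the spread (largest share / smallest share) for T=1 vs T=256.
for T in (1, 256):
    shares = ownership(num_hosts=6, T=T)
    print(T, round(max(shares.values()) / min(shares.values()), 2))
```

The max/min ownership ratio shrinks sharply as T grows, which is why more, smaller partitions per host distribute both data and streaming load better.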
  37. Considerations: 1. number of partitions; 2. partition size; 3. how 1 changes with more nodes and data; 4. how 2 changes with more nodes and data
  38. Evaluating (N = number of hosts, B = total data size): random, O(N) partitions of size O(B/N); fixed, O(1) partitions of size O(B); auto-sharding, O(B) partitions of size O(1)
  39. Evaluating, automatic sharding: partition size is constant (great); number of partitions scales linearly with data size (bad)
  40. Evaluating, fixed partition assignment: number of partitions is constant (good); partition size scales linearly with data size (bad); greater operational complexity (bad)
  41. Evaluating, random token assignment: number of partitions scales linearly with the number of hosts (OK); partition size increases with more data and decreases with more hosts (good)
  42. Evaluating: automatic sharding; fixed partition assignment; random token assignment
  43. Cassandra
  44. Configuration, conf/cassandra.yaml: either initial_token: <token>,<token>,<token> (comma-separated list of tokens; new installs only) or num_tokens: 256 (number of tokens to generate)
  45. Configuration, nodetool info:
      Token            : (invoke with -T/--tokens to see all 256 tokens)
      ID               : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5
      Gossip active    : true
      Thrift active    : true
      Load             : 42.92 MB
      Generation No    : 1370016307
      Uptime (seconds) : 221
      Heap Memory (MB) : 998.72 / 1886.00
      Data Center      : datacenter1
      Rack             : rack1
      Exceptions       : 0
      Key Cache        : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...
      Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...
  46. Configuration, nodetool ring:
      Datacenter: datacenter1
      ==========
      Replicas: 0
      Address    Rack   Status  State   Load      Owns    Token
                                                          3074457345618258602
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  -9223372036854775808
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3098476543630901247
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3122495741643543892
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3146514939656186537
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3170534137668829183
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3194553335681471828
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3218572533694114473
      127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3242591731706757118
      ...
  47. Configuration, nodetool status:
      Datacenter: datacenter1
      =======================
      Status=Up/Down | State=Normal/Leaving/Joining/Moving
      --  Address    Load      Tokens  Owns   Host ID                               Rack
      UN  127.0.0.1  42.92 MB  256     33.3%  6a8dc22c-1f37-473f-8f7e-47742f4b83a5  rack1
      UN  127.0.0.2  60.17 MB  256     33.3%  26263a2b-768e-4a79-8d41-3624a14b13a8  rack1
      UN  127.0.0.3  56.85 MB  256     33.3%  5b3e208f-6d36-4c7b-b2bb-b7c476a1af66  rack1
  50. Migration (ring diagram: nodes A, B, D)
  51. Migration: edit conf/cassandra.yaml (num_tokens: 256) and restart
  52. Migration: convert to T contiguous tokens in existing ranges (ring diagram)
  53. Migration: shuffle (ring diagram)
  54. Shuffle: range transfers are queued on each host; hosts initiate transfers to themselves; pay attention to the logs!
  55. Shuffle:
      Usage: shuffle [options] <sub-command>
      Sub-commands:
        create       Initialize a new shuffle operation
        ls           List pending relocations
        clear        Clear pending relocations
        en[able]     Enable shuffling
        dis[able]    Disable shuffling
      Options:
        -dc, --only-dc        Apply only to named DC (create only)
        -u,  --username       JMX username
        -tp, --thrift-port    Thrift port number (Default: 9160)
        -p,  --port           JMX port number (Default: 7199)
        -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
        -en, --and-enable     Immediately enable shuffling (create only)
        -pw, --password       JMX password
        -H,  --help           Print help information
        -h,  --host           JMX hostname or IP address (Default: localhost)
        -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
  56. Performance
  57. removenode: Cassandra 1.2 vs. Cassandra 1.1 (bar chart, y-axis 0-450)
  58. bootstrap: Cassandra 1.2 vs. Cassandra 1.1 (bar chart, y-axis 0-600)
  59. The End. References:
      ● DeCandia, Giuseppe; Hastorun, Deniz; Jampani, Madan; Kakulapati, Gunavardhan; Lakshman, Avinash; Pilchin, Alex; Sivasubramanian, Swaminathan; Vosshall, Peter; Vogels, Werner. "Dynamo: Amazon's Highly Available Key-value Store." Web.
      ● Low, Richard. "Improving Cassandra's uptime with virtual nodes." Web.
      ● Overton, Sam. "Virtual Nodes Strategies." Web.
      ● Overton, Sam. "Virtual Nodes: Performance Results." Web.
      ● Jones, Richard. "libketama - a consistent hashing algo for memcache clients." Web.
