C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
 

A discussion of the recent work to transition Cassandra from its naive 1-partition-per-node distribution to a proper virtual nodes implementation.


Presentation Transcript

    • Rethinking Topology in Cassandra
      Cassandra Summit, June 11, 2013
      Eric Evans
      eevans@opennms.com
      @jericevans
    • DHT 101
    • DHT 101: partitioning [ring diagrams: nodes A, B, C, Y, Z placed around the token ring; Key = Aaa hashes to a position on the ring]
    • DHT 101: replica placement [ring diagram: replicas for Key = Aaa placed on successive nodes clockwise from its position]
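    To make the partitioning and replica-placement slides concrete, here is a minimal Python sketch of a token ring. The node names, the MD5-based token function, and the replication factor are illustrative assumptions; Cassandra's actual partitioners and placement strategies differ.

        import bisect
        import hashlib

        def tok(value: str) -> int:
            # Map any string onto a 64-bit ring position (illustrative hash,
            # not Cassandra's real partitioner).
            return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

        nodes = {name: tok(name) for name in ("A", "B", "C", "Y", "Z")}
        ring = sorted((t, name) for name, t in nodes.items())
        tokens = [t for t, _ in ring]

        def replicas(key: str, rf: int = 3) -> list[str]:
            # Primary replica: first node clockwise from the key's position;
            # further replicas: the next rf-1 nodes around the ring.
            i = bisect.bisect_right(tokens, tok(key)) % len(ring)
            return [ring[(i + k) % len(ring)][1] for k in range(rf)]

        print(replicas("Aaa"))  # the nodes responsible for Key = Aaa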
    • DHT 101: consistency
      ● Consistency
      ● Availability
      ● Partition tolerance
    • DHT 101 scenario: consistency level = one [diagram: replicas A, ?, ?; write W]
    • DHT 101 scenario: consistency level = all [diagram: replicas A, ?, ?; read R]
    • DHT 101 scenario: quorum write [diagram: replicas A, B, ?; write W]
    • DHT 101 scenario: quorum read [diagram: replicas A, B, ?; read R]
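    The four scenarios reduce to simple arithmetic: with N replicas, a quorum is N // 2 + 1, so any quorum write set and quorum read set must share at least one replica. A minimal sketch:

        def quorum(n: int) -> int:
            # Smallest majority of n replicas.
            return n // 2 + 1

        n = 3
        w = r = quorum(n)   # W = R = 2 when N = 3
        assert w + r > n    # the write and read sets must overlap somewhere
        print(f"N={n}, W={w}, R={r}: every quorum read sees the latest quorum write")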
    • Awesome, yes?
    • Well...
    • Problem: poor request/stream distribution
    • Distribution [sequence of ring diagrams: nodes A, B, C, Y, Z with new node M joining; M splits a single existing range, and a later addition A1 does the same, so load and streaming stay concentrated on a few neighbors]
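    A sketch of why this happens, assuming one token per host (the hash function and node names are illustrative): a joining node M splits exactly one existing range, so only one neighbor gives up load and only that neighbor streams data.

        import hashlib

        RING = 2**64

        def tok(name: str) -> int:
            return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big") % RING

        def ownership(tokens: dict[str, int]) -> dict[str, float]:
            # Each node owns the segment from its predecessor's token to its own.
            ordered = sorted(tokens.items(), key=lambda kv: kv[1])
            prev = [ordered[-1]] + ordered[:-1]
            return {name: ((t - p) % RING) / RING
                    for (name, t), (_, p) in zip(ordered, prev)}

        nodes = {n: tok(n) for n in "ABCYZ"}
        print(ownership(nodes))      # shares are already uneven
        nodes["M"] = tok("M")
        print(ownership(nodes))      # M's gain comes from exactly one node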
    • Problem: poor data distribution
    • Distribution [sequence of ring diagrams: nodes A, B, C, D; E joins, then F, G, H; each join splits only one existing range, leaving data spread unevenly]
    • Virtual Nodes
    • In a nutshell...
    • Benefits
      ● Operationally simpler (no token management)
      ● Better distribution of load
      ● Concurrent streaming (all hosts)
      ● Smaller partitions mean greater reliability
      ● Better supports heterogeneous hardware
    • Strategies
      ● Automatic sharding
      ● Fixed partition assignment
      ● Random token assignment
    • Strategy: automatic sharding
      ● Partitions are split when data exceeds a threshold
      ● Newly created partitions are relocated to a host with less data
      ● Similar to Bigtable or Mongo auto-sharding
    • Strategy: fixed partition assignment
      ● Namespace divided into Q evenly-sized partitions
      ● Q/N partitions assigned per host (where N is number of hosts)
      ● Joining hosts “steal” partitions evenly from existing hosts
      ● Used by Dynamo/Voldemort (“strategy 3” in the Dynamo paper)
    • Strategy: random token assignment
      ● Each host assigned T random tokens
      ● T random tokens generated for joining hosts; new tokens divide existing ranges
      ● Similar to libketama; identical to classic Cassandra when T=1
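    A minimal simulation of the random-token strategy (host count, token counts, and seed are illustrative): each of N hosts takes T random tokens, and as T grows every host's share of the ring converges on 1/N. T=1 reproduces the classic one-token-per-node imbalance.

        import random

        RING = 2**64

        def shares(n_hosts: int, t: int, seed: int = 42) -> list[float]:
            # Assign t random tokens to each host, then total up the
            # fraction of the ring each host ends up owning.
            rng = random.Random(seed)
            owners = sorted((rng.randrange(RING), h)
                            for h in range(n_hosts) for _ in range(t))
            totals = [0] * n_hosts
            prev = owners[-1][0] - RING   # wrap the final range around the ring
            for token, host in owners:
                totals[host] += token - prev
                prev = token
            return [x / RING for x in totals]

        for t in (1, 16, 256):            # t=1 is classic Cassandra
            s = shares(6, t)
            print(f"T={t:3d}  min share={min(s):.3f}  max share={max(s):.3f}")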
    • Considerations
      1. Number of partitions
      2. Partition size
      3. How (1) changes with more nodes and data
      4. How (2) changes with more nodes and data
    • Evaluating
      Strategy        No. partitions   Partition size
      Random          O(N)             O(B/N)
      Fixed           O(1)             O(B)
      Auto-sharding   O(B)             O(1)
      (N = number of hosts; B = total data size)
    • Evaluating: automatic sharding
      ● Partition size is constant (great)
      ● Number of partitions scales linearly with data size (bad)
    • Evaluating: fixed partition assignment
      ● Number of partitions is constant (good)
      ● Partition size scales linearly with data size (bad)
      ● Greater operational complexity (bad)
    • Evaluating: random token assignment
      ● Number of partitions scales linearly with number of hosts (OK)
      ● Partition size increases with more data; decreases with more hosts (good)
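    Plugging illustrative numbers into the table above: with N hosts, T tokens per host, and B bytes of data in total, random assignment yields N*T partitions of roughly B/(N*T) bytes each.

        # Illustrative values only: 6 hosts, 256 tokens each, 6 TB of data.
        N, T, B = 6, 256, 6 * 10**12
        partitions = N * T                 # O(N): grows with hosts, not data
        print(partitions, B // partitions) # 1536 partitions, ~3.9 GB apiece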
    • Evaluating
      ● Automatic sharding
      ● Fixed partition assignment
      ● Random token assignment
    • Cassandra
    • Configuration: conf/cassandra.yaml

        # Comma separated list of tokens, (new
        # installs only).
        initial_token: <token>,<token>,<token>

      or

        # Number of tokens to generate.
        num_tokens: 256
    • Configuration: nodetool info

        Token            : (invoke with -T/--tokens to see all 256 tokens)
        ID               : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5
        Gossip active    : true
        Thrift active    : true
        Load             : 42.92 MB
        Generation No    : 1370016307
        Uptime (seconds) : 221
        Heap Memory (MB) : 998.72 / 1886.00
        Data Center      : datacenter1
        Rack             : rack1
        Exceptions       : 0
        Key Cache        : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...
        Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...
    • Configuration: nodetool ring

        Datacenter: datacenter1
        ==========
        Replicas: 0
        Address    Rack   Status  State   Load      Owns    Token
                                                            3074457345618258602
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  -9223372036854775808
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3098476543630901247
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3122495741643543892
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3146514939656186537
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3170534137668829183
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3194553335681471828
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  321857253369411447
        127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3242591731706757118
        ...
    • Configuration: nodetool status

        Datacenter: datacenter1
        =======================
        Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
        -- Address    Load      Tokens  Owns   Host ID                               Rack
        UN 127.0.0.1  42.92 MB  256     33.3%  6a8dc22c-1f37-473f-8f7e-47742f4b83a5  rack1
        UN 127.0.0.2  60.17 MB  256     33.3%  26263a2b-768e-4a79-8d41-3624a14b13a8  rack1
        UN 127.0.0.3  56.85 MB  256     33.3%  5b3e208f-6d36-4c7b-b2bb-b7c476a1af66  rack1
    • Migration [ring diagram: nodes A, B, D]
    • Migration: edit conf/cassandra.yaml and restart

        # Number of tokens to generate.
        num_tokens: 256
    • Migration: convert to T contiguous tokens in existing ranges [ring diagram: each host's range now represented by many contiguous tokens, e.g. B A A A A ... C A B]
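    A sketch of what that conversion does, under the assumption that ranges are just integer intervals on the ring: each host's single range is cut into T equal, contiguous slices, so no data has to move yet.

        RING = 2**64

        def split_range(left: int, right: int, t: int) -> list[int]:
            # Return t tokens dividing (left, right] into t contiguous slices;
            # the final token equals the range's original right endpoint.
            width = (right - left) % RING
            return [(left + width * (i + 1) // t) % RING for i in range(t)]

        # Example: one existing range carved into 8 contiguous vnode tokens.
        print(split_range(0, 2**40, 8))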
    • Migration: shuffle [same ring diagram: the contiguous tokens from the previous step, ready to be relocated]
    • Shuffle
      ● Range transfers are queued on each host
      ● Hosts initiate transfer to self
      ● Pay attention to the logs!
    • Shuffle

        Usage: shuffle [options] <sub-command>

        Sub-commands:
          create     Initialize a new shuffle operation
          ls         List pending relocations
          clear      Clear pending relocations
          en[able]   Enable shuffling
          dis[able]  Disable shuffling

        Options:
          -dc, --only-dc        Apply only to named DC (create only)
          -u,  --username       JMX username
          -tp, --thrift-port    Thrift port number (Default: 9160)
          -p,  --port           JMX port number (Default: 7199)
          -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
          -en, --and-enable     Immediately enable shuffling (create only)
          -pw, --password       JMX password
          -H,  --help           Print help information
          -h,  --host           JMX hostname or IP address (Default: localhost)
          -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
    • Performance
    • removenode [bar chart: Cassandra 1.2 vs Cassandra 1.1; y-axis ticks 0 to 450]
    • bootstrap [bar chart: Cassandra 1.2 vs Cassandra 1.1; y-axis ticks 0 to 600]
    • The End
      ● DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
      ● Low, Richard. “Improving Cassandra’s uptime with virtual nodes.” Web.
      ● Overton, Sam. “Virtual Nodes Strategies.” Web.
      ● Overton, Sam. “Virtual Nodes: Performance Results.” Web.
      ● Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.