JOINING A p2p CONVERSATION
JOAQUIN CASARES, THE LAST PICKLE
JOINING A p2p CONVERSATION
JOAQUIN CASARES
▸ The Last Pickle
▸ Consultant
▸ Previously:
▸ Umbel
▸ Software Engineer
▸ Riptano/DataStax
▸ Support Engineer
▸ Software Engineer-in-Test
▸ Demo Engineer
JOINING A p2p CONVERSATION
THE LAST PICKLE
▸ 50+ years combined experience with Apache Cassandra.
▸ We communicate ideas.
▸ We are committed to doing the right thing for both our team of experts and our clients.
▸ Our passion for sharing our knowledge is present in all that we do.
▸ Consider us a member of your team.
▸ Ultimately:
▸ We want you to be successful and have all the information to do so.
WHERE DO WE GO
FROM HERE?
OVERVIEW
Overview
▸ p2p Networks.
▸ Cassandra fundamentals.
▸ How to add capacity.
▸ How to check on the status.
▸ Things you shouldn't forget, I think.
▸ How to forget.
DEFINITION: P2P
KaZaA TO BITTORRENT
KaZaA: "P2P"
[Diagram: peers don't talk to each other directly; they connect through supernodes.]
KaZaA TO BITTORRENT
KaZaA: "CENTRALIZED P2P"
[Diagram: the supernodes, in turn, all depend on the central KAZAA.COM servers.]
KaZaA TO BITTORRENT
KaZaA: SHUTDOWN
[Diagram: KAZAA.COM goes dark. "CAN ANYONE HEAR ME? IS ANYONE ALIVE OUT THERE?" "NO."]
KaZaA TO BITTORRENT
BITTORRENT: A REAL DECENTRALIZED, DISTRIBUTED P2P NETWORK
[Diagram: self connects to a tracker and directly to many peers.]
[Diagram: a peer drops out ("MY CONNECTION WAS SEVERED.") and the swarm keeps working.]
[Diagram: even with the tracker blocked ("I'M BANNED BY A STATE ACTOR."), peers keep sharing ("I KNOW THAT HASH.").]
CASSANDRA TOKENS
LEGACY OWNERSHIP
[Diagram: a four-node ring with tokens A, G, M, and T. Each node owns the range that ends at its own token: (T, A], (A, G], (G, M], and (M, T].]
CASSANDRA TOKENS
VIRTUAL NODE ("VNODES") OWNERSHIP
[Diagram: the same ring split into eight smaller ranges at tokens A, D, G, J, M, Q, T, and W, with each of the four nodes owning several non-contiguous ranges.]
CASSANDRA TOKENS
VNODES - JOINING A NODE
[Diagram: a fifth node joins by inserting new tokens (B and P) into the ring, taking small slices of the existing ranges instead of one large contiguous chunk.]
USE 32-64 VNODES.
NOT THE DEFAULT OF 256 VNODES PER NODE.
The Last Pickle
CASSANDRA TOKENS
USE THE SAME NUM_TOKENS COUNT
ACROSS ALL MACHINES.
The Last Pickle
CASSANDRA TOKENS
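As a sketch of both tips above in cassandra.yaml (32 is one assumed pick from the 32-64 range; whatever value you choose must match on every node, and only takes effect before a node's first boot):

    # cassandra.yaml - identical on every node in the cluster
    num_tokens: 32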
CASSANDRA TOKENS
TIDBIT: MD5 TOKENS TO MURMUR3 TOKENS
▸ MD5 token range: 0 to 2^127
▸ Murmur3 token range: -2^63 to 2^63 - 1
▸ Murmur3:
▸ Better randomness
▸ Lower chance of collisions
▸ ~6x faster than MD5
SEED NODES
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - JOINING
[Diagram: a joining node asks a seed, "CAN I JOIN THE PARTY?" The seed spreads the word: "HEY BOB, WE HAVE A NEW FRIEND." "I'M BOB."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION
[Diagram: existing nodes hand ranges to the newcomer: "HAVE SOME OF MY TOKENS. I DIDN'T WANT THESE RANGES ANYWAY."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - SEED FAILURE
[Diagram: the seed node goes down, but the cluster carries on: "IT'S OK. I KNOW WHO MY FRIENDS ARE."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - GOSSIP
[Diagram: nodes keep exchanging state over gossip: "I'M BOB." "HEY BOB, I'M STILL HERE."]
SEED NODES
TIDBIT: GOSSIP
▸ It works.
▸ Basically: echoes.
▸ CDN Gossip use case.
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - FAILED JOINING
[Diagram: a new node with a bad seed list cannot join: "NONE OF THESE ADDRESSES WORK. WHERE'S THE PARTY?"]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - FAILED JOINING OF SEED NODE
[Diagram: the very first seed node of a cluster has no seed to join: "SO... WAS IT THE CHICKEN? OR THE EGG?"]
USE THE SAME SEED NODES
THROUGHOUT THE CLUSTER.
The Last Pickle
SEED NODES
3-5 SEED NODES PER DATA CENTER.
The Last Pickle
SEED NODES
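The seed list itself lives in cassandra.yaml; a minimal sketch with hypothetical addresses (the same list goes on every node):

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "10.0.1.10,10.0.1.11,10.0.1.12"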
BOOTSTRAP
BOOTSTRAP
ANNOUNCING
[Diagram: the joining node announces itself to the cluster: "I'M JOINING!"]
BOOTSTRAP
STREAMING
[Diagram: existing replicas stream their data to the joining node: "THANKS FOR ALL THE DATA!"]
BOOTSTRAP
REPLICA OWNERSHIP
[Diagram: each range is replicated onto neighboring nodes, so the new node 5 receives not only the ranges ending at its tokens B and P, but also replicas of the nearby ranges.]
BOOTSTRAP
POST-BOOTSTRAP: CLEANUP
[Diagram: after the join, the old owners still hold stale replicas: "I REALLY DON'T NEED THIS EXTRA PRESSURE P, O, & N. SEE YOU, C!"]
BOOTSTRAP
POST-BOOTSTRAP: CLEANUP
[Diagram: after cleanup, each node only holds the replicas it currently owns: "I AM NOW RESPONSIBLE FOR BP."]
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Prerequisite: Is everything UN (Up/Normal)?
▸ nodetool status
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Are you hitting a disk capacity issue?
▸ df -h
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Are you hitting CPU capacity limits?
▸ top/htop
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Does request latency have room for improvement?
▸ nodetool cfstats
▸ nodetool tablestats on newer versions of Cassandra.
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Do you want to split up token hot spots?
▸ nodetool status $KEYSPACE
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Is the prerequisite met and YES to any of the other questions? Then bootstrap!
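The checks above, as a one-screen sketch (my_keyspace is a placeholder):

    nodetool status                    # prerequisite: every node UN?
    df -h                              # disk headroom
    top                                # CPU headroom
    nodetool tablestats my_keyspace    # latency (nodetool cfstats pre-2.2)
    nodetool status my_keyspace        # ownership / token hot spots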
BOOTSTRAP
PLAY IT SAFE
▸ UJ (Up/Joining) is an ephemeral state that should appear after ~2 minutes, but wait 5 minutes to be safe.
▸ UN (Up/Normal) is a persistent state.
▸ With Cassandra 2.2+, we have nodetool bootstrap resume, or simply restarting
the node.
▸ With Cassandra pre-2.2, we must clear:
▸ data_file_directories
▸ commitlog_directory
▸ saved_caches_directory
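If a 2.2+ bootstrap stalls, resuming looks like this (run on the joining node):

    nodetool bootstrap resume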
IF "POPCORN" JOINING:
NODETOOL NETSTATS
The Last Pickle
BOOTSTRAP
BOOTSTRAP
DON'TS
▸ Seed nodes cannot be bootstrapped.
▸ No need to include the auto_bootstrap parameter in cassandra.yaml; it defaults to true.
▸ Do not join more than one node per rack concurrently.
▸ Do not start two bootstrap processes within 2 minutes of each other.
▸ Do not bootstrap a node while any other node is offline.
▸ The JVM option -Dcassandra.consistent.rangemovement=false can be used to override the default behavior.
▸ Will require a follow-up rolling repair.
▸ Do not bootstrap a node running a different version of Cassandra.
▸ Do not bootstrap new nodes into a mixed-version cluster.
▸ Don't forget about your racks! More on that later...
BOOTSTRAP
TIDBIT: CLEANUP
▸ Cleanup removes stale replicas off a node.
▸ Acts like a single-SSTable compaction.
▸ Does not need to be run manually, since any follow-up compactions will remove
stale data.
▸ Is useful when disk capacity is at its limit.
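When you do want to force it, cleanup runs per node; a sketch (the keyspace argument is optional, and my_keyspace is a placeholder):

    # on each pre-existing node, after the new node reaches UN
    nodetool cleanup
    nodetool cleanup my_keyspace    # or scope it to one keyspace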
KEEP DISK USAGE UNDER 50% TO
ALLOW FOR NORMAL COMPACTIONS TO
COMPLETE SUCCESSFULLY.
The Last Pickle
BOOTSTRAP
REPLACE_ADDRESS
REPLACE_ADDRESS
REPLACING
[Diagram: a replacement node starts up pointed at the dead node's address: "JOHN, ARE YOU AROUND?" "I'M REPLACING YOU, JOHN!"]
REPLACE_ADDRESS
REPLACING
[Diagram: the replacement announces itself: "HEY FELLOWS, I'M TINA, THE NEW JOHN." The cluster streams: "HEY TINA, HERE'S THE DATA JOHN SHOULD HAVE HAD."]
FOR IMPORTANT DATA:
USE AN ODD-NUMBERED
REPLICATION_FACTOR > 1.
The Last Pickle
REPLACE_ADDRESS
FOR IMPORTANT DATA:
WRITE AT A CONSISTENCY LEVEL OF
LOCAL_QUORUM OR HIGHER.
The Last Pickle
REPLACE_ADDRESS
TO AVOID STALE DATA:
READ AT A CONSISTENCY LEVEL OF
LOCAL_QUORUM OR HIGHER.
The Last Pickle
REPLACE_ADDRESS
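A cqlsh sketch of the three tips above (keyspace, table, and data center names are hypothetical):

    CREATE KEYSPACE important_data WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    CONSISTENCY LOCAL_QUORUM;    -- cqlsh session setting; drivers set this per query
    SELECT * FROM important_data.some_table LIMIT 1;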
REPLACE_ADDRESS
TIDBITS: REPLACE_ADDRESS
▸ If using cl.ONE, you might have a bad time.
▸ But:
▸ Remember to respect max_hint_window_in_ms, which defaults to 3 hours
and starts as soon as the original node is knocked offline.
▸ If the hinted handoff window is missed, a rolling repair may be needed.
▸ Use -Dcassandra.replace_address_first_boot=<IP_ADDRESS> to
prevent possible issues if you forget to remove the flag.
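Wiring the flag into startup, as a sketch (10.0.1.42 stands in for the dead node's address):

    # cassandra-env.sh, on the replacement node only
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.1.42"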
REPLACE_ADDRESS
TIDBITS: REPLACE_ADDRESS - EXPANDED
▸ If not using a consistency level higher than ONE, stale data or data loss is possible and likely in the event of a node
failure.
▸ But:
▸ Hinted handoff may prevent stale data or data loss if a new node is in the UN (Up/Normal) state within the
max_hint_window_in_ms, which defaults to 3 hours and starts as soon as the original node is knocked offline.
▸ If the hinted handoff window is missed, and if running with a replication factor > 1, and other replicas
received the newer mutation, a rolling repair will make the new replica consistent.
▸ Read-repair may prevent stale data from being returned and repair stale partitions, but the
dclocal_read_repair_chance table schema parameter defaults to 10% of all requests.
▸ Use -Dcassandra.replace_address_first_boot=<IP_ADDRESS> to prevent possible issues if you forget to
remove the flag.
REPLACE_ADDRESS
WHEN TO USE THE REPLACE_ADDRESS PROCESS
▸ Prerequisite: Is the node unrecoverable/inaccessible?
▸ Are the current snapshots too stale?
▸ Most times snapshots are used for disaster scenarios.
▸ Is there enough time to replace the old node and run a repair before
gc_grace_seconds, which defaults to 10 days?
▸ If not, use nodetool removenode.
▸ Is YES the response to all of the above questions? Then replace_address.
REBUILD
REBUILD: AN EASY WAY TO SWAP
TO NEW HARDWARE.
The Last Pickle
REBUILD
REBUILD
JOINING
[Diagram: the new data center's nodes join the cluster: "I'M BOB." "I'M JULIA." "BUT DON'T SEND US YOUR DATA. WE'LL ASK WHEN READY."]
REBUILD
REBUILD
[Diagram: the new nodes pull data from the source data center: "LET'S DO THIS!" "OK, READY NOW." "WOW, THOSE ARE A FEW GIGS!" "HOW MUCH LONGER, REALLY?"]
REBUILD
PROCESS: GETTING THINGS SORTED
▸ Prerequisite: Use the NetworkTopologyStrategy, instead of the SimpleStrategy.
▸ Prerequisite: Use a DCAwarePolicy with the Cassandra driver to restrict the contact points.
▸ Prerequisite: Use LOCAL_QUORUM and LOCAL_ONE consistency levels to restrict
requests to the specific data center.
▸ Do not define a replication strategy for the new data center at first.
▸ Bootstrap all intended nodes into the new data center.
▸ Because no replication strategy mentions the new data center yet, joining should
cause almost no streaming tasks.
REBUILD
PROCESS, PART 2: EXECUTING THE REBUILD
▸ Once all new nodes are in the new data center, continue.
▸ Modify the NTS settings to include the new data center.
▸ Run the rebuild process on as many concurrent nodes as latency metrics allow.
▸ To mitigate load on existing nodes, you may be able to use multiple data
center sources concurrently by using different data center parameters for the
nodetool rebuild command.
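Parts 1 and 2 together, as a sketch (keyspace and data center names are hypothetical):

    # in cqlsh, once every node in the new data center is up:
    #   ALTER KEYSPACE my_keyspace WITH replication =
    #     {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};
    # then on each node in the new data center, stream from the old one:
    nodetool rebuild -- dc_old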
REBUILD
PROCESS, PART 3: USING THE NEW DATA CENTER
▸ Once all nodes have completed the rebuild process, continue.
▸ If the intent was to remove a deprecated data center, update the
DCAwarePolicy for the Cassandra driver to point to the new data center and
restart the application. Then update NTS, and remove deprecated nodes.
▸ If the intent was to add a new data center, launch new application servers
within the same data center and modify the DCAwarePolicy to reference the
new data center.
REBUILD
WHEN TO USE THE REBUILD PROCESS
▸ Are you attempting to add or deprecate an entire data center?
▸ Migrate to new hardware?
▸ Moving to the cloud?
▸ Moving to a different cloud?
▸ Moving back from the cloud? :)
▸ Is YES the response to any of the above questions? Then rebuild.
THIS IS A WELL-TRODDEN PATH.
DON'T WORRY.
BE HAPPY.
The Last Pickle
REBUILD
STATUS
STATUS
CHECK PROGRESS - SHOW OUTPUT
▸ nodetool compactionstats
▸ Monitor pending compactions, which are a byproduct of:
▸ Streaming data.
▸ nodetool cleanup
▸ Use nodetool setcompactionthroughput to throttle disk load.
▸ nodetool netstats
▸ Monitor active streaming tasks.
▸ nodetool status
▸ Monitor each node's joining and up status.
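The same checks as one-liners:

    nodetool compactionstats              # pending compactions
    nodetool setcompactionthroughput 16   # throttle to ~16 MB/s if disks suffer
    nodetool netstats                     # active streaming sessions
    nodetool status                       # UJ -> UN progress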
RACKS
RACKS
PROPER BALANCE
▸ Balance is important when considering: disk load, token distribution, data center
load, and rack load.
▸ While other setups may be valid, Keep It Super Simple.
▸ Use 1 rack, or enough racks to equal the replication factor, for the data center.
▸ Ensure each rack always has an equal number of nodes.
▸ The racks split up the token range amongst themselves.
▸ Each data center will store its copies across racks, if available.
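For reference, each node declares its own rack and data center; a sketch assuming the GossipingPropertyFileSnitch, with hypothetical names:

    # conf/cassandra-rackdc.properties
    dc=dc1
    rack=rack1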
INTERNODE SECURITY
INTERNODE SECURITY
SSL ENCRYPTION
▸ Cassandra supports the following types of internode encryption:
▸ None.
▸ Inter-data center.
▸ Intra-data center, or inter-node.
▸ If data centers are separated by a public network, TLP recommends using inter-data
center encryption.
▸ If running with paranoid security settings, encryption can be used between each node,
regardless of topology settings.
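A minimal cassandra.yaml sketch for inter-data center encryption (keystore paths and passwords are placeholders):

    server_encryption_options:
      internode_encryption: dc    # none | all | dc | rack
      keystore: conf/.keystore
      keystore_password: changeme
      truststore: conf/.truststore
      truststore_password: changeme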
EXPOSED JMX PORT
EXPOSED JMX PORT
LOCKING DOWN YOUR JMX PORT
▸ Cassandra allows access to system metrics, system commands, and potentially
destructive commands via a JMX port.
▸ By default, Cassandra 2.1.4+ restricts access to the JMX port to localhost.
▸ If remote access to JMX is required, edit cassandra-env.sh to change access
or authentication settings.
▸ TLP still recommends proper firewall and security settings be used to restrict
access to Cassandra from only verified machines.
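If remote JMX really is required, the relevant cassandra-env.sh knobs look like this (the password file path is an assumption):

    LOCAL_JMX=no
    JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
    JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"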
REMOVING NODES
REMOVING NODES
DECOMMISSION
▸ nodetool decommission should be used for nodes that are still operational, but will no
longer be part of the cluster.
▸ When decommissioning a node, the node's replica ranges are redistributed amongst the
surviving nodes in the cluster.
▸ A "reverse bootstrap" occurs in which all replicas that the node is responsible for are
streamed to the new replica owners.
▸ Once all new replica owners hold all of the deprecated node's data, the node is removed
from the ring.
▸ After 72 hours, the node is removed from the gossip state.
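A sketch, run on the departing node itself:

    nodetool decommission    # streams its replicas away, then leaves the ring
    nodetool netstats        # watch the outbound streams from another terminal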
REMOVING NODES
REMOVENODE
▸ nodetool removenode can be used when a node has died and will not be
replaced.
▸ The removenode command does not handle any streaming tasks, so follow up
repairs are required to ensure the cluster is in a consistent state.
▸ The removenode command simply removes a node from the gossip state,
forcing surviving nodes within the cluster into being responsible for the
deprecated node's token ranges.
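A sketch; the Host ID comes from nodetool status, and the one below is made up:

    nodetool status      # copy the dead node's Host ID
    nodetool removenode 11111111-2222-3333-4444-555555555555
    nodetool repair -pr  # then a rolling repair, node by node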
REMOVING NODES
ASSASSINATE
▸ Note: This is NOT a hammer.
▸ Sometimes gossip state can become wonky with echoes of previously removed nodes. In
these cases, and only in these cases, should nodetool assassinate be used.
▸ Much like nodetool removenode, this command modifies the gossip state, but instead
of marking the node as being REMOVED, the entry is removed in its entirety.
▸ Sometimes a single command may not remove a stubborn gossip state. In these cases,
running nodetool assassinate across all nodes, in parallel, multiple times, may be
needed to remove any culprit gossip echoed states.
▸ Repeated note: This is NOT a hammer.
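Only for ghost gossip entries; a sketch with a placeholder IP (nodetool assassinate ships with 2.2+):

    nodetool assassinate 10.0.1.99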
RECAP
RECAP
p2p NETWORKS
▸ KaZaA: Not really p2p.
▸ BitTorrent: Decentralized p2p.
▸ Cassandra: Stateful p2p, via Gossip.
RECAP
Fundamentals
▸ Nodes own multiple token ranges, when using Vnodes.
▸ Seed nodes allow new nodes to enter the cluster.
▸ Seed nodes also help keep a "consistent" topological state.
RECAP
ADDING CAPACITY
▸ Bootstrap
▸ The normal process for adding a node.
▸ Follow up with nodetool cleanup.
▸ replace_address
▸ Useful when a node is completely lost.
▸ nodetool rebuild
▸ Used to add an entire data center.
RECAP
STATUS
▸ Adding nodes creates collateral processes:
▸ Compaction.
▸ Streaming.
▸ Gossip entries.
RECAP
BE MINDFUL
▸ Ensure racks remain balanced upon topological changes.
▸ Ensure inter-node encryption is considered, especially when communicating
over an open network.
▸ Ensure JMX access is not accidentally exposed to public access.
RECAP
REMOVING NODES
▸ nodetool decommission
▸ Deprecate a live node.
▸ nodetool removenode
▸ Remove a downed node and reassign token ranges.
▸ nodetool assassinate
▸ Remove a gossip entry entirely.
▸ Note: This is still NOT a hammer.
BUELLER? BUELLER?
QUESTIONS?
JOAQUIN@THELASTPICKLE.COM
Joaquin Casares, The Last Pickle
I'M BOB.
BON VOYAGE!
DIGIORNO
HASTA LA VISTA, BABY.
TTYL
TTYS
WELL THAT WAS FUN.
DON'T FORGET ABOUT ME!
WE'LL ALWAYS HAVE PARIS.
HERE'S LOOKING AT YOU, KID.
I KNOW NOW WHY YOU CRY.
CYA
