JOINING A p2p CONVERSATION
JOAQUIN CASARES, THE LAST PICKLE
JOINING A p2p CONVERSATION
JOAQUIN CASARES
▸ The Last Pickle
▸ Consultant
▸ Previously:
▸ Umbel
▸ Software Engineer
▸ Riptano/DataStax
▸ Support Engineer
▸ Software Engineer-in-Test
▸ Demo Engineer
JOINING A p2p CONVERSATION
THE LAST PICKLE
▸ 50+ years combined experience with Apache Cassandra.
▸ We communicate ideas.
▸ We are committed to doing the right thing for both our team of experts and our clients.
▸ Our passion for sharing our knowledge is present in all that we do.
▸ Consider us a member of your team.
▸ Ultimately:
▸ We want you to be successful and have all the information to do so.
WHERE DO WE GO
FROM HERE?
OVERVIEW
Overview
▸ p2p Networks.
▸ Cassandra fundamentals.
▸ How to add capacity.
▸ How to check on the status.
▸ Things you shouldn't forget, I think.
▸ How to forget.
DEFINITION: P2P
KaZaA TO BITTORRENT
KaZaA: "P2P"
[Diagram: peers don't talk to each other directly; they connect through supernodes.]
KaZaA TO BITTORRENT
KaZaA: "CENTRALIZED P2P"
[Diagram: the supernodes, in turn, all depend on the central KAZAA.COM servers.]
KaZaA TO BITTORRENT
KaZaA: SHUTDOWN
[Diagram: KAZAA.COM goes dark. "CAN ANYONE HEAR ME? IS ANYONE ALIVE OUT THERE?" "NO."]
KaZaA TO BITTORRENT
BITTORRENT: A REAL DECENTRALIZED, DISTRIBUTED P2P NETWORK
[Diagram: self connects to a tracker and directly to many peers.]
[Diagram: a peer drops out ("MY CONNECTION WAS SEVERED.") and the swarm keeps working.]
[Diagram: even with the tracker blocked ("I'M BANNED BY A STATE ACTOR."), peers keep sharing ("I KNOW THAT HASH.").]
CASSANDRA TOKENS
LEGACY OWNERSHIP
[Diagram: a four-node ring with tokens A, G, M, and T. Each node owns the range that ends at its own token: (T, A], (A, G], (G, M], and (M, T].]
CASSANDRA TOKENS
VIRTUAL NODE ("VNODES") OWNERSHIP
[Diagram: the same ring split into eight smaller ranges at tokens A, D, G, J, M, Q, T, and W, with each of the four nodes owning several non-contiguous ranges.]
CASSANDRA TOKENS
VNODES - JOINING A NODE
[Diagram: a fifth node joins by inserting new tokens (B and P) into the ring, taking small slices of the existing ranges instead of one large contiguous chunk.]
USE 32-64 VNODES.
NOT THE DEFAULT OF 256 VNODES PER NODE.
The Last Pickle
CASSANDRA TOKENS
USE THE SAME NUM_TOKENS COUNT
ACROSS ALL MACHINES.
The Last Pickle
CASSANDRA TOKENS
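As a sketch of both tips above in cassandra.yaml (32 is one assumed pick from the 32-64 range; whatever value you choose must match on every node, and only takes effect before a node's first boot):

    # cassandra.yaml - identical on every node in the cluster
    num_tokens: 32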
CASSANDRA TOKENS
TIDBIT: MD5 TOKENS TO MURMUR3 TOKENS
▸ MD5 token range: 0 to 2^127
▸ Murmur3 token range: -2^63 to 2^63 - 1
▸ Murmur3:
▸ Better randomness
▸ Lower chance of collisions
▸ ~6x faster than MD5
SEED NODES
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - JOINING
[Diagram: a joining node asks a seed, "CAN I JOIN THE PARTY?" The seed spreads the word: "HEY BOB, WE HAVE A NEW FRIEND." "I'M BOB."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION
[Diagram: existing nodes hand ranges to the newcomer: "HAVE SOME OF MY TOKENS. I DIDN'T WANT THESE RANGES ANYWAY."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - SEED FAILURE
[Diagram: the seed node goes down, but the cluster carries on: "IT'S OK. I KNOW WHO MY FRIENDS ARE."]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - GOSSIP
[Diagram: nodes keep exchanging state over gossip: "I'M BOB." "HEY BOB, I'M STILL HERE."]
SEED NODES
TIDBIT: GOSSIP
▸ It works.
▸ Basically: echoes.
▸ CDN Gossip use case.
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - FAILED JOINING
[Diagram: a new node with a bad seed list cannot join: "NONE OF THESE ADDRESSES WORK. WHERE'S THE PARTY?"]
SEED NODES
CASSANDRA: P2P IMPLEMENTATION - FAILED JOINING OF SEED NODE
[Diagram: the very first seed node of a cluster has no seed to join: "SO... WAS IT THE CHICKEN? OR THE EGG?"]
USE THE SAME SEED NODES
THROUGHOUT THE CLUSTER.
The Last Pickle
SEED NODES
3-5 SEED NODES PER DATA CENTER.
The Last Pickle
SEED NODES
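The seed list itself lives in cassandra.yaml; a minimal sketch with hypothetical addresses (the same list goes on every node):

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "10.0.1.10,10.0.1.11,10.0.1.12"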
BOOTSTRAP
BOOTSTRAP
ANNOUNCING
[Diagram: the joining node announces itself to the cluster: "I'M JOINING!"]
BOOTSTRAP
STREAMING
[Diagram: existing replicas stream their data to the joining node: "THANKS FOR ALL THE DATA!"]
BOOTSTRAP
REPLICA OWNERSHIP
[Diagram: each range is replicated onto neighboring nodes, so the new node 5 receives not only the ranges ending at its tokens B and P, but also replicas of the nearby ranges.]
BOOTSTRAP
POST-BOOTSTRAP: CLEANUP
[Diagram: after the join, the old owners still hold stale replicas: "I REALLY DON'T NEED THIS EXTRA PRESSURE P, O, & N. SEE YOU, C!"]
BOOTSTRAP
POST-BOOTSTRAP: CLEANUP
[Diagram: after cleanup, each node only holds the replicas it currently owns: "I AM NOW RESPONSIBLE FOR BP."]
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Prerequisite: Is everything UN (Up/Normal)?
▸ nodetool status
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Are you hitting a disk capacity issue?
▸ df -h
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Are you hitting CPU capacity limits?
▸ top/htop
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Does request latency have room for improvement?
▸ nodetool cfstats
▸ nodetool tablestats on newer versions of Cassandra.
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Do you want to split up token hot spots?
▸ nodetool status $KEYSPACE
BOOTSTRAP
WHEN TO USE THE BOOTSTRAP PROCESS
▸ Is the prerequisite met and YES to any of the other questions? Then bootstrap!
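The checks above, as a one-screen sketch (my_keyspace is a placeholder):

    nodetool status                    # prerequisite: every node UN?
    df -h                              # disk headroom
    top                                # CPU headroom
    nodetool tablestats my_keyspace    # latency (nodetool cfstats pre-2.2)
    nodetool status my_keyspace        # ownership / token hot spots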
BOOTSTRAP
PLAY IT SAFE
▸ UJ (Up/Joining) is an ephemeral state that should appear after ~2 minutes, but wait 5 minutes to be safe.
▸ UN (Up/Normal) is a persistent state.
▸ With Cassandra 2.2+, we have nodetool bootstrap resume, or simply restarting
the node.
▸ With Cassandra pre-2.2, we must clear:
▸ data_file_directories
▸ commitlog_directory
▸ saved_caches_directory
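If a 2.2+ bootstrap stalls, resuming looks like this (run on the joining node):

    nodetool bootstrap resume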
IF "POPCORN" JOINING:
NODETOOL NETSTATS
The Last Pickle
BOOTSTRAP
BOOTSTRAP
DON'TS
▸ Seed nodes cannot be bootstrapped.
▸ No need to include the auto_bootstrap parameter in cassandra.yaml; it defaults to true.
▸ Do not join more than one node per rack concurrently.
▸ Do not start two bootstrap processes within 2 minutes of each other.
▸ Do not bootstrap a node while any other node is offline.
▸ The JVM option -Dcassandra.consistent.rangemovement=false can be used to override the default behavior.
▸ Will require a follow-up rolling repair.
▸ Do not bootstrap a node running a different version of Cassandra.
▸ Do not bootstrap new nodes into a mixed-version cluster.
▸ Don't forget about your racks! More on that later...
BOOTSTRAP
TIDBIT: CLEANUP
▸ Cleanup removes stale replicas off a node.
▸ Acts like a single-SSTable compaction.
▸ Does not need to be run manually, since any follow-up compactions will remove
stale data.
▸ Is useful when disk capacity is at its limit.
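When you do want to force it, cleanup runs per node; a sketch (the keyspace argument is optional, and my_keyspace is a placeholder):

    # on each pre-existing node, after the new node reaches UN
    nodetool cleanup
    nodetool cleanup my_keyspace    # or scope it to one keyspace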
KEEP DISK USAGE UNDER 50% TO
ALLOW FOR NORMAL COMPACTIONS TO
COMPLETE SUCCESSFULLY.
The Last Pickle
BOOTSTRAP
REPLACE_ADDRESS
REPLACE_ADDRESS
REPLACING
[Diagram: a replacement node starts up pointed at the dead node's address: "JOHN, ARE YOU AROUND?" "I'M REPLACING YOU, JOHN!"]
REPLACE_ADDRESS
REPLACING
[Diagram: the replacement announces itself: "HEY FELLOWS, I'M TINA, THE NEW JOHN." The cluster streams: "HEY TINA, HERE'S THE DATA JOHN SHOULD HAVE HAD."]
FOR IMPORTANT DATA:
USE AN ODD-NUMBERED
REPLICATION_FACTOR > 1.
The Last Pickle
REPLACE_ADDRESS
FOR IMPORTANT DATA:
WRITE AT A CONSISTENCY LEVEL OF
LOCAL_QUORUM OR HIGHER.
The Last Pickle
REPLACE_ADDRESS
TO AVOID STALE DATA:
READ AT A CONSISTENCY LEVEL OF
LOCAL_QUORUM OR HIGHER.
The Last Pickle
REPLACE_ADDRESS
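A cqlsh sketch of the three tips above (keyspace, table, and data center names are hypothetical):

    CREATE KEYSPACE important_data WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    CONSISTENCY LOCAL_QUORUM;    -- cqlsh session setting; drivers set this per query
    SELECT * FROM important_data.some_table LIMIT 1;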
REPLACE_ADDRESS
TIDBITS: REPLACE_ADDRESS
▸ If using cl.ONE, you might have a bad time.
▸ But:
▸ Remember to respect max_hint_window_in_ms, which defaults to 3 hours
and starts as soon as the original node is knocked offline.
▸ If the hinted handoff window is missed, a rolling repair may be needed.
▸ Use -Dcassandra.replace_address_first_boot=<IP_ADDRESS> to
prevent possible issues if you forget to remove the flag.
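Wiring the flag into startup, as a sketch (10.0.1.42 stands in for the dead node's address):

    # cassandra-env.sh, on the replacement node only
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.1.42"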
REPLACE_ADDRESS
TIDBITS: REPLACE_ADDRESS - EXPANDED
▸ If not using a consistency level higher than ONE, stale data or data loss is possible and likely in the event of a node
failure.
▸ But:
▸ Hinted handoff may prevent stale data or data loss if a new node is in the UN (Up/Normal) state within the
max_hint_window_in_ms, which defaults to 3 hours and starts as soon as the original node is knocked offline.
▸ If the hinted handoff window is missed, and if running with a replication factor > 1, and other replicas
received the newer mutation, a rolling repair will make the new replica consistent.
▸ Read-repair may prevent stale data from being returned and repair stale partitions, but the
dclocal_read_repair_chance table schema parameter defaults to 10% of all requests.
▸ Use -Dcassandra.replace_address_first_boot=<IP_ADDRESS> to prevent possible issues if you forget to
remove the flag.
REPLACE_ADDRESS
WHEN TO USE THE REPLACE_ADDRESS PROCESS
▸ Prerequisite: Is the node unrecoverable/inaccessible?
▸ Are the current snapshots too stale?
▸ Most times snapshots are used for disaster scenarios.
▸ Is there enough time to replace the old node and run a repair before
gc_grace_seconds, which defaults to 10 days?
▸ If not, use nodetool removenode.
▸ Is YES the response to all of the above questions? Then replace_address.
REBUILD
REBUILD: AN EASY WAY TO SWAP
TO NEW HARDWARE.
The Last Pickle
REBUILD
REBUILD
JOINING
[Diagram: the new data center's nodes join the cluster: "I'M BOB." "I'M JULIA." "BUT DON'T SEND US YOUR DATA. WE'LL ASK WHEN READY."]
REBUILD
REBUILD
[Diagram: the new nodes pull data from the source data center: "LET'S DO THIS!" "OK, READY NOW." "WOW, THOSE ARE A FEW GIGS!" "HOW MUCH LONGER, REALLY?"]
REBUILD
PROCESS: GETTING THINGS SORTED
▸ Prerequisite: Use the NetworkTopologyStrategy, instead of the SimpleStrategy.
▸ Prerequisite: Use a DCAwarePolicy with the Cassandra driver to restrict the contact points.
▸ Prerequisite: Use LOCAL_QUORUM and LOCAL_ONE consistency levels to restrict
requests to the specific data center.
▸ Do not define a replication strategy for the new data center at first.
▸ Bootstrap all intended nodes into the new data center.
▸ Because no replication strategy mentions the new data center yet, joining should
cause almost no streaming tasks.
REBUILD
PROCESS, PART 2: EXECUTING THE REBUILD
▸ Once all new nodes are in the new data center, continue.
▸ Modify the NTS settings to include the new data center.
▸ Run the rebuild process on as many concurrent nodes as latency metrics allow.
▸ To mitigate load on existing nodes, you may be able to use multiple data
center sources concurrently by using different data center parameters for the
nodetool rebuild command.
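Parts 1 and 2 together, as a sketch (keyspace and data center names are hypothetical):

    # in cqlsh, once every node in the new data center is up:
    #   ALTER KEYSPACE my_keyspace WITH replication =
    #     {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};
    # then on each node in the new data center, stream from the old one:
    nodetool rebuild -- dc_old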
REBUILD
PROCESS, PART 3: USING THE NEW DATA CENTER
▸ Once all nodes have completed the rebuild process, continue.
▸ If the intent was to remove a deprecated data center, update the
DCAwarePolicy for the Cassandra driver to point to the new data center and
restart the application. Then update NTS, and remove deprecated nodes.
▸ If the intent was to add a new data center, launch new application servers
within the same data center and modify the DCAwarePolicy to reference the
new data center.
REBUILD
WHEN TO USE THE REBUILD PROCESS
▸ Are you attempting to add or deprecate an entire data center?
▸ Migrate to new hardware?
▸ Moving to the cloud?
▸ Moving to a different cloud?
▸ Moving back from the cloud? :)
▸ Is YES the response to any of the above questions? Then rebuild.
THIS IS A WELL-TRODDEN PATH.
DON'T WORRY.
BE HAPPY.
The Last Pickle
REBUILD
STATUS
STATUS
CHECK PROGRESS - SHOW OUTPUT
▸ nodetool compactionstats
▸ Monitor pending compactions, which are a byproduct of:
▸ Streaming data.
▸ nodetool cleanup
▸ Use nodetool setcompactionthroughput to throttle disk load.
▸ nodetool netstats
▸ Monitor active streaming tasks.
▸ nodetool status
▸ Monitor each node's joining and up status.
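The same checks as one-liners:

    nodetool compactionstats              # pending compactions
    nodetool setcompactionthroughput 16   # throttle to ~16 MB/s if disks suffer
    nodetool netstats                     # active streaming sessions
    nodetool status                       # UJ -> UN progress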
RACKS
RACKS
PROPER BALANCE
▸ Balance is important when considering: disk load, token distribution, data center
load, and rack load.
▸ While other setups may be valid, Keep It Super Simple.
▸ Use 1 rack, or enough racks to equal the replication factor, for the data center.
▸ Ensure each rack always has an equal number of nodes.
▸ The racks split up the token range amongst themselves.
▸ Each data center will store its copies across racks, if available.
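For reference, each node declares its own rack and data center; a sketch assuming the GossipingPropertyFileSnitch, with hypothetical names:

    # conf/cassandra-rackdc.properties
    dc=dc1
    rack=rack1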
INTERNODE SECURITY
INTERNODE SECURITY
SSL ENCRYPTION
▸ Cassandra supports the following types of internode encryption:
▸ None.
▸ Inter-data center.
▸ Intra-data center, or inter-node.
▸ If data centers are separated by a public network, TLP recommends using inter-data
center encryption.
▸ If running with paranoid security settings, encryption can be used between each node,
regardless of topology settings.
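A minimal cassandra.yaml sketch for inter-data center encryption (keystore paths and passwords are placeholders):

    server_encryption_options:
      internode_encryption: dc    # none | all | dc | rack
      keystore: conf/.keystore
      keystore_password: changeme
      truststore: conf/.truststore
      truststore_password: changeme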
EXPOSED JMX PORT
EXPOSED JMX PORT
LOCKING DOWN YOUR JMX PORT
▸ Cassandra allows access to system metrics, system commands, and potentially
destructive commands via a JMX port.
▸ By default, Cassandra 2.1.4+ restricts access to the JMX port to localhost.
▸ If remote access to JMX is required, edit cassandra-env.sh to change access
or authentication settings.
▸ TLP still recommends proper firewall and security settings be used to restrict
access to Cassandra from only verified machines.
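If remote JMX really is required, the relevant cassandra-env.sh knobs look like this (the password file path is an assumption):

    LOCAL_JMX=no
    JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
    JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"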
REMOVING NODES
REMOVING NODES
DECOMMISSION
▸ nodetool decommission should be used for nodes that are still operational, but will no
longer be part of the cluster.
▸ When decommissioning a node, the node's replica ranges are redistributed amongst the
surviving nodes in the cluster.
▸ A "reverse bootstrap" occurs in which all replicas that the node is responsible for are
streamed to the new replica owners.
▸ Once all new replica owners hold all of the deprecated node's data, the node is removed
from the ring.
▸ After 72 hours, the node is removed from the gossip state.
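A sketch, run on the departing node itself:

    nodetool decommission    # streams its replicas away, then leaves the ring
    nodetool netstats        # watch the outbound streams from another terminal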
REMOVING NODES
REMOVENODE
▸ nodetool removenode can be used when a node has died and will not be
replaced.
▸ The removenode command does not handle any streaming tasks, so follow up
repairs are required to ensure the cluster is in a consistent state.
▸ The removenode command simply removes a node from the gossip state,
forcing surviving nodes within the cluster into being responsible for the
deprecated node's token ranges.
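A sketch; the Host ID comes from nodetool status, and the one below is made up:

    nodetool status      # copy the dead node's Host ID
    nodetool removenode 11111111-2222-3333-4444-555555555555
    nodetool repair -pr  # then a rolling repair, node by node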
REMOVING NODES
ASSASSINATE
▸ Note: This is NOT a hammer.
▸ Sometimes gossip state can become wonky with echoes of previously removed nodes. In
these cases, and only in these cases, should nodetool assassinate be used.
▸ Much like nodetool removenode, this command modifies the gossip state, but instead
of marking the node as being REMOVED, the entry is removed in its entirety.
▸ Sometimes a single command may not remove a stubborn gossip state. In these cases,
running nodetool assassinate across all nodes, in parallel, multiple times, may be
needed to remove any culprit gossip echoed states.
▸ Repeated note: This is NOT a hammer.
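Only for ghost gossip entries; a sketch with a placeholder IP (nodetool assassinate ships with 2.2+):

    nodetool assassinate 10.0.1.99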
RECAP
RECAP
p2p NETWORKS
▸ KaZaA: Not really p2p.
▸ BitTorrent: Decentralized p2p.
▸ Cassandra: Stateful p2p, via Gossip.
RECAP
Fundamentals
▸ Nodes own multiple token ranges, when using Vnodes.
▸ Seed nodes allow new nodes to enter the cluster.
▸ Seed nodes also help keep a "consistent" topological state.
RECAP
ADDING CAPACITY
▸ Bootstrap
▸ The normal process for adding a node.
▸ Follow up with nodetool cleanup.
▸ replace_address
▸ Useful when a node is completely lost.
▸ nodetool rebuild
▸ Used to add an entire data center.
RECAP
STATUS
▸ Adding nodes creates collateral processes:
▸ Compaction.
▸ Streaming.
▸ Gossip entries.
RECAP
BE MINDFUL
▸ Ensure racks remain balanced upon topological changes.
▸ Ensure inter-node encryption is considered, especially when communicating
over an open network.
▸ Ensure JMX access is not accidentally exposed to public access.
RECAP
REMOVING NODES
▸ nodetool decommission
▸ Deprecate a live node.
▸ nodetool removenode
▸ Remove a downed node and reassign token ranges.
▸ nodetool assassinate
▸ Remove a gossip entry entirely.
▸ Note: This is still NOT a hammer.
BUELLER? BUELLER?
QUESTIONS?
JOAQUIN@THELASTPICKLE.COM
Joaquin Casares, The Last Pickle
I'M BOB.
BON VOYAGE!
DIGIORNO
HASTA LA VISTA, BABY.
TTYL
TTYS
WELL THAT WAS FUN.
DON'T FORGET ABOUT ME!
WE'LL ALWAYS HAVE PARIS.
HERE'S LOOKING AT YOU, KID.
I KNOW NOW WHY YOU CRY.
CYA
