DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Summit 2016

Thom Valley
Distributing the Enterprise, Safely

1 Distributing the Enterprise
2 The Fault Tolerance Fallacy
3 Testing (On the Spectrum)
4 Critical Factors
5 Example Test Scenarios
2© DataStax, All Rights Reserved.

Distributing the Enterprise - Assumptions
© DataStax, All Rights Reserved. 4
Can Install / Configure
Have Appropriate / Adequate Hardware (?)
Sound Data Model

Distributing the Enterprise
San
Francisco
New York
Dubai
Massively Scalable
Operationally Simple
Always On
Geographically Distributed
Fault Tolerant

The Fault Tolerance Fallacy
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
--Sun Microsystem Fellows…

My cluster has N copies of my data, I don’t need to backup…
Cassandra / DSE is easy to scale, just add nodes…
If a node fails, I will just replace it…
If a DC fails, everything will just fail over…
X and Y (big company) are using it, it must just work…

IF YOU CAN’T RECOVER, YOU’RE NOT FAULT TOLERANT

Very few organizations test for failure (at scale)
Most configurations / circumstances are unique
Anything that you haven’t proven is a potential failure point
Almost everything works at a small scale
The features designed to save you may cause your failure
Understanding Mean Time to Recover (MTTR)

Testing (On the Spectrum)
TEST?
testing acumen

Testing - Guidance
Test at scale (nodes / data size / throughput)
Test with a real data model / production compaction strategies
Test with your full stack
Don’t just test for speeds and feeds
Test for extended time periods (days or weeks, not hours)
Test failure scenarios
Test recovery scenarios (understand MTTR)

Critical Factors
Node Performance / Density
LAN Performance
WAN Performance
Bootstrap/Rebuild Window
DC Rebuild Window
Cluster Capacity
Configuration*
Hinted Handoff (?)
Repair
Rack / AZ Outages
Rolling Restarts
Upgrades

Critical Factors - Configuration
Compaction Strategy
VNODES
Heap Settings
Throttles

Critical Factors - Understanding
Impact of workload on cluster / nodes
Impact of failed state on cluster
Impact of recovery process on cluster
Mean time to recover - MTTR (node, data center)

Critical Factors - Results
Proper node / cluster sizing
Configuration validation
Operational readiness (understand MTTR):
Single Node
Full Rack / AZ
Full DC

General Guidelines
Whenever possible, be running full workload simulation
Run repair in background (where possible)
Monitor application / cluster
Test scenarios more than once

Single Node Failure (within hint window)
General Process
• Get workload running / monitoring
• Fail node within hint window (2 hour target)
• Bring node back online

Things to Watch
• Impact to workload during failure, replay, compaction
• Load on cluster during same
• Can you recover at node density / workload?
• MTTR to get to normal state?

Other Considerations
• Hint compaction (pre 3.0)
• Multi-DC with distributed workload / hint replay
• Impact of latency on Multi-DC
• Impact of throttling (or lack) on hint replay
• What’s a sustainable hint window w/ workload?

Single Node Failure (beyond hint window)
General Process
• Get workload running / monitoring
• Fail node for more than hint window
• Bring node back online / rebuild node
• (Optionally truncate hints if that’s your ops plan)

Things to Watch
• Impact to workload during failure, replay, compaction
• Impact to workload during node rebuild
• Load on cluster during same
• MTTR for rebuilding node with workload running,
including compaction

Other Considerations
• How does node density impact node rebuild?
• How does compaction strategy impact node rebuild?
• What is the impact of VNODES on streaming /
compaction during rebuild?

Rack / AZ Failure (within hint window)
General Process
Same as above (single node test) but for full AZ
Considerations
Can cluster sustain workload? At what utilization?
Can cluster sustain hint buildup?
What is a sustainable hint window with workload?

Rack / AZ Failure (beyond hint window)
General Process
Same as above (single node test) but for full Rack / AZ
Considerations
Can cluster sustain hint buildup? Truncate Hints?
Can cluster sustain rebuilding full Rack / AZ

Full DC Failure (within hint window)
General Process
Same as above (single node test) but for full DC
Considerations
Can cluster sustain hint buildup?
Impact of hint replay on remote DCs? WAN Utilization?
What is a sustainable hint window with workload?

Full DC Failure (beyond hint window)
General Process
Same as above (single node test) but for full DC
Considerations
Can cluster sustain hint buildup? Truncate Hints?
Can cluster sustain rebuilding full DC?
Impact on WAN links during full DC rebuild?

Long Running Workload
Considerations
Change in response times (read and/or write)
Impact of delete workloads
Impact of compaction settings
Changing impact of repair operations

Other Scenarios to Consider
Flapping node
Flapping network (single node)
Flapping network (WAN)
Over consuming WAN (Multi-DC)

Strategies
Pick scenarios that matter to you
Test in advance at as close to scale as possible
Repeat tests on major changes:
Infrastructure
Compaction Strategies
Workload (read/write mix, volume, etc.)

DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Summit 2016

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Summit 2016

Similar to DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Summit 2016 (20)

More from DataStax

More from DataStax (20)

Recently uploaded

Recently uploaded (20)

DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Summit 2016

Editor's Notes