Always On:
Building Highly Available
Applications on Cassandra
Robbie Strickland
Who Am I?
Robbie Strickland
VP, Software Engineering
rstrickland@weather.com
@rs_atl An IBM Business
Who Am I?
• Contributor to C*
community since 2010
• DataStax MVP 2014/15/16
• Author, Cassandra High
Availability & Cassandra 3.x
High Availability
• Founder, ATL Cassandra User
Group
What is HA?
• Five nines – 99.999% uptime?
– That’s still about 5 minutes of down time per year
– … and 99.9% means roughly 9 hours – a full work day of down time!
• Can we do better?
Cassandra + HA
• No SPOF
• Multi-DC replication
• Incremental backups
• Client-side failure handling
• Server-side failure handling
• Lots of JMX stats
HA by Design (it’s not an add-on)
• Properly designed topology
• Data model that respects C* architecture
• Application that handles failure
• Monitoring strategy with early warning
• DevOps mentality
Table Stakes
• NetworkTopologyStrategy
• GossipingPropertyFileSnitch
– Or [YourCloud]Snitch
• At least 5 nodes
• RF=3
• No load balancer
HA Topology
Consistency Basics
• Start with LOCAL_QUORUM reads & writes
– Balances performance & availability, and provides full consistency within a single DC
– Experiment with eventual consistency (e.g. CL=ONE) in a controlled environment
• Avoid non-local CLs in multi-DC environments
– Otherwise it’s a crap shoot
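To make that concrete, here is a minimal sketch (assuming the DataStax Java driver 3.x, an existing session, and the contacts table used later in the deck) that pins a single statement to LOCAL_QUORUM; a cluster-wide default via QueryOptions works the same way:

import com.datastax.driver.core.{ConsistencyLevel, SimpleStatement}

// Read at LOCAL_QUORUM for this statement only
val stmt = new SimpleStatement("SELECT * FROM contacts WHERE id = 1")
stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)
val rows = session.execute(stmt)  // `session` is assumed to exist already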
Rack Failure
• Don’t put all your nodes in one rack!
• Use rack awareness
– Places replicas in different racks
• But don’t use the RackInferringSnitch
Rack Awareness
[Diagram: replicas R1, R2, R3 spread across Rack A and Rack B]
GossipingPropertyFileSnitch – cassandra-rackdc.properties on each node:
Rack A nodes:
dc=dc1
rack=a
Rack B nodes:
dc=dc1
rack=b
Rack Awareness (Cloud Edition)
[Diagram: replicas R1, R2, R3 spread across Availability Zone A and Availability Zone B]
[YourCloud]Snitch
(it’s automagic!)
Data Center Replication
[Diagram: two data centers, dc=us-1 and dc=eu-1, replicating to each other]
CREATE KEYSPACE myKeyspace
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-1': 3,
  'eu-1': 3
};
Multi-DC Consistency?
Assumption: LOCAL_QUORUM
[Diagram: dc=us-1 and dc=eu-1 are each fully consistent within themselves; replication between the two DCs is eventually consistent]
Multi-DC Routing with LOCAL CL
[Diagram: the client app in each region talks only to its local DC (us-1 or eu-1)]
Multi-DC Routing with non-LOCAL CL
[Diagram: with a non-LOCAL CL, each client app’s requests can span both us-1 and eu-1]
Multi-DC Routing
• Use DCAwareRoundRobinPolicy wrapped by
TokenAwarePolicy
– This is the default
– Prefers local DC – chosen based on host distance
and seed list
– BUT this can fail for logical DCs that are physically
co-located, or for improperly defined seed lists!
Multi-DC Routing
Pro tip:
import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

val localDC = "us-1"  // in practice, read this from config
val dcPolicy =
  new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder()
      .withLocalDc(localDC)  // pin the driver to the local DC
      .build()
  )
Be explicit!!
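For completeness, a hypothetical wiring sketch (driver 3.x; the contact point and keyspace name are placeholders) showing where dcPolicy plugs in, along with a LOCAL_QUORUM default:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, QueryOptions}

val cluster = Cluster.builder()
  .addContactPoint("10.0.0.1")          // placeholder seed address
  .withLoadBalancingPolicy(dcPolicy)    // the explicit DC-aware policy from above
  .withQueryOptions(
    new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
  .build()
val session = cluster.connect("myKeyspace")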
Handling DC Failure
• Make sure backup DC has sufficient capacity
– Don’t try to add capacity on the fly!
• Try to limit updates
– Avoids potential consistency issues on recovery
• Be careful with retry logic
– Isolate it to a single point in the stack
– Don’t DDoS yourself with retries!
Topology Lessons
• Leverage rack awareness
• Use LOCAL_QUORUM
– Full local consistency
– Eventual consistency across DCs
• Run incremental repairs to maintain inter-DC
consistency
• Explicitly route local app to local C* DC
• Plan for DC failure
Data Modeling
Quick Primer
• C* is a distributed hash table
– Partition key (first field in PK declaration)
determines placement in the cluster
– Efficient queries MUST know the key!
• Data for a given partition is naturally sorted
based on clustering columns
• Column range scans are efficient
Quick Primer
• All writes are immutable
– Deletes create tombstones
– Updates do not immediately purge old data
– Compaction has to sort all this out
Who Cares?
• Bad performance = application downtime &
lost users
• Lagging compaction is an operations
nightmare
• Some models & query patterns create serious
availability problems
Do
• Choose a partition key that distributes evenly
• Model your data based on common read
patterns
• Denormalize using collections & materialized
views
• Use efficient single-partition range queries
Don’t
• Create hot spots in either data or traffic
patterns
• Build a relational data model
• Create an application-side join
• Run multi-node queries
• Use batches to group unrelated writes
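On the last point: batches exist for atomicity within a partition, not for bulk loading. A minimal sketch (driver 3.x, with a hypothetical user_events table partitioned by user_id) of a reasonable batch, where every statement targets the same partition:

import com.datastax.driver.core.{BatchStatement, SimpleStatement}

val batch = new BatchStatement(BatchStatement.Type.LOGGED)
batch.add(new SimpleStatement(
  "INSERT INTO user_events (user_id, ts, event) VALUES (42, 1, 'login')"))
batch.add(new SimpleStatement(
  "INSERT INTO user_events (user_id, ts, event) VALUES (42, 2, 'click')"))
session.execute(batch)  // unrelated writes belong in separate statements, not one batch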
Problem Case #1
SELECT *
FROM contacts
WHERE id IN (1,3,5,7)
[Diagram: a 6-node cluster with keys 1–8 replicated across the nodes – the client’s coordinator must ask 4 of the 6 nodes to satisfy quorum for these keys]
With two of those nodes down:
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Keys 1, 3, and 5 all have sufficient replicas, yet the entire query fails because of 7.
Solution #1
• Option 1: Be optimistic and run it anyway
– If it fails, you can fall back to option 2
• Option 2: Run parallel queries for each key
– Return the results that are available
– Fall back to CL ONE for failed keys
– Client token awareness means coordinator does less
work
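A minimal sketch of Option 2 (assuming driver 3.x, the contacts table from the example, and an existing session); the keys are queried one by one here for brevity, but each lookup could just as easily be an executeAsync call run in parallel:

import com.datastax.driver.core.{ConsistencyLevel, Row, Session, SimpleStatement}
import scala.collection.JavaConverters._
import scala.util.Try

def fetchContacts(session: Session, ids: Seq[Int]): Seq[Row] =
  ids.flatMap { id =>
    def query(cl: ConsistencyLevel) = Try {
      val stmt = new SimpleStatement("SELECT * FROM contacts WHERE id = ?", Int.box(id))
      stmt.setConsistencyLevel(cl)
      session.execute(stmt).all().asScala
    }
    query(ConsistencyLevel.LOCAL_QUORUM)      // try full local consistency first
      .orElse(query(ConsistencyLevel.ONE))    // degrade only the keys that failed
      .getOrElse(Seq.empty)                   // no replicas at all for this key
  }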
Problem Case #2
CREATE INDEX ON contacts(birth_year)
SELECT *
FROM contacts
WHERE birth_year=1975
[Diagram: each of the 6 nodes holds secondary-index entries for birth_year=1975 (Jim, Sue, Sam, Tim, …)]
The index lives with the source data … so 5 nodes must be queried!
With two of those nodes down:
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Solution #2
• Option 1: Build your own index
– App has to maintain the index
• Option 2: Use a materialized view
– Not available before 3.0
• Option 3: Run it anyway
– Ok for small amounts of data (think 10s to 100s of
rows) that can live in memory
– Good for parallel analytics jobs (Spark, Hadoop, etc.)
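A sketch of Option 2 under the assumption of Cassandra 3.0+ and a contacts table keyed by id (the view name here is made up): create the view once at schema-setup time, then read it by birth_year instead of hitting the secondary index. Note the trade-off: all contacts for a given birth_year now share one view partition.

// Executed once, alongside the rest of the schema
session.execute(
  """CREATE MATERIALIZED VIEW IF NOT EXISTS contacts_by_birth_year AS
    |  SELECT * FROM contacts
    |  WHERE birth_year IS NOT NULL AND id IS NOT NULL
    |  PRIMARY KEY (birth_year, id)""".stripMargin)

// Reads become single-partition lookups on the view
val people = session.execute(
  "SELECT * FROM contacts_by_birth_year WHERE birth_year = 1975")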
Problem Case #3
CREATE TABLE sensor_readings (
sensorID uuid,
timestamp int,
reading decimal,
PRIMARY KEY (sensorID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Problem Case #3
• Partition will grow unbounded
– i.e. it creates wide rows
• Unsustainable number of columns in each
partition
• No way to archive off old data
Solution #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  time_bucket int,
  timestamp int,
  reading decimal,
  PRIMARY KEY ((sensorID, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
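To show how the bucket is used at read time, here is a minimal sketch (assuming day-sized buckets and driver 3.x; the bucket granularity is a design choice, not part of the original slide). The client computes the bucket, so every read stays within a single partition:

import com.datastax.driver.core.{ResultSet, Session, SimpleStatement}
import java.util.UUID

def dayBucket(epochSeconds: Long): Int = (epochSeconds / 86400).toInt  // one bucket per day

def readingsSince(session: Session, sensorID: UUID, since: Int): ResultSet = {
  // Single-partition range query: (sensorID, time_bucket) is the partition key
  val stmt = new SimpleStatement(
    "SELECT timestamp, reading FROM sensor_readings " +
      "WHERE sensorID = ? AND time_bucket = ? AND timestamp >= ?",
    sensorID, Int.box(dayBucket(since.toLong)), Int.box(since))
  session.execute(stmt)
}
// Reads spanning several days simply issue one such query per bucket.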
Monitoring
Monitoring Basics
• Enable remote JMX
• Connect a stats collector (jmxtrans, collectd,
etc.)
• Use nodetool for quick single-node queries
• C* tells you pretty much everything via JMX
Thread Pools
• C* is a SEDA architecture
– Essentially message queues feeding thread pools
– nodetool tpstats
• Pending messages are bad:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 103 0 0
RequestResponseStage 0 0 0 0 0
MutationStage 0 13234794 0 0 0
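Because all of this is exposed over JMX, the same pending-task counts can be collected programmatically. A hedged sketch (the MBean name follows the org.apache.cassandra.metrics ThreadPools pattern used in recent versions, so verify it against your version with a JMX browser; 7199 is the default JMX port):

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi")
val connector = JMXConnectorFactory.connect(url)
val mbeans = connector.getMBeanServerConnection

// Assumed metric: pending tasks for the mutation stage
val pending = mbeans.getAttribute(
  new ObjectName("org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=MutationStage,name=PendingTasks"),
  "Value")
println(s"MutationStage pending: $pending")
connector.close()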
Lagging Compaction
• Lagging compaction is the reason for many
performance issues
• Reads can grind to a halt in the worst case
• Use nodetool tablestats/cfstats &
compactionstats
Lagging Compaction
• Size-Tiered: watch for high SSTable counts:
Keyspace: my_keyspace
Read Count: 11207
Read Latency: 0.047931114482020164 ms.
Write Count: 17598
Write Latency: 0.053502954881236506 ms.
Pending Flushes: 0
Table: my_table
SSTable count: 84
Lagging Compaction
• Leveled: watch for SSTables remaining in L0:
Keyspace: my_keyspace
Read Count: 11207
Read Latency: 0.047931114482020164 ms.
Write Count: 17598
Write Latency: 0.053502954881236506 ms.
Pending Flushes: 0
Table: my_table
SSTable Count: 70
SSTables in each level: [50/4, 15/10, 5/100]
50 in L0 (should be 4)
Lagging Compaction Solution
• Triage:
– Check stats history to see if it’s a trend or a blip
– Increase compaction throughput using nodetool
setcompactionthroughput
– Temporarily switch to SizeTiered
• Do some digging:
– I/O problem?
– Add nodes?
Wide Rows / Hotspots
• Only takes one to wreak havoc
• It’s a data model problem
• Early detection is key!
• Watch partition max bytes
– Make sure it doesn’t grow unbounded
– … or become significantly larger than mean bytes
Wide Rows / Hotspots
• Use nodetool toppartitions to sample
reads/writes and find the offending partition
• Take action early to avoid OOM issues with:
– Compaction
– Streaming
– Reads
For More Info…
(shameless book plug)
Thanks!
Robbie Strickland
rstrickland@weather.com
@rs_atl An IBM Business


Editor's Notes

  1. Thank you for joining me for my talk today. My name is Robbie Strickland, and I’m going to talk about how to build highly available applications on Cassandra. If this is not the session you’re looking for, this would be a good time to head out and find the right one. Alternatively, if you’re an expert on this subject, please come talk to me afterward and maybe I can find you a new job…
  2. A little background for those who don’t know me. I lead the analytics team at The Weather Company, based in the beautiful city of Atlanta. I am responsible for our data warehouse and our analytics platform, as well as a team of engineers who get to work on cool analytics projects on massive and varied data sets. We were recently acquired by IBM’s analytics group, and so my role has expanded to include work on the larger IBM platform efforts as well.
  3. Why am I qualified to talk about this? I’ve been around the community for a while, since 2010 and Cassandra 0.5 to be exact, and I’ve worked on a variety of Cassandra-related open source projects. If there’s a way to screw things up with Cassandra, I’ve done it. If you’re interested in learning more about that, you can pick up a copy of my book, Cassandra High Availability, which has a newly released second edition focusing on the 3x series.
  4. I’d like to start by asking the question: what do we mean by high availability?
  5. A common definition is the so-called five nines of uptime. This sounds really good, until you do the math: even 99.999% still allows about five minutes of down time per year, while 99.9% works out to roughly nine hours, a full work day of down time per year! I don't know about your business, but to me that sounds like an unacceptable number.
  6. Can we do better than this?
  7. The conversation around HA and Cassandra is complex and multi-faceted, so it would be impossible to cover everything that needs to be said in a half hour talk. Today I’m going to touch on the highlights, and hopefully take away many of the unknown unknowns. Fortunately Cassandra was built from the ground up to be highly available, and if properly used can deliver 100% uptime on your critical applications. This is possible by leveraging some key capabilities, such as its distributed design with no single point of failure, including replication across data centers. It supports incremental backups, and robust failure handling features on both the client and the server. And Cassandra exposes pretty much anything you’d like to know about its inner workings via a host of JMX stats, so ignorance is no excuse.
  8. As you begin to design your application, I would encourage you to channel the Cassandra architects and think about availability from the start. It’s very difficult to bolt on HA capability to an existing app, and this is especially true with Cassandra.
  9. Let’s talk about the ingredients that comprise a successful HA deployment, starting with a properly designed topology. By this I mean the physical deployment of both the database and your application
  10. Next you need a data model that leverages Cassandra’s strengths and mitigates its weaknesses.
  11. You’ll want to make sure that your application handles failure as well, and there are some specific strategies I’ll discuss to drive that point home.
  12. You will need to keep a close watch on the key performance metrics so you have reaction time before a failure…
  13. and lastly you’ll need to cultivate a devops mentality if you don’t already think this way.
  14. Let’s lay a few ground rules. I’m going to assume a few things about your configuration that are commonly considered to be table stakes for any production Cassandra deployment.
  15. First, you should be using NetworkTopologyStrategy …
16. and either the GossipingPropertyFileSnitch or the appropriate snitch for your cloud provider. For the record, we run many multi-region EC2 clusters, yet we still use the Gossiping snitch because it gives us more control.
17. Next, I’m assuming you have at least 5 nodes, since anything less is really insufficient for production use.
18. A replication factor of three is the de facto standard; while there are reasons to have more or fewer, your reason probably isn’t valid. Pop quiz: If you set your replication factor to two, what constitutes a quorum? That’s right: two. Now let’s say you have five nodes. How many nodes can fail before some subset of your data becomes unavailable at quorum? Zero. So at RF=2, every node in your cluster becomes a catastrophic failure point.
  19. Lastly, please don’t put your cluster behind a load balancer. You will break the client-side smarts built into the driver and produce a lot of unnecessary overhead on your cluster.
  20. With that out of the way, let’s talk about how we build an HA topology.
21. As I’m sure you’re aware, Cassandra has a robust consistency model with a number of knobs to turn. There are plenty of great resources that cover this, so I’m going to leave you with just a few rules of thumb and let you explore further on your own.
22. I always recommend that people start with LOCAL_QUORUM reads and writes, because this gives you a good balance of performance and availability, and you don’t have to deal with eventual consistency within a single data center. As a corollary, my suggestion is to experiment with eventual consistency (meaning something less than quorum) in a controlled environment. You’ll want to gain some operational experience handling eventually consistent behavior before deploying a mission critical app.
  23. Second, don’t use non-local consistency levels in multi-data center environments, because the behavior will be unpredictable. I’ll cover this situation in detail later.
  24. If you follow the basic replication and consistency guidelines I just outlined, single node failures will be relatively straightforward to recover from. But what happens when someone trips over the power cord to your rack, or a switch fails? Fortunately Cassandra offers a mechanism to handle this, as long as you’re smart about your topology.
  25. Obviously if you put all your nodes in a single rack, you’re kind of on your own—so don’t do that!
  26. Assuming you have multiple racks, you can leverage the rack awareness feature, which places replicas in different racks.
27. However, I would advise against using the old RackInferringSnitch, which infers rack and data center from a node’s IP address; it makes assumptions about your network configuration that may not always hold.
28. Let’s look at how rack awareness works. Assuming you have two racks, A and B, Cassandra will ensure that the three replicas of each key are distributed across racks. This means you’ll have at least one available replica even if an entire rack is down. In this case, if rack B goes down, your application will have to support reading at CL ONE if you want to continue to serve that data.
29. To set this up with the GossipingPropertyFileSnitch, you’ll need to add a cassandra-rackdc.properties file to the config directory, where you’ll specify which data center and rack the node belongs to. This information is automatically gossiped to the rest of the cluster, so there’s no need to keep files in sync as with the legacy PropertyFileSnitch.
  30. Alternatively, if you’re using a cloud snitch, you can accomplish the same thing by locating your nodes in different availability zones. The cloud snitches will map the region to a data center and the availability zone to a rack. Just as with physical racks, it’s important to evenly distribute your nodes across zones if you want this to work properly.
31. Once you’ve improved local availability, it’s likely that you’ll want or need to expand geographically. There are a variety of reasons for this, such as disaster recovery, failover, and bringing the data closer to your users. Cassandra handles this multi-DC replication automatically through the keyspace definition. In my example here, I have a data center in the US, which we’re calling us-1, and one in Europe (but not England), which we’re calling eu-1.
  32. The setup for this is straightforward using the “with replication” clause on the create keyspace CQL command. You can specify a list of data center names with the corresponding number of replicas you want maintained in each.
  33. One important question when it comes to multi-DC operations is what sort of consistency guarantees you get, again assuming local_quorum reads and writes.
  34. I’ve already established that within a given DC, local_quorum gives you full consistency,
  35. But what guarantee do you get between data centers?
36. The answer is eventual consistency. This is an extremely important point when designing your application, and it brings me to a closely related topic: client-side routing.
  37. At the risk of stating the obvious, the ideal scenario is to have each client app only talk to local Cassandra nodes using a local consistency level.
  38. This is the right approach, but it’s surprisingly easy to mess this up. I’ve seen this simple rule break down due to misunderstanding about the relationship between consistency level and client load balancing policy.
  39. The breakdown often comes due to failure to set the consistency level to a local variant. This illustrates what happens when you don’t run a local consistency level.
  40. Don’t do this. You end up with traffic running all over the place, because Cassandra is trying to check replicas in the remote data center to satisfy your requested consistency guarantee. This can also happen if you give your app a list of all nodes in your cluster. So make sure you explicitly set a local consistency level, and make sure your client is only connecting to local nodes.
41. If you want to guarantee that traffic from your app is routed to the right node in the local DC, you’ll want to leverage the DCAwareRoundRobinPolicy, wrapped by the TokenAwarePolicy. The good news is this is the default configuration for the DataStax driver, but there is still potential for problems when relying on the default. If you don’t explicitly specify the DC, it will be chosen automatically using the provided seed list and host distance. We have run into issues where a non-local node was accidentally included in the seed list, which of course caused the driver to learn about other nodes and begin directing traffic to those nodes.
  42. To solve this, obtain your local DC using a local environment configuration, then explicitly specify it using the withLocalDc option, as I’ve shown here. This is essentially a fail-safe against a non-local node getting inadvertently added to your seed list.
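To make that concrete, here is a minimal sketch using the DataStax Java driver 3.x API. The class name, the contact points, and the idea of passing the DC name in from local environment configuration are my own illustration, not the code from the slide:

  import com.datastax.driver.core.Cluster;
  import com.datastax.driver.core.Session;
  import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
  import com.datastax.driver.core.policies.TokenAwarePolicy;

  public class LocalDcClusterFactory {
    // localDc comes from local environment config (e.g. "us-1");
    // localContactPoints should list nodes in that DC only.
    public static Session connect(String localDc, String... localContactPoints) {
      Cluster cluster = Cluster.builder()
          .addContactPoints(localContactPoints)
          .withLoadBalancingPolicy(new TokenAwarePolicy(
              DCAwareRoundRobinPolicy.builder()
                  .withLocalDc(localDc)   // fail-safe: pin the local DC explicitly
                  .build()))
          .build();
      return cluster.connect();
    }
  }

Even if a remote node sneaks into the contact point list, the driver will still treat only the named DC as local.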
  43. So how can you handle the failure of an entire DC?
  44. First, assuming you plan to fail over to another DC, please make sure your backup DC can handle the extra load. Trying to add capacity on the fly is unwise, as you’ll be introducing bootstrap overhead as well as the additional traffic. This is very likely to result in failure of your backup data center as well!
45. Second, try to limit updates, as they can cause consistency issues when you try to bring the downed data center back online. Many applications will have a read-only failure mode, which can be significantly better than being down altogether.
  46. Lastly, be very careful when designing your retry logic. Make sure to isolate the retries to a single point in the stack, so you don’t end up bringing your app down due to your own retry explosion.
  47. To recap the lessons learned on topology, make sure you’re leveraging rack awareness, use local quorum for full local consistency, run incremental repairs to maintain inter-DC consistency, explicitly set the local DC in your app, and create a plan to handle the failure of a DC.
  48. Now let’s move on to one of the most critical aspects of availability, and frankly the one that trips up most people. It’s easy to become lulled by the familiarity of the CQL syntax, but you really need to pay close attention to what Cassandra is doing with your data. Otherwise you’ll almost certainly run into performance and availability problems.
  49. I’ll begin with a quick primer, though I’m sure many of you know this stuff already. But these are critical points, so a quick recap is in order just in case.
50. First, Cassandra is a distributed hash table, and the partition key determines where data lives in the cluster, specifically which nodes contain replicas.
51. Data for a given partition is sorted based on the clustering column values, using the natural sort order of the type.
  52. It follows that queries resulting in column range scans are efficient, because they leverage this natural sorting.
53. Lastly, all Cassandra writes are immutable. Inserts and updates are really the same operation, and deletes write special markers called tombstones that shadow the old values. The old data hangs around after an update or delete, and compaction has to reconcile it, both to avoid holding onto a bunch of garbage and to keep reads efficient.
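To ground those points, here is a small hypothetical table, not one from the talk: events_by_user is partitioned by user_id, and rows within each partition are sorted by event_time, so a slice over a time range within one partition is cheap.

  CREATE TABLE events_by_user (
    user_id    uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (user_id, event_time)   -- user_id is the partition key, event_time the clustering column
  );

  -- Efficient: a range scan confined to one partition, served in clustering order
  SELECT * FROM events_by_user
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
    AND event_time >= '2016-01-01' AND event_time < '2016-02-01';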
  54. So why do we even care about these details?
  55. Obviously bad performance results in down time.
  56. Maybe less obviously, bad data models can result in significant compaction overhead, which can cause compaction to lag. Lagging compaction is a serious operations problem, especially if it’s allowed to continue undetected for too long.
  57. Also, some models and patterns have significant and inherent availability implications.
  58. A couple of general do’s and don’ts. First some rules of thumb:
  59. Choose your partition key carefully, such that you get even distribution across the cluster.
  60. And unlike your favorite third normal form data model, you’ll want to model based on your most common read patterns.
  61. To help accomplish this, you’ll want to denormalize your data. Collections and the new materialized view feature are valuable tools that can help you accomplish this.
  62. And this last point could be considered the unifying theory for Cassandra data modeling: always run single partition range queries. If you’re unsure what constitutes a range query, there are a number of excellent resources available to explain this, including a talk I did at a past summit called CQL under the hood.
  63. Now for a list of don’ts.
64. First off, avoid models that result in hot spots in either data load or traffic patterns. A hot spot is simply an unusually large amount of data being written to or read from a single partition key.
65. Secondly, if you find yourself building foreign key style relationships, you need to think differently about the problem. Relational models do not translate well to the Cassandra paradigm.
66. A corollary principle is to avoid joining data on the application side, unless the join table is just a few rows that you can cache in memory.
  67. Next, don’t run queries that require many nodes to answer. I’ll cover a couple of these cases in a minute.
  68. And lastly, batches are not meant for grouping unrelated writes. In a single-node relational database, batching can be very efficient for loading large amounts of data, but again, this does not translate to Cassandra. If you need to do this, there’s a bulk loader Java API that can be leveraged for this purpose.
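As a hedged illustration of that last point, using the hypothetical events_by_user table from earlier: a logged batch that spans unrelated partitions makes a single coordinator babysit writes that belong on different replica sets.

  -- Anti-pattern: unrelated partitions grouped only for convenience
  BEGIN BATCH
    INSERT INTO events_by_user (user_id, event_time, payload)
      VALUES (11111111-1111-1111-1111-111111111111, '2016-09-07 10:00:00', 'a');
    INSERT INTO events_by_user (user_id, event_time, payload)
      VALUES (22222222-2222-2222-2222-222222222222, '2016-09-07 10:00:01', 'b');
  APPLY BATCH;

Prefer individual, ideally asynchronous, writes for this kind of load, and reserve batches for writes to the same partition or for cases where you genuinely need them applied together.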
  69. Now let’s examine a few problem cases and talk about how they can be addressed. Case number one looks innocuous. You have a contacts table, and you want to retrieve a set of them by ID. So you use an IN clause to filter your results. Why would this be an issue?
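Concretely, the query in question might look something like this; a contacts table keyed by an integer id is my assumption for the example:

  SELECT * FROM contacts WHERE id IN (1, 3, 5, 7);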
  70. Let’s look at what’s happening here: The client issues the request, which will be routed to a coordinator based on one of the keys in the list. Assuming a quorum read, the coordinator will have to find two replicas of every key you asked for, resulting in four out of six nodes participating in the query.
  71. Now suppose we lose two nodes. Cassandra can satisfy quorum for keys 1, 3, and 5, but there aren’t enough available replicas to return the query for 7. Because the keys are grouped using the IN clause, the entire query will fail.
  72. There are two potential solutions to this problem.
  73. Option one is to just throw caution to the wind and do it anyway, then in the failure case you can fall back to option two,
  74. which is to run parallel queries for each key. If you do this, you are able to return any available results, which may be better than nothing at all. In addition, you can choose to reduce your consistency level to ONE for any failed keys, so you’re effectively taking a best effort approach to returning the latest data. This approach also allows the client to more effectively leverage token awareness, so the coordinator is doing less work.
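Here is a rough sketch of that approach with the DataStax Java driver; the contacts schema, the method shape, and the per-key CL ONE fallback are illustrative assumptions rather than code from the talk:

  import com.datastax.driver.core.*;
  import java.util.*;

  public class ContactLookup {
    // One query per key instead of a single IN clause; fall back to CL ONE per failed key.
    public static List<Row> fetchByIds(Session session, List<Integer> ids) {
      PreparedStatement stmt = session.prepare("SELECT * FROM contacts WHERE id = ?");
      Map<Integer, ResultSetFuture> futures = new LinkedHashMap<>();
      for (Integer id : ids) {
        futures.put(id, session.executeAsync(stmt.bind(id)));   // queries run in parallel
      }
      List<Row> rows = new ArrayList<>();
      for (Map.Entry<Integer, ResultSetFuture> e : futures.entrySet()) {
        try {
          rows.addAll(e.getValue().getUninterruptibly().all());
        } catch (Exception quorumFailure) {
          try {
            // Best effort: retry just this key at a reduced consistency level
            rows.addAll(session.execute(
                stmt.bind(e.getKey()).setConsistencyLevel(ConsistencyLevel.ONE)).all());
          } catch (Exception ignored) {
            // Give up on this key and return whatever else we have
          }
        }
      }
      return rows;
    }
  }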
75. Even after years of warning, our next case seems to stick around like a lingering cold. Revisiting the contacts table, let’s say you have a field called birth year that you want to use to filter your results. So you do what any good relational database architect would do: you create an index on that field.
76. But this is a bad plan, because index entries are stored locally on each node, alongside that node’s portion of the source table. Architecturally this is a sound strategy, but it means the coordinator has to consult the index across the cluster to find out which nodes contain data matching the value you’re querying. This pattern does not scale well, and it is prone to availability issues just like the IN clause.
  77. As in the previous example, if you lose two nodes you can no longer satisfy quorum for the query.
78. There are three potential alternatives to secondary indexes.
  79. One option is to build your own index, which may work well if the column you’re indexing has reasonable distribution. Birth year would likely be such a column, but a boolean value would not. The disadvantage is that your app has to maintain the index and deal with potential consistency and orphan issues. But this is a tried and true approach, and may be the right solution for some cases.
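A sketch of what option one might look like for the birth year case; the table name and column types are assumptions:

  CREATE TABLE contacts_by_birth_year (
    birth_year int,
    id         int,
    PRIMARY KEY (birth_year, id)
  );
  -- The application writes (birth_year, id) here whenever it writes a contact,
  -- then answers the query by reading this table and fetching full rows from contacts by id.

The cost, as noted, is that your code owns the consistency between the two tables.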
80. Option two is to use a materialized view, which gives you essentially the same underlying structure as option one, but has the advantage that Cassandra maintains it for you, alleviating the burden on the application. The downside is that you’ll need 3.x to get this feature.
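For comparison, a sketch of the materialized view version of the same idea (3.x syntax, same hypothetical columns), where Cassandra keeps the copy up to date for you:

  CREATE MATERIALIZED VIEW contacts_by_birth_year_mv AS
    SELECT * FROM contacts
    WHERE birth_year IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (birth_year, id);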
  81. The last option is to run it anyway, which may be ok if you have only small amounts of data that you’re returning. Indexes can also be a good pairing with analytics frameworks, where good parallelism is important. In this case, the distribution of the query across the cluster is actually a positive attribute.
  82. For our last case, let’s assume you want to capture sensor data, which is inherently time series, and you want to be able to read the latest few values. The obvious model would look like this, where you partition by sensorID and then group by timestamp. You can add the clustering order by clause to reverse the sort order, such that it’s stored with the latest value first. This model allows you to query a given sensor and obtain the readings in descending order by timestamp.
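That obvious model might look something like this; the names and types are my own stand-ins:

  CREATE TABLE sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);

  -- Latest few readings for one sensor, newest first
  SELECT * FROM sensor_readings
  WHERE sensor_id = 62c36092-82a1-3a00-93d1-46196ee77204
  LIMIT 10;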
83. But this model suffers from one of the most insidious of Cassandra evils: the unbounded partition, which is the worst form of the wide row problem.
  84. Eventually your partition sizes will become unsustainable, which will result in serious problems with compaction and streaming at the very least.
  85. Unfortunately, if you find yourself in this situation you will also realize that there’s no way to efficiently archive off old data, because doing so would create a significant number of tombstones and therefore compound the problem.
  86. The solution is to create a compound partition key, using sensorID plus time bucket to create a boundary for growth, where the time bucket can be known at query time. One important trick here is to choose your time bucket such that you only need at most two buckets to satisfy your most common queries. The reason is to limit the number of nodes that will have to be consulted to answer the query. It’s also worth noting that this concept where you add a known value to your partition key to limit growth and provide better distribution will generalize to other use cases as well.
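Here is a sketch of the bucketed version, assuming a daily bucket is coarse enough that your common queries touch at most one or two buckets; the bucket format is an assumption:

  CREATE TABLE sensor_readings_by_day (
    sensor_id    uuid,
    day          text,        -- e.g. '2016-09-07'
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, day), reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);

  -- "Latest readings" now hits today's bucket, falling back to yesterday's if needed
  SELECT * FROM sensor_readings_by_day
  WHERE sensor_id = 62c36092-82a1-3a00-93d1-46196ee77204 AND day = '2016-09-07'
  LIMIT 10;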
  87. Now to the last subject I want to cover today. Monitoring is a key part of any HA strategy, for reasons that I hope are obvious. What may be less obvious is exactly what you should be looking for to determine the health of the system. While I cannot hope to cover every possible scenario, I’m going to touch on some of the more critical problem areas that may be less obvious.
88. First, a few basic concepts.
  89. Before you can collect anything you’ll need to enable remote JMX in the cassandra-env.sh script as well as on the JVM itself.
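The exact flags vary by version and distribution, but the shape of it is roughly this, in cassandra-env.sh; treat this as an illustrative sketch and be sure to enable authentication (and ideally SSL) for anything production-facing:

  # cassandra-env.sh (illustrative; check your version's defaults)
  LOCAL_JMX=no
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=7199"
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"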
90. Then you’ll need some way to collect the stats, using jmxtrans, collectd, or something similar.
91. For simple, single-node diagnostics, nodetool provides a convenient interface to some of the more common questions you want to answer.
  92. Beyond that, Cassandra exposes just about every stat you can imagine through its JMX interface. So let’s look at a few things you might want to watch.
93. One very important metric to keep an eye on is the state of the thread pools. Cassandra uses a staged event driven architecture, or SEDA, that’s essentially a set of message queues feeding thread pools where the workers are dequeuing the messages. Nodetool tpstats gives you a view into what’s going on with the pools.
  94. The important thing to look for here is a buildup of pending messages, in this case on the mutation pool. As you may have guessed, this indicates that writes to disk aren’t keeping up with the queued requests. This doesn’t tell you why that’s happening, but if you’re also monitoring related areas like disk I/O, you should be able to quickly diagnose the problem. The point is to catch this early so you can resolve the situation before it gets out of hand.
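In practice that check looks something like this; the threshold for "too many pending" is a judgment call rather than a hard number:

  nodetool tpstats
  # Watch the Pending column for pools such as MutationStage and ReadStage.
  # A sustained, growing Pending count on MutationStage means writes are queuing
  # faster than they can be serviced; correlate with disk I/O before acting.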
95. Another very common but largely misunderstood problem relates to compaction, specifically when it gets behind.
96. Lagging compaction can cause significant performance issues, especially with reads.
97. Because compaction is responsible for maintaining a sane distribution of keys across SSTables, falling behind means reads may have to consult many tables; in the worst case, read latencies spiral out of control and eventually time out.
  98. To diagnose compaction issues, use nodetool tablestats and compactionstats.
  99. The metric you’re looking for will depend on the compaction strategy you’re using. For Size-tiered, keep an eye on SSTable count, which should stay within a reasonable margin. If your monitoring system shows a consistent growth in SSTables, you’ll need to take action to avoid the situation getting out of hand.
100. When leveled compaction gets behind the curve, you’ll start to see a buildup of SSTables in the lower levels, specifically in level 0. Since leveled compaction is designed to very quickly compact level 0 SSTables, you should never see more than a handful at a time.
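A quick sketch of what to run and what to look for; the keyspace name is a placeholder and the exact labels shift a little between versions:

  nodetool compactionstats        # pending tasks: a steadily growing number is the red flag
  nodetool tablestats mykeyspace  # "SSTable count" for size-tiered tables,
                                  # "SSTables in each level" (watch level 0) for leveled tables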
101. Dealing with a lagging compaction situation involves a two-part solution. The first part is quick triage to get back to as stable a state as possible. Start by making sure it’s a trend and not just a blip; this is where history is important, as you can’t make good decisions from a single data point. If you have the ability to do so, consider increasing compaction throughput. In some cases this may be all you need, as long as the cluster recovers successfully and can handle the extra I/O. If you’re running leveled compaction, which requires substantially more I/O than size-tiered, you can often recover by temporarily switching to size-tiered to catch up.
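Those triage knobs map to commands along these lines; the throughput number is illustrative, and the strategy switch is a temporary measure to apply with care:

  nodetool getcompactionthroughput
  nodetool setcompactionthroughput 64    # MB/s; 0 removes the throttle entirely

  -- CQL: temporarily fall back from leveled to size-tiered on a struggling table
  ALTER TABLE mykeyspace.mytable
    WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };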
  102. Ultimately you’ll need to figure out what’s causing compaction to lag. Sometimes you can just turn up the throughput, but often there’s an underlying problem, such as poor disk performance, or perhaps you’re underprovisioned and need to add nodes. Either way, these guidelines can help you keep your system running, buying you time to get to the bottom of it.
  103. As I mentioned earlier, one of the worst Cassandra problems I’ve personally experienced is related to wide rows.
104. It really only takes one to completely ruin your day.
105. Fundamentally, wide rows are a data model problem, so the fix is usually retooling your model, which is rarely quick or easy.
106. This is why you’ll want to find out about it as soon as possible, so you have some runway to deal with it before it takes down your application.
107. The key here is to watch the max partition bytes metric for each table. Make sure you don’t see unbounded growth, or a value that greatly exceeds the mean partition bytes.
  108. Once you detect a problem, you can often use the nodetool toppartitions tool to sample your traffic and get a list of candidate partition keys. This works as long as the traffic pattern at the time of the sampling is indicative of the hotspot pattern.
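A sketch of that detection workflow; the table name and sample duration are placeholders, and the tablestats labels may differ slightly by version:

  nodetool tablestats mykeyspace.contacts
  # Compare "Compacted partition maximum bytes" with "Compacted partition mean bytes";
  # a max that dwarfs the mean, or keeps growing between samples, suggests a wide row.

  nodetool toppartitions mykeyspace contacts 60000
  # Samples traffic for 60 seconds (duration is in milliseconds; see nodetool help)
  # and reports the hottest partition keys, assuming current traffic reflects the hot spot.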
  109. When you find a wide row, deal with it as soon as possible, or you may start seeing OOM issues with compaction, streaming, and reads.
  110. I’ve covered a lot of territory, but there’s much more detail to this subject. If you’d like to learn more, there’s this really amazing new book that was just printed, and I’d shamelessly encourage you to get your very own copy today.
  111. Thanks again for coming out to my talk today, and I’d love to answer any questions you may have.