Cassandra SF 2013 - In Case Of Emergency Break Glass

  1. CASSANDRA SUMMIT 2013 - IN CASE OF EMERGENCY BREAK GLASS. Aaron Morton, @aaronmorton, www.thelastpickle.com. #Cassandra13. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  2. About Me: Freelance Cassandra Consultant. Based in Wellington, New Zealand. Apache Cassandra Committer.
  3. Platform, Tools, Problems, Maintenance.
  4. The Platform
  5. The Platform & Clients
  6. The Platform & Running Clients
  7. The Platform & Reality: Consistency, Availability, Partition Tolerance.
  8. The Platform & Consistency: Strong Consistency (R + W > N); Eventual Consistency (R + W <= N).
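     For example, with replication factor N = 3: writing at QUORUM (W = 2) and reading at QUORUM (R = 2) gives 2 + 2 > 3, so every read overlaps at least one replica that acknowledged the write (strong consistency). Writing and reading at ONE gives 1 + 1 <= 3, so a read may miss the latest write until replication catches up (eventual consistency).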
  9. What Price Consistency? In a multi-DC cluster QUORUM and EACH_QUORUM involve cross-DC latency.
  10. The Platform & Availability: Maintain Consistency Level UP nodes for each Token Range.
  11. Best Case Failure with N=9 and RF 3: 100% Availability. (Diagram: Range A with Replica 1, Replica 2, Replica 3.)
  12. Worst Case Failure with N=9 and RF 3: 78% Availability. (Diagram: Range A and Range B.)
  13. The Platform & Partition Tolerance: A failed node does not create a partition.
  14. The Platform & Partition Tolerance. (Diagram.)
  15. The Platform & Partition Tolerance: Partitions occur when the network fails.
  16. The Platform & Partition Tolerance. (Diagram.)
  17. The Storage Engine: Optimised for Writes.
  18. Write Path: Append to the Write Ahead Log. (fsync every 10s by default, other options available.)
  19. Write Path: Merge new Columns into the Memtable. (Lock free, always in memory.)
  20. Write Path... Later: Asynchronously flush the Memtable to a new SSTable on disk. (May be 10's or 100's of MB in size.)
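     The commit log fsync behaviour mentioned on slide 18 is controlled in cassandra.yaml. A minimal sketch of the two modes for the 1.x line (values shown are illustrative):

        # periodic (default): acknowledge writes immediately, fsync the log on a timer
        commitlog_sync: periodic
        commitlog_sync_period_in_ms: 10000

        # batch (alternative): fsync before acknowledging, grouping writes into a window
        # commitlog_sync: batch
        # commitlog_sync_batch_window_in_ms: 50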
  21. SSTable Files: *-Data.db, *-Index.db, *-Filter.db (and others).
  22. Row Fragmentation. (Diagram: the row foo spread over several SSTables.)
        SSTable 1: foo { dishwasher (ts 10): tomato, purple (ts 10): cromulent }
        SSTable 2: foo { frink (ts 20): flayven, monkey (ts 10): embiggins }
        SSTable 3: (no fragment of foo)
        SSTable 4: foo { dishwasher (ts 15): tomacco }
        SSTable 5: (no fragment of foo)
  23. Read Path: Read columns from each SSTable, then merge results. (Roughly speaking.)
  24. Read Path: Use the Bloom Filter to determine if a row key does not exist in a SSTable. (In memory.)
  25. Read Path: Search for the prior key in the *-Index.db sample. (In memory.)
  26. Read Path: Scan *-Index.db from the prior key to find the search key and its *-Data.db offset. (On disk.)
  27. Read Path: Read *-Data.db from the offset, all columns or specific pages.
  28. Read purple, monkey, dishwasher. (Diagram: a Bloom Filter and Index Sample in memory for each SSTable; *-Index.db and *-Data.db on disk; the row foo fragmented as on slide 22.)
  29. Read With Key Cache. (Diagram: as slide 28, with a Key Cache in memory in front of each *-Index.db.)
  30. Read with Row Cache. (Diagram: as slide 29, with a Row Cache in memory in front of the SSTables.)
  31. Performant Reads: Design queries to read from a small number of SSTables.
  32. Performant Reads: Read a small number of named columns or a slice of columns.
  33. Performant Reads: Design the data model to support current application requirements.
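     As an illustration of reading a few named columns and a bounded slice (hypothetical CQL 3 tables and columns, not from the deck):

        SELECT first_name, last_name FROM users WHERE user_id = 'foo';
        SELECT event_time, payload FROM timeline
         WHERE user_id = 'foo' AND event_time >= '2013-06-01' LIMIT 100;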
  34. Platform, Tools, Problems, Maintenance.
  35. Logging: Configure via log4j-server.properties and the StorageServiceMBean.
  36. DEBUG Logging For One Class:
        log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG
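     The same level can also be changed on a running node through the StorageServiceMBean; a sketch using jmxterm, assuming the setLog4jLevel operation exposed by the 1.x StorageService MBean:

        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        $>open localhost:7199
        $>bean org.apache.cassandra.db:type=StorageService
        $>run setLog4jLevel org.apache.cassandra.thrift.CassandraServer DEBUG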
  37. Reading Logs:
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace=KS1, ColumnFamily=CF1) (estimated 403858136 bytes)
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 ColumnFamilyStore.java (line 634) Enqueuing flush of Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
        INFO [FlushWriter:42] 2013-04-20 14:03:50,788 Memtable.java (line 266) Writing Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
  38. GC Logs (cassandra-env.sh):
        # GC logging options -- uncomment to enable
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
        # JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
        # JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"
  39. ParNew GC Starting:
        {Heap before GC invocations=224115 (full 111):
         par new generation total 873856K, used 717289K ...)
         eden space 699136K, 100% used ...)
         from space 174720K, 10% used ...)
         to space 174720K, 0% used ...)
  40. Tenuring Distribution:
        240217.053: [ParNew
        Desired survivor size 89456640 bytes, new threshold 4 (max 4)
        - age 1: 22575936 bytes, 22575936 total
        - age 2: 350616 bytes, 22926552 total
        - age 3: 4380888 bytes, 27307440 total
        - age 4: 1155104 bytes, 28462544 total
  41. ParNew GC Finishing:
        Heap after GC invocations=224116 (full 111):
         par new generation total 873856K, used 31291K ...)
         eden space 699136K, 0% used ...)
         from space 174720K, 17% used ...)
         to space 174720K, 0% used ...)
  42. nodetool info:
        Token            : 0
        Gossip active    : true
        Load             : 130.64 GB
        Generation No    : 1369334297
        Uptime (seconds) : 29438
        Heap Memory (MB) : 3744.27 / 8025.38
        Data Center      : east
        Rack             : rack1
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 25364985 hits, 34874180 requests, 0.734 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 ...
  43. nodetool ring:
        Note: Ownership information does not include topology, please specify a keyspace.
        Address      DC    Rack   Status  State   Load       Owns    Token
        10.1.64.11   east  rack1  Up      Normal  130.64 GB  12.50%  0
        10.1.65.8    west  rack1  Up      Normal  88.79 GB   0.00%   1
        10.1.64.78   east  rack1  Up      Normal  52.66 GB   12.50%  212...216
        10.1.65.181  west  rack1  Up      Normal  65.99 GB   0.00%   212...217
        10.1.66.8    east  rack1  Up      Normal  64.38 GB   12.50%  425...432
        10.1.65.178  west  rack1  Up      Normal  77.94 GB   0.00%   425...433
        10.1.64.201  east  rack1  Up      Normal  56.42 GB   12.50%  638...648
        10.1.65.59   west  rack1  Up      Normal  74.5 GB    0.00%   638...649
        10.1.64.235  east  rack1  Up      Normal  79.68 GB   12.50%  850...864
        10.1.65.16   west  rack1  Up      Normal  62.05 GB   0.00%   850...865
        10.1.66.227  east  rack1  Up      Normal  106.73 GB  12.50%  106...080
        10.1.65.226  west  rack1  Up      Normal  79.26 GB   0.00%   106...081
        10.1.66.247  east  rack1  Up      Normal  66.68 GB   12.50%  127...295
        10.1.65.19   west  rack1  Up      Normal  102.45 GB  0.00%   127...297
        10.1.66.141  east  rack1  Up      Normal  53.72 GB   12.50%  148...512
        10.1.65.253  west  rack1  Up      Normal  54.25 GB   0.00%   148...513
  44. nodetool ring KS1:
        Address      DC    Rack   Status  State   Load       Effective-Ownership  Token
        10.1.64.11   east  rack1  Up      Normal  130.72 GB  12.50%               0
        10.1.65.8    west  rack1  Up      Normal  88.81 GB   12.50%               1
        10.1.64.78   east  rack1  Up      Normal  52.68 GB   12.50%               212...216
        10.1.65.181  west  rack1  Up      Normal  66.01 GB   12.50%               212...217
        10.1.66.8    east  rack1  Up      Normal  64.4 GB    12.50%               425...432
        10.1.65.178  west  rack1  Up      Normal  77.96 GB   12.50%               425...433
        10.1.64.201  east  rack1  Up      Normal  56.44 GB   12.50%               638...648
        10.1.65.59   west  rack1  Up      Normal  74.57 GB   12.50%               638...649
        10.1.64.235  east  rack1  Up      Normal  79.72 GB   12.50%               850...864
        10.1.65.16   west  rack1  Up      Normal  62.12 GB   12.50%               850...865
        10.1.66.227  east  rack1  Up      Normal  106.72 GB  12.50%               106...080
        10.1.65.226  west  rack1  Up      Normal  79.28 GB   12.50%               106...081
        10.1.66.247  east  rack1  Up      Normal  66.73 GB   12.50%               127...295
        10.1.65.19   west  rack1  Up      Normal  102.47 GB  12.50%               127...297
        10.1.66.141  east  rack1  Up      Normal  53.75 GB   12.50%               148...512
        10.1.65.253  west  rack1  Up      Normal  54.24 GB   12.50%               148...513
  45. nodetool status:
        $ nodetool status
        Datacenter: ams01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address       Load      Tokens  Owns   Host ID                               Rack
        UN  10.70.48.23   38.38 GB  256     19.0%  7c5fdfad-63c6-4f37-bb9f-a66271aa3423  RAC1
        UN  10.70.6.78    58.13 GB  256     18.3%  94e7f48f-d902-4d4a-9b87-81ccd6aa9e65  RAC1
        UN  10.70.47.126  53.89 GB  256     19.4%  f36f1f8c-1956-4850-8040-b58273277d83  RAC1
        Datacenter: wdc01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address       Load      Tokens  Owns   Host ID                               Rack
        UN  10.24.116.66  65.81 GB  256     22.1%  f9dba004-8c3d-4670-94a0-d301a9b775a8  RAC1
        UN  10.55.104.90  63.31 GB  256     21.2%  4746f1bd-85e1-4071-ae5e-9c5baac79469  RAC1
        UN  10.55.104.27  62.71 GB  256     21.2%  1a55cfd4-bb30-4250-b868-a9ae13d81ae1  RAC1
  46. nodetool cfstats:
        Keyspace: KS1
        Column Family: CF1
        SSTable count: 11
        Space used (live): 32769179336
        Space used (total): 32769179336
        Number of Keys (estimate): 73728
        Memtable Columns Count: 1069137
        Memtable Data Size: 216442624
        Memtable Switch Count: 3
        Read Count: 95
        Read Latency: NaN ms.
        Write Count: 1039417
        Write Latency: 0.068 ms.
        Bloom Filter False Positives: 345
        Bloom Filter False Ratio: 0.00000
        Bloom Filter Space Used: 230096
        Compacted row minimum size: 150
        Compacted row maximum size: 322381140
        Compacted row mean size: 2072156
  47. nodetool cfhistograms:
        $ nodetool cfhistograms KS1 CF1
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       67264     0              0             0         1331591
        2       19512     0              0             0         4241686
        3       35529     0              0             0         474784
        ...
        10      10299     1150           0             0         21768
        12      5475      3569           0             0         3993135
        14      1986      9098           0             0         1434778
        17      258       30916          0             0         366895
        20      0         52980          0             0         186524
        24      0         104463         0             0         25439063
        ...
        179     0         93             1823          1597      1284167
        215     0         84             3880          1231655   1147150
        258     0         170            5164          209282    956487
  48. nodetool proxyhistograms:
        $ nodetool proxyhistograms
        Offset  Read Latency  Write Latency  Range Latency
        60      0             15             0
        72      0             51             0
        86      0             241            0
        103     2             2003           0
        124     9             5798           0
        149     67            7348           0
        179     222           6453           0
        215     184           6071           0
        258     134           5436           0
        310     104           4936           0
        372     89            4997           0
        446     39            6383           0
        535     76797         7518           0
        642     9364748       96065          0
        770     16406421      152663         0
        924     7429538       97612          0
        1109    6781835       176829         0
  49. JMX via JConsole. (Screenshot.)
  50. JMX via MX4J. (Screenshot.)
  51. JMX via JMXTERM:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:type=StorageService
        #bean is set to org.apache.cassandra.db:type=StorageService
        $>info
        #mbean = org.apache.cassandra.db:type=StorageService
        #class name = org.apache.cassandra.service.StorageService
        # attributes
        %0 - AllDataFileLocations ([Ljava.lang.String;, r)
        %1 - CommitLogLocation (java.lang.String, r)
        %2 - CompactionThroughputMbPerSec (int, rw)
        ...
        # operations
        %1 - void bulkLoad(java.lang.String p1)
        %2 - void clearSnapshot(java.lang.String p1, [Ljava.lang.String; p2)
        %3 - void decommission()
  52. JVM Heap Dump via JMAP:
        jmap -dump:format=b,file=heap.bin pid
  53. JVM Heap Dump with YourKit. (Screenshot.)
  54. Platform, Tools, Problems, Maintenance.
  55. Corrupt SSTable. (Very rare.)
  56. Compaction Error:
        ERROR [CompactionExecutor:36] 2013-04-29 07:50:49,060 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[CompactionExecutor:36,1,main]
        java.lang.RuntimeException: Last written key DecoratedKey(138024912283272996716128964353306009224, 61386330356130622d616666362d376330612d666531662d373738616630636265396535) >= current key DecoratedKey(127065377405949402743383718901402082101, 64323962636163652d646561372d333039322d386166322d663064346132363963386131) writing into *-tmp-hf-7372-Data.db
        at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:160)
        at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
        at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
  57. Cause: Change in KeyValidator or a bug in older versions.
  58. Fix: nodetool scrub
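     scrub can be limited to the affected keyspace and column family; a sketch with the hypothetical names used earlier (scrub rewrites each SSTable, skipping rows it cannot read):

        $ nodetool -h localhost scrub KS1 CF1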
  59. Dropped Messages
  60. Logs:
        MessagingService.java (line 658) 173 READ messages dropped in last 5000ms
        StatusLogger.java (line 57) Pool Name             Active  Pending
        StatusLogger.java (line 72) ReadStage             32      284
        StatusLogger.java (line 72) RequestResponseStage  1       254
        StatusLogger.java (line 72) ReadRepairStage       0       0
  61. nodetool tpstats:
        Message type      Dropped
        RANGE_SLICE       0
        READ_REPAIR       0
        BINARY            0
        READ              721
        MUTATION          1262
        REQUEST_RESPONSE  196
  62. Causes: Excessive GC. Overloaded IO. Overloaded Node. Wide Reads / Large Batches.
  63. High Read Latency
  64. nodetool info:
        Token            : 113427455640312814857969558651062452225
        Gossip active    : true
        Thrift active    : true
        Load             : 291.13 GB
        Generation No    : 1368569510
        Uptime (seconds) : 1022629
        Heap Memory (MB) : 5213.01 / 8025.38
        Data Center      : 1
        Rack             : 20
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 13436862 hits, 16012159 requests, 0.907 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
  65. nodetool cfstats:
        Column Family: page_views
        SSTable count: 17
        Space used (live): 289942843592
        Space used (total): 289942843592
        Number of Keys (estimate): 1071416832
        Memtable Columns Count: 2041888
        Memtable Data Size: 539015124
        Memtable Switch Count: 83
        Read Count: 267059
        Read Latency: NaN ms.
        Write Count: 10516969
        Write Latency: 0.054 ms.
        Pending Tasks: 0
        Bloom Filter False Positives: 128586
        Bloom Filter False Ratio: 0.00000
        Bloom Filter Space Used: 802906184
        Compacted row minimum size: 447
        Compacted row maximum size: 3973
        Compacted row mean size: 867
  66. nodetool cfhistograms KS1 CF1:
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       178437    0              0             0         0
        2       20042     0              0             0         0
        3       15275     0              0             0         0
        4       11632     0              0             0         0
        5       4771      0              0             0         0
        6       4942      0              0             0         0
        7       5540      0              0             0         0
        8       4967      0              0             0         0
        10      10682     0              0             0         284155
        12      8355      0              0             0         15372508
        14      1961      0              0             0         137959096
        17      322       3              0             0         625733930
        20      61        253            0             0         252953547
        24      53        15114          0             0         39109718
        29      18        255730         0             0         0
        35      1         1532619        0             0         0
        ...
  67. nodetool cfhistograms KS1 CF1 (continued):
        Offset  SSTables  Write Latency  Read Latency  Row Size   Column Count
        446     0         120            233           0          0
        535     0         155            261           21361      0
        642     0         127            284           19082720   0
        770     0         88             218           498648801  0
        924     0         86             2699          504702186  0
        1109    0         22             3157          48714564   0
        1331    0         18             2818          241091     0
        1597    0         15             2155          2165       0
        1916    0         19             2098          7          0
        2299    0         10             1140          56         0
        2759    0         10             1281          0          0
        3311    0         6              1064          0          0
        3973    0         4              676           3          0
        ...
  68. jmx-term:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        #bean is set to org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        $>get BloomFilterFalseRatio
        #mbean = org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies:
        BloomFilterFalseRatio = 0.5693801541828607;
  69. Back to cfstats:
        Column Family: page_views
        Read Count: 270075
        Bloom Filter False Positives: 131294
  70. Cause: bloom_filter_fp_chance had been set to 0.1 to reduce memory requirements when storing 1+ billion rows per node.
  71. Fix: Changed read queries to select by column name to limit SSTables per query. Long term, migrate to Cassandra v1.2 for off-heap Bloom Filters.
  72. GC Problems
  73. WARN:
        WARN [ScheduledTasks:1] 2013-03-29 18:40:48,158 GCInspector.java (line 145) Heap is 0.9355130159566108 full. You may need to reduce memtable and/or cache sizes.
        INFO [ScheduledTasks:1] 2013-03-26 16:36:06,383 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 207 ms for 1 collections, 10105891032 used; max is 13591642112
        INFO [ScheduledTasks:1] 2013-03-28 22:18:17,113 GCInspector.java (line 122) GC for ParNew: 256 ms for 1 collections, 6504905688 used; max is 13591642112
  74. Serious GC Problems:
        INFO [ScheduledTasks:1] 2013-04-30 23:21:11,959 GCInspector.java (line 122) GC for ParNew: 1115 ms for 1 collections, 9355247296 used; max is 12801015808
  75. Flapping Node:
        INFO [GossipTasks:1] 2013-03-28 17:42:07,944 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:42:54,740 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipTasks:1] 2013-03-28 17:46:00,585 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:46:13,855 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipStage:1] 2013-03-28 17:48:48,966 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
  76. "GC problems are the result of workload and configuration." - Aaron Morton, Just Now.
  77. Workload Correlation? Look for wide rows, large writes, wide reads, un-bounded multi-row reads or writes.
  78. Compaction Correlation? Slow down compaction to improve stability. (Monitor and reverse when resolved.)
        concurrent_compactors: 2
        compaction_throughput_mb_per_sec: 8
        in_memory_compaction_limit_in_mb: 32
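     compaction_throughput_mb_per_sec can also be changed on a live node while experimenting; a sketch (the cassandra.yaml values above are still needed for the change to survive a restart):

        $ nodetool -h localhost setcompactionthroughput 8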
  79. GC Logging Insights: Slow down the rate of tenuring and enable full GC logging.
        HEAP_NEWSIZE="1200M"
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
  80. GC'ing Objects in ParNew:
        {Heap before GC invocations=7937 (full 205):
         par new generation total 1024000K, used 830755K ...)
         eden space 819200K, 100% used ...)
         from space 204800K, 5% used ...)
         to space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 8090240 bytes, 8090240 total
        - age 2: 565016 bytes, 8655256 total
        - age 3: 330152 bytes, 8985408 total
        - age 4: 657840 bytes, 9643248 total
  81. GC'ing Objects in ParNew:
        {Heap before GC invocations=7938 (full 205):
         par new generation total 1024000K, used 835015K ...)
         eden space 819200K, 100% used ...)
         from space 204800K, 7% used ...)
         to space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 1315072 bytes, 1315072 total
        - age 2: 541072 bytes, 1856144 total
        - age 3: 499432 bytes, 2355576 total
        - age 4: 316808 bytes, 2672384 total
  82. Cause: Nodes had wide rows, 1.3+ billion rows and 3+ GB of Bloom Filters. (Using the older bloom_filter_fp_chance of 0.000744.)
  83. Fix: Increased FP chance to 0.1 on one CF and 0.01 on others. (One CF reduced from 770MB to 170MB of Bloom Filters.)
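     bloom_filter_fp_chance is a per column family setting; a sketch of the change from cassandra-cli, with a hypothetical column family name (newly written SSTables pick up the new ratio; existing SSTables keep their filters until rewritten, for example by nodetool scrub or upgradesstables):

        [default@KS1] update column family page_views with bloom_filter_fp_chance = 0.01;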
  84. Fix: Increased index_interval from 128 to 512. (Increased key_cache_size_in_mb to 200.)
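     Both are cassandra.yaml settings in the 1.x line; a sketch with the values from this fix:

        index_interval: 512
        key_cache_size_in_mb: 200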
  85. Fix (cassandra-env.sh settings):
        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="1000M"
        -XX:SurvivorRatio=4
        -XX:MaxTenuringThreshold=2
  86. Anatomy of a Partition. (From a 1.0 cluster.)
  87. Node 23 Was Up:
        cassandra23# bin/nodetool -h localhost info
        Token            : 28356863910078205288614550619314017621
        Gossip active    : true
        Load             : 275.44 GB
        Generation No    : 1762556151
        Uptime (seconds) : 67548
        Heap Memory (MB) : 2926.44 / 8032.00
        Data Center      : DC1
        Rack             : RAC_unknown
        Exceptions       : 0
  88. Other Nodes Saw It Down:
        cassandra20# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Down    Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  89. And Node 23 Saw Them Up:
        cassandra23# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Up      Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  90. Still Available: Node 23 could serve requests at LOCAL_QUORUM, QUORUM and ALL Consistency. Other nodes could serve requests at LOCAL_QUORUM and QUORUM but not ALL Consistency.
  91. Relax: The application was up.
  92. Gossip?
        cassandra20# bin/nodetool -h localhost gossipinfo
        ...
        /10.29.60.10
          LOAD:2.98347080902E11
          STATUS:NORMAL,28356863910078205288614550619314017621
          RPC_ADDRESS:10.29.60.10
          SCHEMA:fe933880-19bd-11e1-0000-5ff37d368cb6
          RELEASE_VERSION:1.0.5
  93. Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
        TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 526912 greater than 7951 for /10.29.60.10
  94. More Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
        log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.37.114.10:1323736718:5830 /10.6.130.70:1323732220:10379 /10.29.60.12:1323733099:9493
        // Expected call to the FailureDetector
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10
  95. Cause: Generation is initialised at bootstrap to seconds past the Epoch. 1762556151 is Fri, 07 Nov 2025 22:55:51 GMT.
        cassandra23# bin/nodetool -h localhost info
        Generation No    : 1762556151
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /
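     The conversion is easy to check from a shell (GNU date shown; output formatting varies with locale):

        $ date -u -d @1762556151
        Fri Nov  7 22:55:51 UTC 2025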
  96. Fix:
        [default@system] get LocationInfo[L];
        => (column=ClusterName, value=737069, timestamp=1320437246450000)
        => (column=Generation, value=690e78f6, timestamp=1762556150811000)
  97. Platform, Tools, Problems, Maintenance.
  98. Maintenance: Expand to Multi DC.
  99. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  100. DC Aware Snitch? SimpleSnitch puts all nodes in rack1 and datacenter1.
  101. More Snitches? PropertyFileSnitch, RackInferringSnitch.
  102. Gossip Based Snitch? Ec2Snitch, Ec2MultiRegionSnitch, GossipingPropertyFileSnitch*
  103. Changing the Snitch: Do Not change the DC or Rack for an existing node. (Cassandra will not be able to find your data.)
  104. Moving to the GossipingPropertyFileSnitch: Update cassandra-topology.properties on existing nodes with the existing DC/Rack settings for all existing nodes. Set the default to the new DC.
  105. Moving to the GossipingPropertyFileSnitch: Update cassandra-rackdc.properties on existing nodes with the existing DC/Rack for the node.
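     A sketch of the two files on an existing node, assuming the existing DC is named datacenter1, the new DC new_dc, and illustrative addresses:

        # cassandra-topology.properties (all existing nodes listed; unknown nodes fall through to default)
        10.1.64.11=datacenter1:rack1
        10.1.64.78=datacenter1:rack1
        default=new_dc:rack1

        # cassandra-rackdc.properties (this node's own DC and rack)
        dc=datacenter1
        rack=rack1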
  106. Moving to the GossipingPropertyFileSnitch: Use a rolling restart to upgrade existing nodes to the GossipingPropertyFileSnitch.
  107. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  108. Got NTS? Must use NetworkTopologyStrategy for Multi DC deployments.
  109. SimpleStrategy: Order Token Ranges. Start with the range that contains the Row Key. Count to RF.
  110. SimpleStrategy. (Diagram: placing the row "foo".)
  111. NetworkTopologyStrategy: Order Token Ranges in the DC. Start with the range that contains the Row Key. Add the first unselected Token Range from each Rack. Repeat until RF ranges are selected.
  112. NetworkTopologyStrategy. (Diagram: placing the row "foo" across Rack 1, Rack 2 and Rack 3.)
  113. NetworkTopologyStrategy & 1 Rack. (Diagram: placing the row "foo" with a single Rack 1.)
  114. Changing the Replication Strategy: Be careful if the existing configuration has multiple Racks. (Cassandra may not be able to find your data.)
  115. Changing the Replication Strategy: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:0.
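     A sketch of that change in CQL 3 with a hypothetical keyspace name; the same statement is reused later with 'new_dc': 3 once the new nodes are running, and 'new_dc': 0 can simply be omitted at this stage:

        ALTER KEYSPACE "KS1" WITH replication =
          {'class': 'NetworkTopologyStrategy', 'datacenter1': 3, 'new_dc': 0};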
  116. Preparing The Client: Disable auto node discovery or use DC aware methods. Use LOCAL_QUORUM or EACH_QUORUM.
  117. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  118. Configuring New Nodes: Add auto_bootstrap: false to cassandra.yaml. Use the GossipingPropertyFileSnitch. Three Seeds from each DC. (Use cluster_name as a safety.)
  119. Configuring New Nodes: Update cassandra-rackdc.properties on new nodes with the new DC/Rack for the node. (Ignore cassandra-topology.properties.)
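     A sketch of the relevant cassandra.yaml fragment on a new node, plus its cassandra-rackdc.properties (cluster name, seed addresses and DC name are illustrative):

        cluster_name: 'Production'
        auto_bootstrap: false
        endpoint_snitch: GossipingPropertyFileSnitch
        seed_provider:
          - class_name: org.apache.cassandra.locator.SimpleSeedProvider
            parameters:
              - seeds: "10.1.64.11,10.1.66.8,10.1.64.201,10.20.0.11,10.20.0.12,10.20.0.13"

        # cassandra-rackdc.properties
        dc=new_dc
        rack=rack1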
  120. Start The New Nodes: New Nodes join the Ring in the new DC without data or traffic.
  121. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  122. Change the Replication Factor: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:3.
  123. Change the Replication Factor: New DC nodes will start receiving writes from old DC coordinators.
  124. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  125. Y U No Bootstrap? (Diagram: DC 1 and DC 2.)
  126. nodetool rebuild DC1. (Diagram: DC 1 and DC 2.)
  127. Rebuild Complete: New Nodes now performing Strong Consistency reads. (If EACH_QUORUM is used for writes.)
  128. Summary: Relax. Understand the Platform and the Tools. Always maintain Availability.
  129. Thanks.
  130. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
