Cassandra SF 2013 - In Case Of Emergency Break Glass

Statistics

Total Views: 2,593 (2,467 on SlideShare, 126 embedded)
Likes: 7
Downloads: 0
Comments: 0
Embeds: 1 (https://twitter.com, 126 views)

Upload Details: Uploaded as Adobe PDF
Usage Rights: © All Rights Reserved

Cassandra SF 2013 - In Case Of Emergency Break Glass Presentation Transcript

  • 1. CASSANDRA SUMMIT 2013: IN CASE OF EMERGENCY BREAK GLASS. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  • 2. About Me: Freelance Cassandra Consultant. Based in Wellington, New Zealand. Apache Cassandra Committer.
  • 3. Platform. Tools. Problems. Maintenance.
  • 4. The Platform
  • 5. The Platform & Clients
  • 6. The Platform & Running Clients
  • 7. The Platform & Reality: Consistency, Availability, Partition Tolerance.
  • 8. The Platform & Consistency: Strong Consistency (R + W > N). Eventual Consistency (R + W <= N).
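    A quick worked example of the R + W > N rule above (the numbers are illustrative, assuming RF = 3):
        N = 3 (replication factor)
        Write at QUORUM (W = 2) + Read at QUORUM (R = 2): 2 + 2 > 3, so every read overlaps the latest acknowledged write (strong).
        Write at ONE (W = 1) + Read at ONE (R = 1): 1 + 1 <= 3, so a read may miss the latest write until repair catches up (eventual).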
  • 9. What Price Consistency? In a multi-DC cluster, QUORUM and EACH_QUORUM involve cross-DC latency.
  • 10. The Platform & Availability: Maintain Consistency Level UP nodes for each Token Range.
  • 11. Best Case Failure with N=9 and RF 3: 100% Availability. (Diagram: Replica 1, Replica 2, Replica 3, Range A.)
  • 12. Worst Case Failure with N=9 and RF 3: 78% Availability. (Diagram: Range A, Range B.)
  • 13. The Platform & Partition Tolerance: A failed node does not create a partition.
  • 14. The Platform & Partition Tolerance. (Diagram.)
  • 15. The Platform & Partition Tolerance: Partitions occur when the network fails.
  • 16. The Platform & Partition Tolerance. (Diagram.)
  • 17. The Storage Engine: Optimised for Writes.
  • 18. Write Path: Append to Write Ahead Log. (fsync every 10s by default; other options available.)
  • 19. Write Path: Merge new Columns into Memtable. (Lock free, always in memory.)
  • 20. Write Path... Later: Asynchronously flush Memtable to a new SSTable on disk. (May be 10's or 100's of MB in size.)
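    For reference, the fsync behaviour mentioned on slide 18 is controlled in cassandra.yaml; a minimal sketch of the periodic default and the batch alternative (values shown are the stock defaults of the era):
        # cassandra.yaml
        commitlog_sync: periodic
        commitlog_sync_period_in_ms: 10000
        # alternative: fsync before acknowledging writes
        # commitlog_sync: batch
        # commitlog_sync_batch_window_in_ms: 50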
  • 21. SSTable Files: *-Data.db, *-Index.db, *-Filter.db (and others).
  • 22. Row Fragmentation. (Diagram: the row "foo" is spread across SSTables.)
        SSTable 1: foo { dishwasher (ts 10): tomato, purple (ts 10): cromulent }
        SSTable 2: foo { frink (ts 20): flayven, monkey (ts 10): embiggins }
        SSTable 3: (no fragment of foo)
        SSTable 4: foo { dishwasher (ts 15): tomacco }
        SSTable 5: (no fragment of foo)
  • 23. Read Path: Read columns from each SSTable, then merge results. (Roughly speaking.)
  • 24. Read Path: Use the Bloom Filter to determine if a row key does not exist in an SSTable. (In memory.)
  • 25. Read Path: Search for the prior key in the *-Index.db sample. (In memory.)
  • 26. Read Path: Scan *-Index.db from the prior key to find the search key and its *-Data.db offset. (On disk.)
  • 27. Read Path: Read *-Data.db from the offset, all columns or specific pages.
  • 28. Read purple, monkey, dishwasher. (Diagram: per-SSTable Bloom Filters and Index Samples in memory; *-Index.db and *-Data.db files on disk, holding the "foo" fragments shown on slide 22.)
  • 29. Read With Key Cache. (Diagram: as slide 28, with a Key Cache per SSTable in memory alongside the Bloom Filters and Index Samples.)
  • 30. Read with Row Cache. (Diagram: as slide 29, with a Row Cache in memory in front of the SSTables.)
  • 31. Performant Reads: Design queries to read from a small number of SSTables.
  • 32. Performant Reads: Read a small number of named columns or a slice of columns.
  • 33. Performant Reads: Design the data model to support current application requirements.
  • 34. Platform. Tools. Problems. Maintenance.
  • 35. Logging: Configure via log4j-server.properties and StorageServiceMBean.
  • 36. DEBUG Logging For One Class:
        log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG
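    The same change can be made at runtime through the StorageServiceMBean from slide 35; a sketch using jmxterm, assuming the setLog4jLevel operation exposed by 1.x-era Cassandra (check with the info command first):
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        $>open localhost:7199
        $>bean org.apache.cassandra.db:type=StorageService
        $>run setLog4jLevel org.apache.cassandra.thrift.CassandraServer DEBUG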
  • 37. Reading Logs:
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace=KS1, ColumnFamily=CF1) (estimated 403858136 bytes)
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 ColumnFamilyStore.java (line 634) Enqueuing flush of Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
        INFO [FlushWriter:42] 2013-04-20 14:03:50,788 Memtable.java (line 266) Writing Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
  • 38. GC Logs: cassandra-env.sh
        # GC logging options -- uncomment to enable
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
        # JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
        # JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"
  • 39. ParNew GC Starting:
        {Heap before GC invocations=224115 (full 111):
         par new generation total 873856K, used 717289K ...)
          eden space 699136K, 100% used ...)
          from space 174720K, 10% used ...)
          to   space 174720K, 0% used ...)
  • 40. Tenuring Distribution:
        240217.053: [ParNew
        Desired survivor size 89456640 bytes, new threshold 4 (max 4)
        - age 1: 22575936 bytes, 22575936 total
        - age 2: 350616 bytes, 22926552 total
        - age 3: 4380888 bytes, 27307440 total
        - age 4: 1155104 bytes, 28462544 total
  • 41. ParNew GC Finishing:
        Heap after GC invocations=224116 (full 111):
         par new generation total 873856K, used 31291K ...)
          eden space 699136K, 0% used ...)
          from space 174720K, 17% used ...)
          to   space 174720K, 0% used ...)
  • 42. nodetool info:
        Token            : 0
        Gossip active    : true
        Load             : 130.64 GB
        Generation No    : 1369334297
        Uptime (seconds) : 29438
        Heap Memory (MB) : 3744.27 / 8025.38
        Data Center      : east
        Rack             : rack1
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 25364985 hits, 34874180 requests, 0.734 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 ...
  • 43. nodetool ring:
        Note: Ownership information does not include topology, please specify a keyspace.
        Address       DC    Rack   Status  State   Load       Owns    Token
        10.1.64.11    east  rack1  Up      Normal  130.64 GB  12.50%  0
        10.1.65.8     west  rack1  Up      Normal  88.79 GB   0.00%   1
        10.1.64.78    east  rack1  Up      Normal  52.66 GB   12.50%  212...216
        10.1.65.181   west  rack1  Up      Normal  65.99 GB   0.00%   212...217
        10.1.66.8     east  rack1  Up      Normal  64.38 GB   12.50%  425...432
        10.1.65.178   west  rack1  Up      Normal  77.94 GB   0.00%   425...433
        10.1.64.201   east  rack1  Up      Normal  56.42 GB   12.50%  638...648
        10.1.65.59    west  rack1  Up      Normal  74.5 GB    0.00%   638...649
        10.1.64.235   east  rack1  Up      Normal  79.68 GB   12.50%  850...864
        10.1.65.16    west  rack1  Up      Normal  62.05 GB   0.00%   850...865
        10.1.66.227   east  rack1  Up      Normal  106.73 GB  12.50%  106...080
        10.1.65.226   west  rack1  Up      Normal  79.26 GB   0.00%   106...081
        10.1.66.247   east  rack1  Up      Normal  66.68 GB   12.50%  127...295
        10.1.65.19    west  rack1  Up      Normal  102.45 GB  0.00%   127...297
        10.1.66.141   east  rack1  Up      Normal  53.72 GB   12.50%  148...512
        10.1.65.253   west  rack1  Up      Normal  54.25 GB   0.00%   148...513
  • 44. nodetool ring KS1:
        Address       DC    Rack   Status  State   Load       Effective-Ownership  Token
        10.1.64.11    east  rack1  Up      Normal  130.72 GB  12.50%               0
        10.1.65.8     west  rack1  Up      Normal  88.81 GB   12.50%               1
        10.1.64.78    east  rack1  Up      Normal  52.68 GB   12.50%               212...216
        10.1.65.181   west  rack1  Up      Normal  66.01 GB   12.50%               212...217
        10.1.66.8     east  rack1  Up      Normal  64.4 GB    12.50%               425...432
        10.1.65.178   west  rack1  Up      Normal  77.96 GB   12.50%               425...433
        10.1.64.201   east  rack1  Up      Normal  56.44 GB   12.50%               638...648
        10.1.65.59    west  rack1  Up      Normal  74.57 GB   12.50%               638...649
        10.1.64.235   east  rack1  Up      Normal  79.72 GB   12.50%               850...864
        10.1.65.16    west  rack1  Up      Normal  62.12 GB   12.50%               850...865
        10.1.66.227   east  rack1  Up      Normal  106.72 GB  12.50%               106...080
        10.1.65.226   west  rack1  Up      Normal  79.28 GB   12.50%               106...081
        10.1.66.247   east  rack1  Up      Normal  66.73 GB   12.50%               127...295
        10.1.65.19    west  rack1  Up      Normal  102.47 GB  12.50%               127...297
        10.1.66.141   east  rack1  Up      Normal  53.75 GB   12.50%               148...512
        10.1.65.253   west  rack1  Up      Normal  54.24 GB   12.50%               148...513
  • 45. nodetool status:
        $ nodetool status
        Datacenter: ams01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address        Load      Tokens  Owns   Host ID                               Rack
        UN  10.70.48.23    38.38 GB  256     19.0%  7c5fdfad-63c6-4f37-bb9f-a66271aa3423  RAC1
        UN  10.70.6.78     58.13 GB  256     18.3%  94e7f48f-d902-4d4a-9b87-81ccd6aa9e65  RAC1
        UN  10.70.47.126   53.89 GB  256     19.4%  f36f1f8c-1956-4850-8040-b58273277d83  RAC1
        Datacenter: wdc01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address        Load      Tokens  Owns   Host ID                               Rack
        UN  10.24.116.66   65.81 GB  256     22.1%  f9dba004-8c3d-4670-94a0-d301a9b775a8  RAC1
        UN  10.55.104.90   63.31 GB  256     21.2%  4746f1bd-85e1-4071-ae5e-9c5baac79469  RAC1
        UN  10.55.104.27   62.71 GB  256     21.2%  1a55cfd4-bb30-4250-b868-a9ae13d81ae1  RAC1
  • 46. nodetool cfstats:
        Keyspace: KS1
          Column Family: CF1
          SSTable count: 11
          Space used (live): 32769179336
          Space used (total): 32769179336
          Number of Keys (estimate): 73728
          Memtable Columns Count: 1069137
          Memtable Data Size: 216442624
          Memtable Switch Count: 3
          Read Count: 95
          Read Latency: NaN ms.
          Write Count: 1039417
          Write Latency: 0.068 ms.
          Bloom Filter False Positives: 345
          Bloom Filter False Ratio: 0.00000
          Bloom Filter Space Used: 230096
          Compacted row minimum size: 150
          Compacted row maximum size: 322381140
          Compacted row mean size: 2072156
  • 47. nodetool cfhistograms:
        $ nodetool cfhistograms KS1 CF1
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       67264     0              0             0         1331591
        2       19512     0              0             0         4241686
        3       35529     0              0             0         474784
        ...
        10      10299     1150           0             0         21768
        12      5475      3569           0             0         3993135
        14      1986      9098           0             0         1434778
        17      258       30916          0             0         366895
        20      0         52980          0             0         186524
        24      0         104463         0             0         25439063
        ...
        179     0         93             1823          1597      1284167
        215     0         84             3880          1231655   1147150
        258     0         170            5164          209282    956487
  • 48. nodetool proxyhistograms:
        $ nodetool proxyhistograms
        Offset  Read Latency  Write Latency  Range Latency
        60      0             15             0
        72      0             51             0
        86      0             241            0
        103     2             2003           0
        124     9             5798           0
        149     67            7348           0
        179     222           6453           0
        215     184           6071           0
        258     134           5436           0
        310     104           4936           0
        372     89            4997           0
        446     39            6383           0
        535     76797         7518           0
        642     9364748       96065          0
        770     16406421      152663         0
        924     7429538       97612          0
        1109    6781835       176829         0
  • 49. JMX via JConsole. (Screenshot.)
  • 50. JMX via MX4J. (Screenshot.)
  • 51. JMX via JMXTERM:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:type=StorageService
        #bean is set to org.apache.cassandra.db:type=StorageService
        $>info
        #mbean = org.apache.cassandra.db:type=StorageService
        #class name = org.apache.cassandra.service.StorageService
        # attributes
          %0 - AllDataFileLocations ([Ljava.lang.String;, r)
          %1 - CommitLogLocation (java.lang.String, r)
          %2 - CompactionThroughputMbPerSec (int, rw)
        ...
        # operations
          %1 - void bulkLoad(java.lang.String p1)
          %2 - void clearSnapshot(java.lang.String p1, [Ljava.lang.String; p2)
          %3 - void decommission()
  • 52. JVM Heap Dump via JMAP:
        jmap -dump:format=b,file=heap.bin pid
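    A concrete invocation might look like the following; the pgrep pattern is an assumption, and dumping only live objects keeps the file smaller:
        $ jmap -dump:live,format=b,file=/tmp/cassandra-heap.bin $(pgrep -f CassandraDaemon)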
  • 53. JVM Heap Dump with YourKit.
  • 54. Platform. Tools. Problems. Maintenance.
  • 55. Corrupt SSTable. (Very rare.)
  • 56. Compaction Error:
        ERROR [CompactionExecutor:36] 2013-04-29 07:50:49,060 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[CompactionExecutor:36,1,main]
        java.lang.RuntimeException: Last written key DecoratedKey(138024912283272996716128964353306009224, 61386330356130622d616666362d376330612d666531662d373738616630636265396535) >= current key DecoratedKey(127065377405949402743383718901402082101, 64323962636163652d646561372d333039322d386166322d663064346132363963386131) writing into *-tmp-hf-7372-Data.db
          at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
          at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
          at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:160)
          at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
          at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
  • 57. Cause: Change in KeyValidator or bug in older versions.
  • 58. Fix: nodetool scrub
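    Scrub can be limited to the affected keyspace and column family rather than the whole node; a sketch with the placeholder names KS1/CF1 (snapshot first as a precaution):
        $ nodetool snapshot KS1            # keep a copy of the current SSTables
        $ nodetool scrub KS1 CF1           # rewrite only the affected column family
        $ nodetool compactionstats         # watch scrub progress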
  • 59. Dropped Messages
  • 60. Logs:
        MessagingService.java (line 658) 173 READ messages dropped in last 5000ms
        StatusLogger.java (line 57) Pool Name              Active  Pending
        StatusLogger.java (line 72) ReadStage              32      284
        StatusLogger.java (line 72) RequestResponseStage   1       254
        StatusLogger.java (line 72) ReadRepairStage        0       0
  • 61. nodetool tpstats:
        Message type      Dropped
        RANGE_SLICE       0
        READ_REPAIR       0
        BINARY            0
        READ              721
        MUTATION          1262
        REQUEST_RESPONSE  196
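    The dropped-message counters are cumulative since startup, so it helps to poll them and watch for changes; one simple sketch:
        $ watch -d -n 10 'nodetool tpstats'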
  • 62. Causes: Excessive GC. Overloaded IO. Overloaded Node. Wide Reads / Large Batches.
  • 63. High Read Latency
  • 64. nodetool info:
        Token            : 113427455640312814857969558651062452225
        Gossip active    : true
        Thrift active    : true
        Load             : 291.13 GB
        Generation No    : 1368569510
        Uptime (seconds) : 1022629
        Heap Memory (MB) : 5213.01 / 8025.38
        Data Center      : 1
        Rack             : 20
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 13436862 hits, 16012159 requests, 0.907 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
  • 65. nodetool cfstats:
        Column Family: page_views
        SSTable count: 17
        Space used (live): 289942843592
        Space used (total): 289942843592
        Number of Keys (estimate): 1071416832
        Memtable Columns Count: 2041888
        Memtable Data Size: 539015124
        Memtable Switch Count: 83
        Read Count: 267059
        Read Latency: NaN ms.
        Write Count: 10516969
        Write Latency: 0.054 ms.
        Pending Tasks: 0
        Bloom Filter False Positives: 128586
        Bloom Filter False Ratio: 0.00000
        Bloom Filter Space Used: 802906184
        Compacted row minimum size: 447
        Compacted row maximum size: 3973
        Compacted row mean size: 867
  • 66. nodetool cfhistograms KS1 CF1:
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       178437    0              0             0         0
        2       20042     0              0             0         0
        3       15275     0              0             0         0
        4       11632     0              0             0         0
        5       4771      0              0             0         0
        6       4942      0              0             0         0
        7       5540      0              0             0         0
        8       4967      0              0             0         0
        10      10682     0              0             0         284155
        12      8355      0              0             0         15372508
        14      1961      0              0             0         137959096
        17      322       3              0             0         625733930
        20      61        253            0             0         252953547
        24      53        15114          0             0         39109718
        29      18        255730         0             0         0
        35      1         1532619        0             0         0
        ...
  • 67. nodetool cfhistograms KS1 CF1 (continued):
        Offset  SSTables  Write Latency  Read Latency  Row Size   Column Count
        446     0         120            233           0          0
        535     0         155            261           21361      0
        642     0         127            284           19082720   0
        770     0         88             218           498648801  0
        924     0         86             2699          504702186  0
        1109    0         22             3157          48714564   0
        1331    0         18             2818          241091     0
        1597    0         15             2155          2165       0
        1916    0         19             2098          7          0
        2299    0         10             1140          56         0
        2759    0         10             1281          0          0
        3311    0         6              1064          0          0
        3973    0         4              676           3          0
        ...
  • 68. jmx-term:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        #bean is set to org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        $>get BloomFilterFalseRatio
        #mbean = org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies:
        BloomFilterFalseRatio = 0.5693801541828607;
  • 69. Back to cfstats:
        Column Family: page_views
        Read Count: 270075
        Bloom Filter False Positives: 131294
  • 70. Cause: bloom_filter_fp_chance had been set to 0.1 to reduce memory requirements when storing 1+ billion rows per node.
  • 71. Fix: Changed read queries to select by column name to limit SSTables per query. Long term, migrate to Cassandra v1.2 for off-heap Bloom Filters.
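    As an illustration of "select by column name" (the row key and column names are made up for this sketch), a named-column read in cassandra-cli asks for specific columns instead of merging an entire wide row:
        [default@KS1] get page_views['user123']['2013-06-10'];
        [default@KS1] get page_views['user123']['2013-06-11'];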
  • 72. GC Problems
  • 73. WARN:
        WARN [ScheduledTasks:1] 2013-03-29 18:40:48,158 GCInspector.java (line 145) Heap is 0.9355130159566108 full. You may need to reduce memtable and/or cache sizes.
        INFO [ScheduledTasks:1] 2013-03-26 16:36:06,383 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 207 ms for 1 collections, 10105891032 used; max is 13591642112
        INFO [ScheduledTasks:1] 2013-03-28 22:18:17,113 GCInspector.java (line 122) GC for ParNew: 256 ms for 1 collections, 6504905688 used; max is 13591642112
  • 74. Serious GC Problems:
        INFO [ScheduledTasks:1] 2013-04-30 23:21:11,959 GCInspector.java (line 122) GC for ParNew: 1115 ms for 1 collections, 9355247296 used; max is 12801015808
  • 75. Flapping Node:
        INFO [GossipTasks:1] 2013-03-28 17:42:07,944 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:42:54,740 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipTasks:1] 2013-03-28 17:46:00,585 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:46:13,855 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipStage:1] 2013-03-28 17:48:48,966 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
  • 76. "GC Problems are the result of workload and configuration." Aaron Morton, just now.
  • 77. Workload Correlation? Look for wide rows, large writes, wide reads, un-bounded multi-row reads or writes.
  • 78. Compaction Correlation? Slow down Compaction to improve stability:
        concurrent_compactors: 2
        compaction_throughput_mb_per_sec: 8
        in_memory_compaction_limit_in_mb: 32
        (Monitor and reverse when resolved.)
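    The throughput part of this can also be changed at runtime while diagnosing, then restored; a sketch (16 MB/s was the stock default of the era):
        $ nodetool setcompactionthroughput 8     # throttle compaction while investigating
        $ nodetool setcompactionthroughput 16    # restore the default once stable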
  • 79. GC Logging Insights: Slow down the rate of tenuring and enable full GC logging.
        HEAP_NEWSIZE="1200M"
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
  • 80. GC'ing Objects in ParNew:
        {Heap before GC invocations=7937 (full 205):
         par new generation total 1024000K, used 830755K ...)
          eden space 819200K, 100% used ...)
          from space 204800K, 5% used ...)
          to   space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 8090240 bytes, 8090240 total
        - age 2: 565016 bytes, 8655256 total
        - age 3: 330152 bytes, 8985408 total
        - age 4: 657840 bytes, 9643248 total
  • 81. GC'ing Objects in ParNew:
        {Heap before GC invocations=7938 (full 205):
         par new generation total 1024000K, used 835015K ...)
          eden space 819200K, 100% used ...)
          from space 204800K, 7% used ...)
          to   space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 1315072 bytes, 1315072 total
        - age 2: 541072 bytes, 1856144 total
        - age 3: 499432 bytes, 2355576 total
        - age 4: 316808 bytes, 2672384 total
  • 82. Cause: Nodes had wide rows, 1.3+ billion rows and 3+ GB of Bloom Filters. (Using the older bloom_filter_fp_chance of 0.000744.)
  • 83. Fix: Increased FP chance to 0.1 on one CF and 0.01 on others. (One CF reduced from 770MB to 170MB of Bloom Filters.)
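    On a 1.1/1.2-era cluster that change could be applied per column family from cassandra-cli; a sketch with placeholder KS1/CF1 names (existing SSTables only pick up the new filter once they are rewritten):
        [default@unknown] use KS1;
        [default@KS1] update column family CF1 with bloom_filter_fp_chance = 0.1;
        $ nodetool upgradesstables KS1 CF1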
  • 84. Fix: Increased index_interval from 128 to 512. (Increased key_cache_size_in_mb to 200.)
  • 85. Fix:
        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="1000M"
        -XX:SurvivorRatio=4
        -XX:MaxTenuringThreshold=2
  • 86. Anatomy of a Partition. (From a 1.0 cluster.)
  • 87. Node 23 Was Up:
        cassandra23# bin/nodetool -h localhost info
        Token            : 28356863910078205288614550619314017621
        Gossip active    : true
        Load             : 275.44 GB
        Generation No    : 1762556151
        Uptime (seconds) : 67548
        Heap Memory (MB) : 2926.44 / 8032.00
        Data Center      : DC1
        Rack             : RAC_unknown
        Exceptions       : 0
  • 88. Other Nodes Saw It Down:
        cassandra20# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Down    Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  • 89. And Node 23 Saw Them Up:
        cassandra23# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Up      Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  • 90. Still Available: Node 23 could serve requests at LOCAL_QUORUM, QUORUM and ALL Consistency. Other nodes could serve requests at LOCAL_QUORUM and QUORUM but not ALL Consistency.
  • 91. Relax. The application was up.
  • 92. Gossip?
        cassandra20# bin/nodetool -h localhost gossipinfo
        ...
        /10.29.60.10
          LOAD:2.98347080902E11
          STATUS:NORMAL,28356863910078205288614550619314017621
          RPC_ADDRESS:10.29.60.10
          SCHEMA:fe933880-19bd-11e1-0000-5ff37d368cb6
          RELEASE_VERSION:1.0.5
  • 93. Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
        TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 526912 greater than 7951 for /10.29.60.10
  • 94. More Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
        log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.37.114.10:1323736718:5830 /10.6.130.70:1323732220:10379 /10.29.60.12:1323733099:9493
        // Expected call to the FailureDetector
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10
  • 95. Cause: Generation is initialised at bootstrap to seconds past the Epoch, and 1762556151 is Fri, 07 Nov 2025 22:55:51 GMT.
        cassandra23# bin/nodetool -h localhost info
        Generation No : 1762556151
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /...
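    The conversion is easy to confirm from a shell (GNU date shown; on BSD/macOS use date -u -r 1762556151):
        $ date -u -d @1762556151
        Fri Nov  7 22:55:51 UTC 2025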
  • 96. Fix:
        [default@system] get LocationInfo[L];
        => (column=ClusterName, value=737069, timestamp=1320437246450000)
        => (column=Generation, value=690e78f6, timestamp=1762556150811000)
  • 97. Platform. Tools. Problems. Maintenance.
  • 98. Maintenance: Expand to Multi DC.
  • 99. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  • 100. DC Aware Snitch? SimpleSnitch puts all nodes in rack1 and datacenter1.
  • 101. More Snitches? PropertyFileSnitch, RackInferringSnitch.
  • 102. Gossip Based Snitch? Ec2Snitch, Ec2MultiRegionSnitch, GossipingPropertyFileSnitch*.
  • 103. Changing the Snitch: Do Not change the DC or Rack for an existing node. (Cassandra will not be able to find your data.)
  • 104. Moving to the GossipingPropertyFileSnitch: Update cassandra-topology.properties on existing nodes with the existing DC/Rack settings for all existing nodes. Set the default to the new DC.
  • 105. Moving to the GossipingPropertyFileSnitch: Update cassandra-rackdc.properties on existing nodes with the existing DC/Rack for the node.
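    A sketch of what the two files might look like on an existing node (DC, rack and IP values are illustrative; GossipingPropertyFileSnitch reads cassandra-rackdc.properties for the local node and falls back to cassandra-topology.properties for nodes not yet upgraded):
        # cassandra-topology.properties (existing nodes keep their existing DC/Rack; default points at the new DC)
        10.1.64.11=datacenter1:rack1
        10.1.65.8=datacenter1:rack1
        default=new_dc:rack1

        # cassandra-rackdc.properties (this node's own DC/Rack)
        dc=datacenter1
        rack=rack1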
  • 106. Moving to the GossipingPropertyFileSnitch: Use a rolling restart to upgrade existing nodes to the GossipingPropertyFileSnitch.
  • 107. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  • 108. Got NTS? Must use NetworkTopologyStrategy for Multi DC deployments.
  • 109. SimpleStrategy: Order the Token Ranges. Start with the range that contains the Row Key. Count to RF.
  • 110. SimpleStrategy: "foo". (Diagram.)
  • 111. NetworkTopologyStrategy: Order the Token Ranges in the DC. Start with the range that contains the Row Key. Add the first unselected Token Range from each Rack. Repeat until RF selected.
  • 112. NetworkTopologyStrategy: "foo". (Diagram: Rack 1, Rack 2, Rack 3.)
  • 113. NetworkTopologyStrategy & 1 Rack: "foo". (Diagram: Rack 1.)
  • 114. Changing the Replication Strategy: Be careful if the existing configuration has multiple Racks. (Cassandra may not be able to find your data.)
  • 115. Changing the Replication Strategy: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:0.
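    One way to make that change on a 1.1/1.2-era cluster is from cassandra-cli; a sketch with a placeholder keyspace name:
        [default@unknown] update keyspace KS1
            with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
            and strategy_options = {datacenter1:3, new_dc:0};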
  • 116. Preparing The Client: Disable auto node discovery or use DC aware methods. Use LOCAL_QUORUM or EACH_QUORUM.
  • 117. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  • 118. Configuring New Nodes: Add auto_bootstrap: false to cassandra.yaml. Use the GossipingPropertyFileSnitch. Three Seeds from each DC. (Use cluster_name as a safety.)
  • 119. Configuring New Nodes: Update cassandra-rackdc.properties on the new nodes with the new DC/Rack for the node. (Ignore cassandra-topology.properties.)
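    Pulling slides 118 and 119 together, a minimal sketch of the relevant settings on a new node (cluster name, seed IPs and DC/rack names are placeholders):
        # cassandra.yaml
        cluster_name: 'ProdCluster'
        auto_bootstrap: false
        endpoint_snitch: GossipingPropertyFileSnitch
        seed_provider:
            - class_name: org.apache.cassandra.locator.SimpleSeedProvider
              parameters:
                  - seeds: "10.1.64.11,10.1.64.78,10.1.66.8,10.2.0.11,10.2.0.12,10.2.0.13"

        # cassandra-rackdc.properties
        dc=new_dc
        rack=rack1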
  • 120. Start The New Nodes: New Nodes join the Ring in the new DC without data or traffic.
  • 121. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  • 122. Change the Replication Factor: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:3.
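    The same cassandra-cli pattern as before, now giving the new DC a full replica count (placeholder keyspace name again):
        [default@unknown] update keyspace KS1
            with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
            and strategy_options = {datacenter1:3, new_dc:3};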
  • 123. Change the Replication Factor: New DC nodes will start receiving writes from old DC coordinators.
  • 124. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  • 125. Y U No Bootstrap? (Diagram: DC 1, DC 2.)
  • 126. nodetool rebuild DC1. (Diagram: DC 1, DC 2.)
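    rebuild is run on each node in the new DC, naming the existing DC as the streaming source; a sketch with placeholder host names:
        $ nodetool -h new-dc-node1 rebuild DC1
        $ nodetool -h new-dc-node2 rebuild DC1
        $ nodetool -h new-dc-node3 rebuild DC1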
  • 127. Rebuild Complete: New Nodes now performing Strong Consistency reads. (If EACH_QUORUM used for writes.)
  • 128. Summary: Relax. Understand the Platform and the Tools. Always maintain Availability.
  • 129. Thanks.
  • 130. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.