Cassandra SF 2013 - In Case Of Emergency Break Glass

  1. CASSANDRA SUMMIT 2013 - IN CASE OF EMERGENCY BREAK GLASS. Aaron Morton, @aaronmorton, www.thelastpickle.com. #Cassandra13. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  2. About Me: Freelance Cassandra Consultant. Based in Wellington, New Zealand. Apache Cassandra Committer.
  3. Platform, Tools, Problems, Maintenance.
  4. The Platform
  5. The Platform & Clients
  6. The Platform & Running Clients
  7. The Platform & Reality: Consistency, Availability, Partition Tolerance.
  8. The Platform & Consistency: Strong Consistency (R + W > N); Eventual Consistency (R + W <= N).
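     For example, with replication factor N = 3: writing at QUORUM (W = 2) and reading at QUORUM (R = 2) gives 2 + 2 > 3, so every read overlaps at least one replica that acknowledged the write (strong consistency). Writing and reading at ONE gives 1 + 1 <= 3, so a read may miss the latest write until replication catches up (eventual consistency).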
  9. What Price Consistency? In a multi-DC cluster QUORUM and EACH_QUORUM involve cross-DC latency.
  10. The Platform & Availability: Maintain Consistency Level UP nodes for each Token Range.
  11. Best Case Failure with N=9 and RF 3: 100% Availability. (Diagram: Range A with Replica 1, Replica 2, Replica 3.)
  12. Worst Case Failure with N=9 and RF 3: 78% Availability. (Diagram: Range A and Range B.)
  13. The Platform & Partition Tolerance: A failed node does not create a partition.
  14. The Platform & Partition Tolerance. (Diagram.)
  15. The Platform & Partition Tolerance: Partitions occur when the network fails.
  16. The Platform & Partition Tolerance. (Diagram.)
  17. The Storage Engine: Optimised for Writes.
  18. Write Path: Append to the Write Ahead Log. (fsync every 10s by default, other options available.)
  19. Write Path: Merge new Columns into the Memtable. (Lock free, always in memory.)
  20. Write Path... Later: Asynchronously flush the Memtable to a new SSTable on disk. (May be 10's or 100's of MB in size.)
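     The commit log fsync behaviour mentioned on slide 18 is controlled in cassandra.yaml. A minimal sketch of the two modes for the 1.x line (values shown are illustrative):

        # periodic (default): acknowledge writes immediately, fsync the log on a timer
        commitlog_sync: periodic
        commitlog_sync_period_in_ms: 10000

        # batch (alternative): fsync before acknowledging, grouping writes into a window
        # commitlog_sync: batch
        # commitlog_sync_batch_window_in_ms: 50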
  21. SSTable Files: *-Data.db, *-Index.db, *-Filter.db (and others).
  22. Row Fragmentation. (Diagram: the row foo spread over several SSTables.)
        SSTable 1: foo { dishwasher (ts 10): tomato, purple (ts 10): cromulent }
        SSTable 2: foo { frink (ts 20): flayven, monkey (ts 10): embiggins }
        SSTable 3: (no fragment of foo)
        SSTable 4: foo { dishwasher (ts 15): tomacco }
        SSTable 5: (no fragment of foo)
  23. Read Path: Read columns from each SSTable, then merge results. (Roughly speaking.)
  24. Read Path: Use the Bloom Filter to determine if a row key does not exist in a SSTable. (In memory.)
  25. Read Path: Search for the prior key in the *-Index.db sample. (In memory.)
  26. Read Path: Scan *-Index.db from the prior key to find the search key and its *-Data.db offset. (On disk.)
  27. Read Path: Read *-Data.db from the offset, all columns or specific pages.
  28. Read purple, monkey, dishwasher. (Diagram: a Bloom Filter and Index Sample in memory for each SSTable; *-Index.db and *-Data.db on disk; the row foo fragmented as on slide 22.)
  29. Read With Key Cache. (Diagram: as slide 28, with a Key Cache in memory in front of each *-Index.db.)
  30. Read with Row Cache. (Diagram: as slide 29, with a Row Cache in memory in front of the SSTables.)
  31. Performant Reads: Design queries to read from a small number of SSTables.
  32. Performant Reads: Read a small number of named columns or a slice of columns.
  33. Performant Reads: Design the data model to support current application requirements.
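     As an illustration of reading a few named columns and a bounded slice (hypothetical CQL 3 tables and columns, not from the deck):

        SELECT first_name, last_name FROM users WHERE user_id = 'foo';
        SELECT event_time, payload FROM timeline
         WHERE user_id = 'foo' AND event_time >= '2013-06-01' LIMIT 100;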
  34. Platform, Tools, Problems, Maintenance.
  35. Logging: Configure via log4j-server.properties and the StorageServiceMBean.
  36. DEBUG Logging For One Class:
        log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG
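     The same level can also be changed on a running node through the StorageServiceMBean; a sketch using jmxterm, assuming the setLog4jLevel operation exposed by the 1.x StorageService MBean:

        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        $>open localhost:7199
        $>bean org.apache.cassandra.db:type=StorageService
        $>run setLog4jLevel org.apache.cassandra.thrift.CassandraServer DEBUG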
  37. Reading Logs:
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace=KS1, ColumnFamily=CF1) (estimated 403858136 bytes)
        INFO [OptionalTasks:1] 2013-04-20 14:03:50,787 ColumnFamilyStore.java (line 634) Enqueuing flush of Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
        INFO [FlushWriter:42] 2013-04-20 14:03:50,788 Memtable.java (line 266) Writing Memtable-CF1@1333396270(145839277/403858136 serialized/live bytes, 1742365 ops)
  38. GC Logs (cassandra-env.sh):
        # GC logging options -- uncomment to enable
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
        # JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
        # JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
        # JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"
  39. ParNew GC Starting:
        {Heap before GC invocations=224115 (full 111):
         par new generation total 873856K, used 717289K ...)
         eden space 699136K, 100% used ...)
         from space 174720K, 10% used ...)
         to space 174720K, 0% used ...)
  40. Tenuring Distribution:
        240217.053: [ParNew
        Desired survivor size 89456640 bytes, new threshold 4 (max 4)
        - age 1: 22575936 bytes, 22575936 total
        - age 2: 350616 bytes, 22926552 total
        - age 3: 4380888 bytes, 27307440 total
        - age 4: 1155104 bytes, 28462544 total
  41. ParNew GC Finishing:
        Heap after GC invocations=224116 (full 111):
         par new generation total 873856K, used 31291K ...)
         eden space 699136K, 0% used ...)
         from space 174720K, 17% used ...)
         to space 174720K, 0% used ...)
  42. nodetool info:
        Token            : 0
        Gossip active    : true
        Load             : 130.64 GB
        Generation No    : 1369334297
        Uptime (seconds) : 29438
        Heap Memory (MB) : 3744.27 / 8025.38
        Data Center      : east
        Rack             : rack1
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 25364985 hits, 34874180 requests, 0.734 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 ...
  43. nodetool ring:
        Note: Ownership information does not include topology, please specify a keyspace.
        Address      DC    Rack   Status  State   Load       Owns    Token
        10.1.64.11   east  rack1  Up      Normal  130.64 GB  12.50%  0
        10.1.65.8    west  rack1  Up      Normal  88.79 GB   0.00%   1
        10.1.64.78   east  rack1  Up      Normal  52.66 GB   12.50%  212...216
        10.1.65.181  west  rack1  Up      Normal  65.99 GB   0.00%   212...217
        10.1.66.8    east  rack1  Up      Normal  64.38 GB   12.50%  425...432
        10.1.65.178  west  rack1  Up      Normal  77.94 GB   0.00%   425...433
        10.1.64.201  east  rack1  Up      Normal  56.42 GB   12.50%  638...648
        10.1.65.59   west  rack1  Up      Normal  74.5 GB    0.00%   638...649
        10.1.64.235  east  rack1  Up      Normal  79.68 GB   12.50%  850...864
        10.1.65.16   west  rack1  Up      Normal  62.05 GB   0.00%   850...865
        10.1.66.227  east  rack1  Up      Normal  106.73 GB  12.50%  106...080
        10.1.65.226  west  rack1  Up      Normal  79.26 GB   0.00%   106...081
        10.1.66.247  east  rack1  Up      Normal  66.68 GB   12.50%  127...295
        10.1.65.19   west  rack1  Up      Normal  102.45 GB  0.00%   127...297
        10.1.66.141  east  rack1  Up      Normal  53.72 GB   12.50%  148...512
        10.1.65.253  west  rack1  Up      Normal  54.25 GB   0.00%   148...513
  44. nodetool ring KS1:
        Address      DC    Rack   Status  State   Load       Effective-Ownership  Token
        10.1.64.11   east  rack1  Up      Normal  130.72 GB  12.50%               0
        10.1.65.8    west  rack1  Up      Normal  88.81 GB   12.50%               1
        10.1.64.78   east  rack1  Up      Normal  52.68 GB   12.50%               212...216
        10.1.65.181  west  rack1  Up      Normal  66.01 GB   12.50%               212...217
        10.1.66.8    east  rack1  Up      Normal  64.4 GB    12.50%               425...432
        10.1.65.178  west  rack1  Up      Normal  77.96 GB   12.50%               425...433
        10.1.64.201  east  rack1  Up      Normal  56.44 GB   12.50%               638...648
        10.1.65.59   west  rack1  Up      Normal  74.57 GB   12.50%               638...649
        10.1.64.235  east  rack1  Up      Normal  79.72 GB   12.50%               850...864
        10.1.65.16   west  rack1  Up      Normal  62.12 GB   12.50%               850...865
        10.1.66.227  east  rack1  Up      Normal  106.72 GB  12.50%               106...080
        10.1.65.226  west  rack1  Up      Normal  79.28 GB   12.50%               106...081
        10.1.66.247  east  rack1  Up      Normal  66.73 GB   12.50%               127...295
        10.1.65.19   west  rack1  Up      Normal  102.47 GB  12.50%               127...297
        10.1.66.141  east  rack1  Up      Normal  53.75 GB   12.50%               148...512
        10.1.65.253  west  rack1  Up      Normal  54.24 GB   12.50%               148...513
  45. nodetool status:
        $ nodetool status
        Datacenter: ams01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address       Load      Tokens  Owns   Host ID                               Rack
        UN  10.70.48.23   38.38 GB  256     19.0%  7c5fdfad-63c6-4f37-bb9f-a66271aa3423  RAC1
        UN  10.70.6.78    58.13 GB  256     18.3%  94e7f48f-d902-4d4a-9b87-81ccd6aa9e65  RAC1
        UN  10.70.47.126  53.89 GB  256     19.4%  f36f1f8c-1956-4850-8040-b58273277d83  RAC1
        Datacenter: wdc01 (Replication Factor 3)
        =================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address       Load      Tokens  Owns   Host ID                               Rack
        UN  10.24.116.66  65.81 GB  256     22.1%  f9dba004-8c3d-4670-94a0-d301a9b775a8  RAC1
        UN  10.55.104.90  63.31 GB  256     21.2%  4746f1bd-85e1-4071-ae5e-9c5baac79469  RAC1
        UN  10.55.104.27  62.71 GB  256     21.2%  1a55cfd4-bb30-4250-b868-a9ae13d81ae1  RAC1
  46. nodetool cfstats:
        Keyspace: KS1
        Column Family: CF1
        SSTable count: 11
        Space used (live): 32769179336
        Space used (total): 32769179336
        Number of Keys (estimate): 73728
        Memtable Columns Count: 1069137
        Memtable Data Size: 216442624
        Memtable Switch Count: 3
        Read Count: 95
        Read Latency: NaN ms.
        Write Count: 1039417
        Write Latency: 0.068 ms.
        Bloom Filter False Positives: 345
        Bloom Filter False Ratio: 0.00000
        Bloom Filter Space Used: 230096
        Compacted row minimum size: 150
        Compacted row maximum size: 322381140
        Compacted row mean size: 2072156
  47. nodetool cfhistograms:
        $ nodetool cfhistograms KS1 CF1
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       67264     0              0             0         1331591
        2       19512     0              0             0         4241686
        3       35529     0              0             0         474784
        ...
        10      10299     1150           0             0         21768
        12      5475      3569           0             0         3993135
        14      1986      9098           0             0         1434778
        17      258       30916          0             0         366895
        20      0         52980          0             0         186524
        24      0         104463         0             0         25439063
        ...
        179     0         93             1823          1597      1284167
        215     0         84             3880          1231655   1147150
        258     0         170            5164          209282    956487
  48. nodetool proxyhistograms:
        $ nodetool proxyhistograms
        Offset  Read Latency  Write Latency  Range Latency
        60      0             15             0
        72      0             51             0
        86      0             241            0
        103     2             2003           0
        124     9             5798           0
        149     67            7348           0
        179     222           6453           0
        215     184           6071           0
        258     134           5436           0
        310     104           4936           0
        372     89            4997           0
        446     39            6383           0
        535     76797         7518           0
        642     9364748       96065          0
        770     16406421      152663         0
        924     7429538       97612          0
        1109    6781835       176829         0
  49. JMX via JConsole. (Screenshot.)
  50. JMX via MX4J. (Screenshot.)
  51. JMX via JMXTERM:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:type=StorageService
        #bean is set to org.apache.cassandra.db:type=StorageService
        $>info
        #mbean = org.apache.cassandra.db:type=StorageService
        #class name = org.apache.cassandra.service.StorageService
        # attributes
        %0 - AllDataFileLocations ([Ljava.lang.String;, r)
        %1 - CommitLogLocation (java.lang.String, r)
        %2 - CompactionThroughputMbPerSec (int, rw)
        ...
        # operations
        %1 - void bulkLoad(java.lang.String p1)
        %2 - void clearSnapshot(java.lang.String p1, [Ljava.lang.String; p2)
        %3 - void decommission()
  52. JVM Heap Dump via JMAP:
        jmap -dump:format=b,file=heap.bin pid
  53. JVM Heap Dump with YourKit. (Screenshot.)
  54. Platform, Tools, Problems, Maintenance.
  55. Corrupt SSTable. (Very rare.)
  56. Compaction Error:
        ERROR [CompactionExecutor:36] 2013-04-29 07:50:49,060 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[CompactionExecutor:36,1,main]
        java.lang.RuntimeException: Last written key DecoratedKey(138024912283272996716128964353306009224, 61386330356130622d616666362d376330612d666531662d373738616630636265396535) >= current key DecoratedKey(127065377405949402743383718901402082101, 64323962636163652d646561372d333039322d386166322d663064346132363963386131) writing into *-tmp-hf-7372-Data.db
        at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:160)
        at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
        at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
  57. Cause: Change in KeyValidator or a bug in older versions.
  58. Fix: nodetool scrub
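     scrub can be limited to the affected keyspace and column family; a sketch with the hypothetical names used earlier (scrub rewrites each SSTable, skipping rows it cannot read):

        $ nodetool -h localhost scrub KS1 CF1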
  59. Dropped Messages
  60. Logs:
        MessagingService.java (line 658) 173 READ messages dropped in last 5000ms
        StatusLogger.java (line 57) Pool Name             Active  Pending
        StatusLogger.java (line 72) ReadStage             32      284
        StatusLogger.java (line 72) RequestResponseStage  1       254
        StatusLogger.java (line 72) ReadRepairStage       0       0
  61. nodetool tpstats:
        Message type      Dropped
        RANGE_SLICE       0
        READ_REPAIR       0
        BINARY            0
        READ              721
        MUTATION          1262
        REQUEST_RESPONSE  196
  62. Causes: Excessive GC. Overloaded IO. Overloaded Node. Wide Reads / Large Batches.
  63. High Read Latency
  64. nodetool info:
        Token            : 113427455640312814857969558651062452225
        Gossip active    : true
        Thrift active    : true
        Load             : 291.13 GB
        Generation No    : 1368569510
        Uptime (seconds) : 1022629
        Heap Memory (MB) : 5213.01 / 8025.38
        Data Center      : 1
        Rack             : 20
        Exceptions       : 0
        Key Cache        : size 104857584 (bytes), capacity 104857584 (bytes), 13436862 hits, 16012159 requests, 0.907 recent hit rate, 14400 save period in seconds
        Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
  65. nodetool cfstats:
        Column Family: page_views
        SSTable count: 17
        Space used (live): 289942843592
        Space used (total): 289942843592
        Number of Keys (estimate): 1071416832
        Memtable Columns Count: 2041888
        Memtable Data Size: 539015124
        Memtable Switch Count: 83
        Read Count: 267059
        Read Latency: NaN ms.
        Write Count: 10516969
        Write Latency: 0.054 ms.
        Pending Tasks: 0
        Bloom Filter False Positives: 128586
        Bloom Filter False Ratio: 0.00000
        Bloom Filter Space Used: 802906184
        Compacted row minimum size: 447
        Compacted row maximum size: 3973
        Compacted row mean size: 867
  66. nodetool cfhistograms KS1 CF1:
        Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
        1       178437    0              0             0         0
        2       20042     0              0             0         0
        3       15275     0              0             0         0
        4       11632     0              0             0         0
        5       4771      0              0             0         0
        6       4942      0              0             0         0
        7       5540      0              0             0         0
        8       4967      0              0             0         0
        10      10682     0              0             0         284155
        12      8355      0              0             0         15372508
        14      1961      0              0             0         137959096
        17      322       3              0             0         625733930
        20      61        253            0             0         252953547
        24      53        15114          0             0         39109718
        29      18        255730         0             0         0
        35      1         1532619        0             0         0
        ...
  67. nodetool cfhistograms KS1 CF1 (continued):
        Offset  SSTables  Write Latency  Read Latency  Row Size   Column Count
        446     0         120            233           0          0
        535     0         155            261           21361      0
        642     0         127            284           19082720   0
        770     0         88             218           498648801  0
        924     0         86             2699          504702186  0
        1109    0         22             3157          48714564   0
        1331    0         18             2818          241091     0
        1597    0         15             2155          2165       0
        1916    0         19             2098          7          0
        2299    0         10             1140          56         0
        2759    0         10             1281          0          0
        3311    0         6              1064          0          0
        3973    0         4              676           3          0
        ...
  68. jmx-term:
        $ java -jar jmxterm-1.0-alpha-4-uber.jar
        Welcome to JMX terminal. Type "help" for available commands.
        $>open localhost:7199
        #Connection to localhost:7199 is opened
        $>bean org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        #bean is set to org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies
        $>get BloomFilterFalseRatio
        #mbean = org.apache.cassandra.db:columnfamily=CF2,keyspace=KS2,type=ColumnFamilies:
        BloomFilterFalseRatio = 0.5693801541828607;
  69. Back to cfstats:
        Column Family: page_views
        Read Count: 270075
        Bloom Filter False Positives: 131294
  70. Cause: bloom_filter_fp_chance had been set to 0.1 to reduce memory requirements when storing 1+ billion rows per node.
  71. Fix: Changed read queries to select by column name to limit SSTables per query. Long term, migrate to Cassandra v1.2 for off-heap Bloom Filters.
  72. GC Problems
  73. WARN:
        WARN [ScheduledTasks:1] 2013-03-29 18:40:48,158 GCInspector.java (line 145) Heap is 0.9355130159566108 full. You may need to reduce memtable and/or cache sizes.
        INFO [ScheduledTasks:1] 2013-03-26 16:36:06,383 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 207 ms for 1 collections, 10105891032 used; max is 13591642112
        INFO [ScheduledTasks:1] 2013-03-28 22:18:17,113 GCInspector.java (line 122) GC for ParNew: 256 ms for 1 collections, 6504905688 used; max is 13591642112
  74. Serious GC Problems:
        INFO [ScheduledTasks:1] 2013-04-30 23:21:11,959 GCInspector.java (line 122) GC for ParNew: 1115 ms for 1 collections, 9355247296 used; max is 12801015808
  75. Flapping Node:
        INFO [GossipTasks:1] 2013-03-28 17:42:07,944 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:42:54,740 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipTasks:1] 2013-03-28 17:46:00,585 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
        INFO [GossipStage:1] 2013-03-28 17:46:13,855 Gossiper.java (line 816) InetAddress /10.1.20.144 is now UP
        INFO [GossipStage:1] 2013-03-28 17:48:48,966 Gossiper.java (line 830) InetAddress /10.1.20.144 is now dead.
  76. "GC problems are the result of workload and configuration." - Aaron Morton, Just Now.
  77. Workload Correlation? Look for wide rows, large writes, wide reads, un-bounded multi-row reads or writes.
  78. Compaction Correlation? Slow down compaction to improve stability. (Monitor and reverse when resolved.)
        concurrent_compactors: 2
        compaction_throughput_mb_per_sec: 8
        in_memory_compaction_limit_in_mb: 32
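     compaction_throughput_mb_per_sec can also be changed on a live node while experimenting; a sketch (the cassandra.yaml values above are still needed for the change to survive a restart):

        $ nodetool -h localhost setcompactionthroughput 8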
  79. GC Logging Insights: Slow down the rate of tenuring and enable full GC logging.
        HEAP_NEWSIZE="1200M"
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
  80. GC'ing Objects in ParNew:
        {Heap before GC invocations=7937 (full 205):
         par new generation total 1024000K, used 830755K ...)
         eden space 819200K, 100% used ...)
         from space 204800K, 5% used ...)
         to space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 8090240 bytes, 8090240 total
        - age 2: 565016 bytes, 8655256 total
        - age 3: 330152 bytes, 8985408 total
        - age 4: 657840 bytes, 9643248 total
  81. GC'ing Objects in ParNew:
        {Heap before GC invocations=7938 (full 205):
         par new generation total 1024000K, used 835015K ...)
         eden space 819200K, 100% used ...)
         from space 204800K, 7% used ...)
         to space 204800K, 0% used ...)
        Desired survivor size 104857600 bytes, new threshold 4 (max 4)
        - age 1: 1315072 bytes, 1315072 total
        - age 2: 541072 bytes, 1856144 total
        - age 3: 499432 bytes, 2355576 total
        - age 4: 316808 bytes, 2672384 total
  82. Cause: Nodes had wide rows, 1.3+ billion rows and 3+ GB of Bloom Filters. (Using the older bloom_filter_fp_chance of 0.000744.)
  83. Fix: Increased FP chance to 0.1 on one CF and 0.01 on others. (One CF reduced from 770MB to 170MB of Bloom Filters.)
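     bloom_filter_fp_chance is a per column family setting; a sketch of the change from cassandra-cli, with a hypothetical column family name (newly written SSTables pick up the new ratio; existing SSTables keep their filters until rewritten, for example by nodetool scrub or upgradesstables):

        [default@KS1] update column family page_views with bloom_filter_fp_chance = 0.01;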
  84. Fix: Increased index_interval from 128 to 512. (Increased key_cache_size_in_mb to 200.)
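     Both are cassandra.yaml settings in the 1.x line; a sketch with the values from this fix:

        index_interval: 512
        key_cache_size_in_mb: 200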
  85. Fix (cassandra-env.sh settings):
        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="1000M"
        -XX:SurvivorRatio=4
        -XX:MaxTenuringThreshold=2
  86. Anatomy of a Partition. (From a 1.0 cluster.)
  87. Node 23 Was Up:
        cassandra23# bin/nodetool -h localhost info
        Token            : 28356863910078205288614550619314017621
        Gossip active    : true
        Load             : 275.44 GB
        Generation No    : 1762556151
        Uptime (seconds) : 67548
        Heap Memory (MB) : 2926.44 / 8032.00
        Data Center      : DC1
        Rack             : RAC_unknown
        Exceptions       : 0
  88. Other Nodes Saw It Down:
        cassandra20# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Down    Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  89. And Node 23 Saw Them Up:
        cassandra23# nodetool -h localhost ring
        Address       DC   Rack   Status  State   Load
        10.37.114.8   DC1  RAC20  Up      Normal  285.86 GB
        10.29.60.10   DC2  RAC23  Up      Normal  277.86 GB
        10.6.130.70   DC1  RAC21  Up      Normal  244.9 GB
        10.29.60.14   DC2  RAC24  Up      Normal  296.85 GB
        10.37.114.10  DC1  RAC22  Up      Normal  255.81 GB
        10.29.60.12   DC2  RAC25  Up      Normal  316.88 GB
  90. Still Available: Node 23 could serve requests at LOCAL_QUORUM, QUORUM and ALL Consistency. Other nodes could serve requests at LOCAL_QUORUM and QUORUM but not ALL Consistency.
  91. Relax: The application was up.
  92. Gossip?
        cassandra20# bin/nodetool -h localhost gossipinfo
        ...
        /10.29.60.10
          LOAD:2.98347080902E11
          STATUS:NORMAL,28356863910078205288614550619314017621
          RPC_ADDRESS:10.29.60.10
          SCHEMA:fe933880-19bd-11e1-0000-5ff37d368cb6
          RELEASE_VERSION:1.0.5
  93. Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
        TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 526912 greater than 7951 for /10.29.60.10
  94. More Gossip Logs On Node 20?
        log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
        log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.37.114.10:1323736718:5830 /10.6.130.70:1323732220:10379 /10.29.60.12:1323733099:9493
        // Expected call to the FailureDetector
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10
  95. Cause: Generation is initialised at bootstrap to seconds past the Epoch. 1762556151 is Fri, 07 Nov 2025 22:55:51 GMT.
        cassandra23# bin/nodetool -h localhost info
        Generation No    : 1762556151
        TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /
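     The conversion is easy to check from a shell (GNU date shown; output formatting varies with locale):

        $ date -u -d @1762556151
        Fri Nov  7 22:55:51 UTC 2025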
  96. Fix:
        [default@system] get LocationInfo[L];
        => (column=ClusterName, value=737069, timestamp=1320437246450000)
        => (column=Generation, value=690e78f6, timestamp=1762556150811000)
  97. Platform, Tools, Problems, Maintenance.
  98. Maintenance: Expand to Multi DC.
  99. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  100. DC Aware Snitch? SimpleSnitch puts all nodes in rack1 and datacenter1.
  101. More Snitches? PropertyFileSnitch, RackInferringSnitch.
  102. Gossip Based Snitch? Ec2Snitch, Ec2MultiRegionSnitch, GossipingPropertyFileSnitch*
  103. Changing the Snitch: Do Not change the DC or Rack for an existing node. (Cassandra will not be able to find your data.)
  104. Moving to the GossipingPropertyFileSnitch: Update cassandra-topology.properties on existing nodes with the existing DC/Rack settings for all existing nodes. Set the default to the new DC.
  105. Moving to the GossipingPropertyFileSnitch: Update cassandra-rackdc.properties on existing nodes with the existing DC/Rack for the node.
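     A sketch of the two files on an existing node, assuming the existing DC is named datacenter1, the new DC new_dc, and illustrative addresses:

        # cassandra-topology.properties (all existing nodes listed; unknown nodes fall through to default)
        10.1.64.11=datacenter1:rack1
        10.1.64.78=datacenter1:rack1
        default=new_dc:rack1

        # cassandra-rackdc.properties (this node's own DC and rack)
        dc=datacenter1
        rack=rack1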
  106. Moving to the GossipingPropertyFileSnitch: Use a rolling restart to upgrade existing nodes to the GossipingPropertyFileSnitch.
  107. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  108. Got NTS? Must use NetworkTopologyStrategy for Multi DC deployments.
  109. SimpleStrategy: Order Token Ranges. Start with the range that contains the Row Key. Count to RF.
  110. SimpleStrategy. (Diagram: placing the row "foo".)
  111. NetworkTopologyStrategy: Order Token Ranges in the DC. Start with the range that contains the Row Key. Add the first unselected Token Range from each Rack. Repeat until RF ranges are selected.
  112. NetworkTopologyStrategy. (Diagram: placing the row "foo" across Rack 1, Rack 2 and Rack 3.)
  113. NetworkTopologyStrategy & 1 Rack. (Diagram: placing the row "foo" with a single Rack 1.)
  114. Changing the Replication Strategy: Be careful if the existing configuration has multiple Racks. (Cassandra may not be able to find your data.)
  115. Changing the Replication Strategy: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:0.
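     A sketch of that change in CQL 3 with a hypothetical keyspace name; the same statement is reused later with 'new_dc': 3 once the new nodes are running, and 'new_dc': 0 can simply be omitted at this stage:

        ALTER KEYSPACE "KS1" WITH replication =
          {'class': 'NetworkTopologyStrategy', 'datacenter1': 3, 'new_dc': 0};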
  116. Preparing The Client: Disable auto node discovery or use DC aware methods. Use LOCAL_QUORUM or EACH_QUORUM.
  117. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  118. Configuring New Nodes: Add auto_bootstrap: false to cassandra.yaml. Use the GossipingPropertyFileSnitch. Three Seeds from each DC. (Use cluster_name as a safety.)
  119. Configuring New Nodes: Update cassandra-rackdc.properties on new nodes with the new DC/Rack for the node. (Ignore cassandra-topology.properties.)
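     A sketch of the relevant cassandra.yaml fragment on a new node, plus its cassandra-rackdc.properties (cluster name, seed addresses and DC name are illustrative):

        cluster_name: 'Production'
        auto_bootstrap: false
        endpoint_snitch: GossipingPropertyFileSnitch
        seed_provider:
          - class_name: org.apache.cassandra.locator.SimpleSeedProvider
            parameters:
              - seeds: "10.1.64.11,10.1.66.8,10.1.64.201,10.20.0.11,10.20.0.12,10.20.0.13"

        # cassandra-rackdc.properties
        dc=new_dc
        rack=rack1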
  120. Start The New Nodes: New Nodes join the Ring in the new DC without data or traffic.
  121. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  122. Change the Replication Factor: Update the Keyspace configuration to use NetworkTopologyStrategy with datacenter1:3 and new_dc:3.
  123. Change the Replication Factor: New DC nodes will start receiving writes from old DC coordinators.
  124. Expand to Multi DC: Update Snitch, Update Replication Strategy, Add Nodes, Update Replication Factor, Rebuild.
  125. Y U No Bootstrap? (Diagram: DC 1 and DC 2.)
  126. nodetool rebuild DC1. (Diagram: DC 1 and DC 2.)
  127. Rebuild Complete: New Nodes now performing Strong Consistency reads. (If EACH_QUORUM is used for writes.)
  128. Summary: Relax. Understand the Platform and the Tools. Always maintain Availability.
  129. Thanks.
  130. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
