Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constantly Try to Bring Scylla to its Knees

Testing a complex system like Scylla is a challenge in its own right. There are many environments, workloads, and problems, and simple problems get worse at scale. In this talk, we explore the testing methods we employ in our QA lab and our plans to improve them in the years to come.

Published in: Technology

  1. Cry in the dojo, laugh in the battlefield: how we constantly try to bring Scylla to its knees so you don't have to.
     Roy Dahan, QA Manager, Scylla
  2. Roy Dahan
     Roy has over 10 years of experience testing large-scale distributed systems, with a focus on storage/data systems, and managing small to large teams responsible for all aspects of testing using a highly automated approach.
  3. Our Goal
     ▪ Achieving the Highest Levels of System Stability & Availability
     ▪ Maintaining Data Integrity
     ▪ Preventing Performance Degradation Over Time
     ▪ Increasing User Confidence
     All of the above, even when BAD THINGS happen on “Production-like Environments”
  4. How We Test Scylla
     Scylla Testing:
     ▪ Unit: scylla-unittest
     ▪ Functional: dtest
     ▪ Compatibility: dtest, Driver Tests
     ▪ Integration: Janus-Graph Tests, Titan-test, Spark
     ▪ Scale / Performance: S-C-T
     ▪ Stress / Load: S-C-T, Cassandra Stress
     ▪ System / Longevity: S-C-T, Jepsen
  5. Distributed Tests (dtest)
     ▪ Functional “Black Box” Tests
     ▪ Verify Our Compatibility with Cassandra
     ▪ Enhanced & Extended to Catch Scylla Regressions
     ▪ Around 10% (208) of the Reported Issues on the Scylla Project Reference a dtest (Detected/Reproduced by dtest)
     ▪ About 675 Tests Run Regularly as Part of the “Regression Suite”
  6. Scylla-Cluster-Tests (SCT)
     ▪ Automation Library and Test Collection for Scylla & Cassandra Clusters
     ▪ Supports Multiple Backends, such as AWS / GCE / OpenStack / Libvirt
     ▪ Tests Are Based on Chaos Engineering Principles:
       o Build a Hypothesis around Steady State Behavior
       o Vary Real-world Events
       o Automate Experiments to Run Continuously
     ▪ Around 4% (105) of the Reported Issues on the Scylla Project Reference an SCT Test (Detected/Reproduced by an SCT Test)
  7. SCT Longevity Testing
     Test Setup (Our Defaults):
     ▪ Cluster of N Scylla DB Nodes (N=6)
     ▪ Set of X Loader Nodes (X=2)
     ▪ Scylla Monitoring Server
     [Diagram: clients connected to a cluster of nodes]
  8. SCT Longevity Testing
     Test Setup - Example on GCE: [figure]
  9. SCT Longevity Testing
     The Test Flow:
     ▪ Client-Side Loaders Run Workloads: a set of Cassandra-Stress loads (Write, Mixed, Counters, User Profiles)
     ▪ For X Hours / Days / Weeks
     ▪ A “Nemesis” Is Randomly Selected from the Predefined List
       o Some Nemeses Disrupt Nodes in the Cluster
       o Some Run Standard Cluster Operations
     Current Nemesis types: StopStartService, StopWaitStartService, Drainer, Decommission, CorruptThenRepair, CorruptThenRebuild, NoCorruptRepair, Refresh, MajorCompaction, ModifyTableProperties, Enospc
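The random selection step in the test flow can be sketched roughly as follows. This is a minimal illustration, not SCT's actual implementation: the function `run_nemesis_loop`, its `disrupt` and `sleep` hooks, and the node names are all hypothetical. Only the nemesis type names come from the slide.

```python
import random

# Nemesis type names as listed on the slide.
NEMESIS_TYPES = [
    "StopStartService", "StopWaitStartService", "Drainer", "Decommission",
    "CorruptThenRepair", "CorruptThenRebuild", "NoCorruptRepair", "Refresh",
    "MajorCompaction", "ModifyTableProperties", "Enospc",
]


def run_nemesis_loop(nodes, rounds, rng=None, disrupt=None, sleep=None,
                     interval_s=300):
    """Pick a random nemesis and a random target node once per round.

    `disrupt` and `sleep` are hypothetical hooks; by default they do nothing,
    so the sketch runs instantly and touches no real cluster.
    """
    rng = rng or random.Random()
    disrupt = disrupt or (lambda name, node: None)
    sleep = sleep or (lambda seconds: None)
    history = []
    for _ in range(rounds):
        name = rng.choice(NEMESIS_TYPES)   # random nemesis from the list
        node = rng.choice(nodes)           # random target node
        disrupt(name, node)
        history.append((name, node))
        sleep(interval_s)                  # wait before the next nemesis
    return history


if __name__ == "__main__":
    cluster = ["node-%d" % i for i in range(1, 7)]  # assumed node names
    for name, node in run_nemesis_loop(cluster, rounds=3):
        print(name, "->", node)
```

Passing the hooks in as parameters keeps the selection logic testable without a cluster; a real run would pass a function that executes the disruption and `time.sleep` for the waiting step.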
  10. SCT Longevity Testing
      Test Fixture Example:

      test_duration: 5760
      stress_cmd: ["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..100000000 -log interval=5",
                   "cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..1000000",
                   "cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
      n_db_nodes: 6
      n_loaders: 2
      n_monitor_nodes: 1
      nemesis_class_name: 'ChaosMonkey'
      nemesis_interval: 5
      failure_post_behavior: keep
      space_node_threshold: 644245094
      ip_ssh_connections: 'private'
      experimental: 'true'
  12. SCT Longevity Testing
      Nemesis Code Examples:

      def disrupt_destroy_data_then_repair(self):
          self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
          # Delete set of sstables from data directory
          self._destroy_data()
          # Try to save the node
          self.repair_nodetool_repair()

      def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
          self._set_current_disruption('StopWaitStartService %s' % self.target_node)
          self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
          self.target_node.wait_db_down()
          self.log.info("Sleep for %s seconds", sleep_time)
          time.sleep(sleep_time)
          self.target_node.remoter.run('sudo systemctl start scylla-server.service')
          self.target_node.wait_db_up()
  13. SCT Longevity Testing
      Test Verification & Analysis:
      ▪ Application Load (cassandra-stress) Doesn’t Stop
      ▪ Auto Detection of:
        • Coredumps
        • Errors
        • Exceptions
        • Operation Failures (repair, add node, refresh, compaction, etc.)
      ▪ Auto Detection of Performance Degradations (unexpectedly lower throughput / higher latencies due to operations)
      ▪ Compare Nemesis Execution Durations Across Builds to Detect Possible Regressions
  14. SCT Longevity Testing
      Longevity monitoring example: “Total Requests Served” (op/s) correlated with Nemesis executions.
  15. SCT Longevity Testing
      Longevity monitoring example: “Requests Rate Served” (op/s per instance) correlated with Nemesis executions.
  16. SCT Longevity Testing
      Longevity monitoring example: “CPU utilization” (% per instance) correlated with Nemesis executions.
  17. SCT Longevity Testing
      Test Summary Output - Nemesis Execution:

      50GB DataSet Test: (Nemesis every 5 minutes, 4 days)
      -----------------------------------------------
      | Nemesis Type         | Count | Avg Time (s) |
      -----------------------------------------------
      | CorruptThenRebuild   |   103 |        93.79 |
      | Decommission         |   111 |       231.89 |
      | Drainer              |   109 |        48.27 |
      | CorruptThenRepair    |   113 |       285.71 |
      | Refresh              |    95 |         7.72 |
      | NoCorruptRepair      |    97 |       331.73 |
      | StopStartService     |   133 |        26.92 |
      | MajorCompaction      |   134 |        20.63 |
      | ModifyTable          |   197 |         1.50 |
      | Enospc               |   114 |        26.33 |
      | StopWaitStartService |    98 |        66.30 |
      -----------------------------------------------

      1TB DataSet Test: (Nemesis every 30 minutes, 6 days)
      -----------------------------------------------
      | Nemesis Type         | Count | Avg Time (s) |
      -----------------------------------------------
      | CorruptThenRebuild   |     2 |       732.50 |
      | Decommission         |     7 |      2913.86 |
      | Drainer              |     6 |       213.00 |
      | CorruptThenRepair    |     5 |      4942.60 |
      | Refresh              |     6 |        10.50 |
      | NoCorruptRepair      |     3 |      2835.33 |
      | StopStartService     |     2 |       195.00 |
      | MajorCompaction      |     3 |       663.33 |
      | ModifyTable          |     6 |         4.67 |
      | Enospc               |     6 |       221.00 |
      | StopWaitStartService |     6 |       492.17 |
      -----------------------------------------------
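A summary of this shape (count and average duration per nemesis type) could be produced from raw execution records along the following lines. This is only a sketch under assumptions: the record format, the function names `summarize` and `format_summary`, and the sample durations are hypothetical and are not taken from SCT or from the results above.

```python
from collections import defaultdict


def summarize(executions):
    """Return {nemesis_name: (count, average_seconds)}.

    `executions` is an iterable of (nemesis_name, duration_seconds) pairs,
    an assumed record format chosen for this illustration.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for name, seconds in executions:
        totals[name][0] += 1
        totals[name][1] += seconds
    return {name: (count, total / count)
            for name, (count, total) in totals.items()}


def format_summary(summary):
    """Render the summary as a plain-text table, one row per nemesis type."""
    lines = ["| %-20s | %5s | %12s |" % ("Nemesis Type", "Count", "Avg Time (s)")]
    for name in sorted(summary):
        count, avg = summary[name]
        lines.append("| %-20s | %5d | %12.2f |" % (name, count, avg))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical durations, for illustration only.
    sample = [("Refresh", 7.0), ("Refresh", 9.0), ("Drainer", 48.0)]
    print(format_summary(summarize(sample)))
```

Keeping aggregation separate from formatting means the same per-type averages can also feed a comparison between builds.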
  18. SCT Longevity Testing
      Nemesis Execution Analysis: Auto-analysis and reports based on test statistics stored automatically in ElasticSearch
  19. Example of Issue Detected by Longevity
  20. Example of Nemesis Added due to Issue
  21. Example of Nemesis Added due to Issue

      def disrupt_modify_table_comment(self):
          self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
          comment = ''.join(random.choice(string.ascii_letters) for i in xrange(24))
          cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
          self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)

      def disrupt_modify_table_gc_grace_time(self):
          self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
          gc_grace_seconds = random.choice(xrange(216000, 864000))
          cmd = ("ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND"
                 " gc_grace_seconds = {};").format(gc_grace_seconds)
          self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)
  22. Multi DC Longevity - The plot thickens
      Test Setup (Our Defaults):
      ▪ Cluster of N Scylla DB Nodes (N=15)
      ▪ Across M “Data Centers” (M=3)
      ▪ Set of X Loader Nodes (X=3)
      ▪ Scylla Monitoring Server
      ▪ Set of Cassandra-Stress Commands Running on the Loaders (Write, Mixed, Counters, User Profiles)
      The tc utility is used to impose random network delays, packet drops, and packet reordering between Data Centers.
      [Diagram: DC1, DC2, and DC3, each with a client]
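As a rough illustration of how random network impairments might be generated for tc, the sketch below builds a `tc qdisc ... netem` command string with a random delay, loss, and reorder setting. The slides do not show the actual commands used, so the interface name `eth0`, the value ranges, and the helper `build_netem_command` are assumptions. The sketch only builds the string and does not run it, and it applies to the whole interface rather than to inter-DC traffic only.

```python
import random


def build_netem_command(rng, dev="eth0"):
    """Build a tc/netem command with randomized delay, loss, and reorder.

    All ranges below are arbitrary choices for illustration. netem only
    reorders packets when a delay is also configured, so the two are set
    together.
    """
    delay_ms = rng.randint(10, 200)                 # base delay
    jitter_ms = rng.randint(1, 20)                  # delay variation
    loss_pct = round(rng.uniform(0.1, 2.0), 1)      # packet drop rate
    reorder_pct = round(rng.uniform(1.0, 10.0), 1)  # packets sent out of order
    return (
        "tc qdisc add dev {dev} root netem "
        "delay {delay}ms {jitter}ms loss {loss}% reorder {reorder}%"
    ).format(dev=dev, delay=delay_ms, jitter=jitter_ms,
             loss=loss_pct, reorder=reorder_pct)


def build_netem_reset_command(dev="eth0"):
    """Command that removes the root qdisc again, restoring normal traffic."""
    return "tc qdisc del dev {dev} root".format(dev=dev)


if __name__ == "__main__":
    print(build_netem_command(random.Random()))
    print(build_netem_reset_command())
```

Restricting the impairment to traffic between specific data centers would need additional tc filters or per-destination rules, which this sketch leaves out.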
  23. Performance Regression
      ▪ Set of Predefined Workloads & Setups
        ○ Write
        ○ Read
        ○ Mixed
        ○ Customer Workloads
      ▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
      ▪ Master Daily Regression Suite - Automatically Compare Results with the Previous Build & the “Best” Build
      ▪ Release Regression Suite - Automatically Compare Results with Previous Releases (including RCs)
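The comparison against a previous or “best” build can be sketched as a simple threshold check. This is an assumption-laden illustration rather than the suite's real logic: the metric names, the 5% tolerance, the function `find_regressions`, and all numbers are hypothetical, and fetching results from ElasticSearch is not shown.

```python
# Whether a higher value is better for each metric (assumed metric names).
HIGHER_IS_BETTER = {"op_rate": True, "latency_99_ms": False}


def find_regressions(current, baseline, tolerance=0.05):
    """Return {metric: relative_change} for metrics that got worse.

    A metric counts as regressed when it moved in the bad direction by more
    than `tolerance` (a fraction) relative to the baseline build.
    """
    regressions = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        change = (current[metric] - baseline[metric]) / baseline[metric]
        worse = change < -tolerance if higher_is_better else change > tolerance
        if worse:
            regressions[metric] = change
    return regressions


if __name__ == "__main__":
    # Hypothetical results, for illustration only.
    previous_build = {"op_rate": 100000.0, "latency_99_ms": 10.0}
    best_build = {"op_rate": 105000.0, "latency_99_ms": 9.5}
    current_build = {"op_rate": 90000.0, "latency_99_ms": 10.2}
    print("vs previous:", find_regressions(current_build, previous_build))
    print("vs best:", find_regressions(current_build, best_build))
```

Comparing against both baselines serves different purposes: the previous build catches sudden drops, while the best build catches slow drift that never exceeds the tolerance between consecutive builds.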
  24. Performance Regression
      Test-Write - Total Op rate (op/s) by Release: [chart]
  25. Performance Regression
      Test-Write - 99th Percentile Latency (ms) by Release: [chart]
  26. Large Scale Tests
      ▪ Clusters of 100s of Nodes
      ▪ 10s of TB Datasets
      ▪ Multi-Core Scylla Nodes
      ▪ Many sstables
      Sample: a 101-node Scylla cluster running on AWS.
  27. On QA Roadmap
      Longevity:
      ▪ Embed CharybdeFS (fault injection FS) in Longevity
      ▪ Extend Workload Types
      ▪ Two+ Nemeses in Parallel
      ▪ Add More “Sudden Death” Types of Nemesis
      ▪ Enable “sstables integrity checker”
      Load & Scale:
      ▪ XXL Cluster Sizes (1000+ nodes)
      ▪ Enhance Load Testing to More Server Dimensions (Network, Disk)
  28. On QA Roadmap
      Performance:
      ▪ Add More “Real World Workloads” to Daily Regressions
      ▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
      ▪ Collecting Latency Histograms for Various Load Types
      3rd Party Integration:
      ▪ Spark & Titan Integration Suites
      ▪ Java & Golang Driver Integration Suites
      Tools & Infrastructure:
      ▪ Enhance Auto Analysis Based on Statistics in ElasticSearch
      ▪ Running SCT Using an Existing Env
  29. THANK YOU
      Roy@scylladb.com
      Please stay in touch
      Any questions?
