Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen

Testing Cassandra with Jepsen

Published in: Software

  1. Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
     Joel Knighton (@joelknighton), DataStax
     #CassandraSummit
  2. Who I am
     • Mathematician
     • Software hobbyist
     • Logic enthusiast
     • Former DataStax intern
     • DataStax Cassandra developer
  3. What I Do
     • Deconstruct
     • Formalize
     • Communicate
     • Prove
     • Automate
  4. How We Test #1
     • Unit tests
     • ant test
     • In-tree
  5. How We Test #2
     • Distributed tests
     • nosetests
     • On GitHub – available at riptano/cassandra-dtest
  6. Why You’re Here
     • Jepsen
     • Kyle Kingsbury (aphyr)
     • https://aphyr.com/tags/jepsen
  7. What Jepsen Is
     • A blog series about distributed systems behavior
     • A talk series about distributed systems behavior
     • A Clojure library to test the behavior of distributed systems
     • A collection of tests written using that library
  8. What We Hope
     Jepsen 💘 Cassandra
  9. What I Did
     • Jepsen tests
     • lein test
     • On GitHub – available at riptano/jepsen
  10. A Test Incarnate
      {:name …                      ; names the results
       :os …                        ; prepares the os
       :db …                        ; configures/starts/stops the db
       :client …                    ; interacts with the db
       :generator …                 ; instructions on how to interact
       :conductors {:nemesis …}     ; interacts with the environment
       :checker …}                  ; looks at and assesses test run
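The test map on the slide above is ordinary Clojure data, so its shape can be inspected with plain map functions. The following is a minimal sketch under that assumption: `required-keys`, `missing-keys`, and `example-test` are hypothetical names for illustration, and the placeholder keyword values stand in for the real os, db, client, generator, conductor, and checker implementations that the slide elides.

```clojure
(require '[clojure.set :as set])

;; The seven top-level keys named on the slide.
(def required-keys
  #{:name :os :db :client :generator :conductors :checker})

(defn missing-keys
  "Returns the set of slide-listed keys that are absent from test-map."
  [test-map]
  (set/difference required-keys (set (keys test-map))))

;; Hypothetical example: every value except :name is a placeholder.
(def example-test
  {:name       "cassandra cql set isolate node decommission"
   :os         :placeholder-os
   :db         :placeholder-db
   :client     :placeholder-client
   :generator  :placeholder-generator
   :conductors {:nemesis :placeholder-nemesis}
   :checker    :placeholder-checker})

(missing-keys example-test)
(missing-keys (dissoc example-test :checker))
```

A check like this only confirms that the map is fully populated; it says nothing about whether the values behave correctly.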
  11. What You Need
      One machine to run the tests + n machines to run Cassandra
  12. How A Test Runs
      [Diagram: lein test prepares the os on nodes n1, n2, n3, n4, n5]
  13. How A Test Runs
      [Diagram: lein test sets up the db on nodes n1, n2, n3, n4, n5]
  14. How A Test Runs
      [Diagram: lein test drives client 1 through client 5 and the nemesis against nodes n1, n2, n3, n4, n5]
      Operations shown: read, write 3, start nemesis, write 4, read, stop nemesis, write 1, cas 2 -> 3, …
  15. How A Test Runs
      [Diagram: lein test hands the recorded history to the checker]
      1 – read
      2 – write 3
      1 – read 0
      n – start nemesis
      2 – write timed-out
      3 – write 4
      n – started nemesis
      3 – wrote 4
      4 – read
      4 – read 4
      n – stop nemesis
      0 – write 1
      1 – cas 2 -> 3
      n – stopped nemesis
      …
      Checker output shown: valid? Latency
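The history on the slide above interleaves invocations and completions from several processes. One way to picture how a checker might line them up is sketched below, assuming each entry is a map with `:process`, `:type`, `:f`, and `:value` keys and that a timed-out operation is recorded with type `:info` (meaning the outcome is unknown). The names `run-history`, `pair-ops`, and `outcome-counts` are hypothetical, and nemesis entries are left out for brevity.

```clojure
;; A shortened version of the slide's history (nemesis entries omitted).
(def run-history
  [{:process 1 :type :invoke :f :read  :value nil}
   {:process 2 :type :invoke :f :write :value 3}
   {:process 1 :type :ok     :f :read  :value 0}
   {:process 2 :type :info   :f :write :value 3}   ; write timed out
   {:process 3 :type :invoke :f :write :value 4}
   {:process 3 :type :ok     :f :write :value 4}])

(defn pair-ops
  "Pairs each :invoke with the next entry from the same process.
  Returns a vector of [invocation completion] pairs in completion order."
  [history]
  (:pairs
   (reduce
    (fn [acc op]
      (let [p (:process op)]
        (cond
          ;; Remember the open invocation for this process.
          (= :invoke (:type op))
          (assoc-in acc [:pending p] op)

          ;; A completion closes the open invocation, if there is one.
          (contains? (:pending acc) p)
          (-> acc
              (update :pairs conj [(get-in acc [:pending p]) op])
              (update :pending dissoc p))

          :else acc)))
    {:pending {} :pairs []}
    history)))

(defn outcome-counts
  "Counts completed operations by completion type."
  [history]
  (frequencies (map (comp :type second) (pair-ops history))))

(outcome-counts run-history)
```

Keeping the unknown outcome as its own type matters later: a checker has to treat a timed-out write as something that may or may not have taken effect.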
  16. Single Test Deep-Dive
      lein test :only cassandra.collections.set-test/cql-set-isolate-node-decommission
  17. Single Test Name
      The test name labels the folder where test results, logs, and history are stored, along with a timestamp:
      cassandra cql set isolate node decommission
  18. Single Test Nodes
      [:n1 :n2 :n3 :n4 :n5]
  19. Single Test Net
      net/iptables
      (drop! …)  ; use iptables to drop packets
      (heal! …)  ; flush iptables
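As a rough illustration of the two operations on the slide above, the sketch below builds the shell commands that dropping and healing could translate to. It is an assumption, not the library's implementation: `drop-command`, `drop-commands`, and `heal-command` are hypothetical names, the IP addresses are made up, and the exact flags Jepsen passes to iptables may differ. The flags used here (`-A INPUT -s … -j DROP` to append a drop rule, `-F` to flush rules) are standard iptables options.

```clojure
(defn drop-command
  "Builds an iptables command that drops inbound packets from src-ip."
  [src-ip]
  (str "iptables -A INPUT -s " src-ip " -j DROP"))

(defn drop-commands
  "Builds one drop command per source address."
  [src-ips]
  (mapv drop-command src-ips))

;; Flushing all rules removes every drop rule added above.
(def heal-command "iptables -F")

(drop-commands ["10.0.0.3" "10.0.0.4"])
```

In a real run these commands would be executed on each node over SSH; the sketch only produces the strings.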
  20. Single Test OS
      debian/os
      (setup! …)     ; adjust hostfile
                     ; update package manager
                     ; install base packages like curl, iptables, etc.
                     ; make sure network is healed
      (teardown! …)
  21. Single Test DB
      cassandra.core/db
      (setup! …)     ; shutdown and wipe Cassandra if running
                     ; install, configure, and start Cassandra
      (teardown! …)  ; shutdown and wipe Cassandra
      (log-files …)  ; return path to log files
  22. Single Test Client
      cql-set-client
      (setup! …)     ; driver connect to all nodes
                     ; create schema
      (invoke! …)    ; add? Run CQL to add to set, handle errors
                     ; read? Read value of CQL set, handle errors
      (teardown! …)  ; disconnect driver
  23. Single Test Generator
      (gen/phases
        (->> (adds)
             (gen/stagger 1/10)
             (gen/delay 1/2)
             std-gen)
        (read-once))
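The generator on the slide above has two phases: a stream of adds, then a single read once the adds are finished. The sketch below models only that ordering as a plain sequence of operation maps. It is a simplification under stated assumptions: `add-ops`, `read-op`, `phases`, and `schedule` are hypothetical names, the timing added by `gen/stagger` and `gen/delay` is not modeled, and `std-gen` (not defined on the slide) is ignored.

```clojure
(defn add-ops
  "Returns n add invocations with values 0 to n-1."
  [n]
  (for [i (range n)]
    {:type :invoke :f :add :value i}))

;; The single read issued after all adds.
(def read-op {:type :invoke :f :read :value nil})

(defn phases
  "Runs each sequence of operations to completion before the next starts."
  [& op-seqs]
  (apply concat op-seqs))

(defn schedule
  "n adds followed by one final read."
  [n]
  (phases (add-ops n) [read-op]))

(schedule 3)
```

The final read is what the checker later compares against the acknowledged adds, which is why it has to come after the add phase rather than interleave with it.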
  24. Single Test Conductors
      {:nemesis        (nemesis/partition-random-node)
       :decommissioner (c/decommissioner)}
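To make the idea of partitioning a random node concrete, here is a small sketch that picks one node and computes which peers each node should drop packets from. This is an assumption about the general technique, not a copy of the library's code: `isolate-node` and `isolate-random-node` are hypothetical names.

```clojure
(defn isolate-node
  "Returns a map from each node to the set of nodes whose packets it
  should drop, so that `isolated` is cut off from every other node."
  [nodes isolated]
  (let [others (disj (set nodes) isolated)]
    (into {isolated others}
          (for [n others]
            [n #{isolated}]))))

(defn isolate-random-node
  "Picks one node at random and isolates it from the rest."
  [nodes]
  (isolate-node nodes (rand-nth (vec nodes))))

(isolate-node [:n1 :n2 :n3 :n4 :n5] :n3)
```

The map is symmetric by construction: the isolated node drops traffic from everyone, and everyone drops traffic from the isolated node, so the remaining four nodes can still talk to each other.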
  25. What a Conductor Is
      It’s just a client
  26. Single Test Checker
      checker/set
      (check …)  ; look at history of run
                 ; find ok or uncertain adds
                 ; compare these to final read
                 ; return map with validity and
                 ; ok, lost, unexpected, recovered
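The checker steps on the slide above can be sketched with ordinary set operations. This is a simplified model written for illustration, not the library's checker: `values-of`, `check-set`, and `set-history` are hypothetical names, history entries are assumed to be maps with `:type`, `:f`, and `:value`, and an add whose outcome is unknown is assumed to complete with type `:info`.

```clojure
(require '[clojure.set :as set])

(defn values-of
  "Values of all history entries with the given :type and :f."
  [history type f]
  (->> history
       (filter #(and (= type (:type %)) (= f (:f %))))
       (map :value)))

(defn check-set
  "Compares attempted and acknowledged adds against the final read."
  [history]
  (let [attempts   (set (values-of history :invoke :add))
        acked      (set (values-of history :ok :add))
        final-read (set (last (values-of history :ok :read)))
        ok         (set/intersection final-read attempts)
        lost       (set/difference acked final-read)       ; acknowledged but missing
        unexpected (set/difference final-read attempts)    ; present but never attempted
        recovered  (set/difference ok acked)]              ; present but never acknowledged
    {:valid?     (and (empty? lost) (empty? unexpected))
     :ok         ok
     :lost       lost
     :unexpected unexpected
     :recovered  recovered}))

;; Hypothetical run: add 1 times out but lands; add 2 is acknowledged but lost.
(def set-history
  [{:type :invoke :f :add  :value 0}
   {:type :ok     :f :add  :value 0}
   {:type :invoke :f :add  :value 1}
   {:type :info   :f :add  :value 1}
   {:type :invoke :f :add  :value 2}
   {:type :ok     :f :add  :value 2}
   {:type :invoke :f :read :value nil}
   {:type :ok     :f :read :value #{0 1}}])

(check-set set-history)
```

Only lost and unexpected elements make a run invalid in this sketch. A recovered element is an add that timed out but still took effect, which an add-only set is allowed to do.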
  27. Invariants We Test
      • Do CQL collections (maps, sets) merge cleanly when add-only?
      • Do counters merge to accurately reflect increments/decrements?
      • Does LWT in a single datacenter give us linearizability?
      • Do materialized views converge to match the base table?
      • Do batch writes eventually get applied atomically?
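For the counter invariant above, one simple way to state "accurately reflects increments" is as a pair of bounds on the final read. The sketch below assumes increment-only workloads with non-negative amounts, treats acknowledged increments as a lower bound and attempted increments as an upper bound, and uses hypothetical names (`sum-adds`, `check-counter`, `counter-history`). Decrements would need a more careful treatment than this.

```clojure
(defn sum-adds
  "Sum of :value over history entries of the given :type with :f :add."
  [history type]
  (->> history
       (filter #(and (= type (:type %)) (= :add (:f %))))
       (map :value)
       (reduce + 0)))

(defn check-counter
  "The final read must lie between the acknowledged total (lower bound)
  and the attempted total (upper bound)."
  [history final-read]
  (let [lower (sum-adds history :ok)
        upper (sum-adds history :invoke)]
    {:valid? (<= lower final-read upper)
     :lower  lower
     :upper  upper
     :read   final-read}))

;; Hypothetical run: three increments attempted, two acknowledged, one unknown.
(def counter-history
  [{:type :invoke :f :add :value 1}
   {:type :ok     :f :add :value 1}
   {:type :invoke :f :add :value 1}
   {:type :info   :f :add :value 1}
   {:type :invoke :f :add :value 1}
   {:type :ok     :f :add :value 1}])

(check-counter counter-history 3)
```

A read below the lower bound corresponds to undercounting and a read above the upper bound to overcounting.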
  28. Failures We Consider
      • How does this work under a variety of network partitions?
      • What about with node crashes?
      • Even if nodes are flushing and compacting?
      • And when nodes are being bootstrapped?
      • Or decommissioned?
      • While clocks drift?
  29. How We Run
      • Start the Docker container
      • Install Java driver, Cassaforte, clj-ssh, and Jepsen
      • Use environment variables to point to build under test
      • Run lein test with any desired selectors and profiles
  30. Tunable Options
      • Should we make a best-effort attempt to scale test length?
      • Should we enable commitlog compression, the coordinator batchlog on materialized views, or hinted handoff?
      • Is a different compaction strategy or phi value in the failure detector appropriate for this test?
      • Should we install from a tagged release, a URL pointing to a tarball, or a local tarball?
      • Should we leave Cassandra running after the test?
  31. What We’ve Found
      • Issues with counter undercounting/overcounting (#10143)
      • Decommission race conditions causing gossip problems (#10231)
      • Write durability violations when recovering commitlog (#9851)
      • Problems with merging of collections (#10001)
      • Batchlog replay failures after decommission/crash (#10068)
      • Incorrect asserts in counter write-path when timestamps collide
      • A variety of materialized view issues during development
  32. Work We Shared
      • Minor Jepsen fixes/features (Jepsen PRs #58, 59, 62)
      • Docker images to run Jepsen tests (Docker Hub: tjake/jepsen)
      • Multibox Vagrant configurations to run Jepsen tests (on GitHub)
      • Upstream library fixes (clj-ssh PR #36)
      • Cassandra Jepsen tests (on GitHub)
      • Available on CassCI (on cassci.datastax.com)
  33. Jepsen on CassCI
  34. Lessons I Learned
      • Tests verifying invariants under failures are valuable and practical
      • These tests can and should be a part of regular development
      • Testing complex systems is hard, but there are low-hanging fruit
      • Jepsen provides one readily available way to accomplish this goal
      • Considering invariants against a recorded test run is effective
      • Invariants should be explicit and carefully considered in design
  35. Thanks
      • Jake Luciani
      • DataStax
      • The Cassandra community
      • Kyle Kingsbury
  36. QUESTIONS?
      TLA+ • TLC • TLAPS • Clojure • Formal Methods • Jepsen • CRDTs • Cassandra • Gossip • Consistency Models • Alloy • Model Checking • Testing
      @joelknighton #CassandraSummit
