This document discusses testing distributed systems. It begins by outlining some of the promises and challenges of distributed systems, including reliability, availability, scalability, and the CAP theorem. It then discusses some common things that can go wrong in distributed systems like node failures. The rest of the document proposes a framework for deterministic testing of distributed systems, including eventual assertions, process management, testing with chaos, and performance testing. It concludes by discussing some work the speaker is currently doing and recommends further reading on distributed systems.
1. TESTING DISTRIBUTED SYSTEMS IN ANGER
NOAH ARLISS
SENIOR DEVELOPMENT MANAGER WORKDAY
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
3. WHO AM I?
Workday Senior Development Manager
16+ years of experience in software development, architecture, design, and management
Distributed systems domain expert
Decentralized Security (WebLogic Enterprise Security)
Data Fabrics (Oracle Coherence)
Fabric Team (Workday)
Passionate about distributed computing and building teams to deliver complex technologies with quality and reliability
Noah Arliss
5. THE PROMISE
The distributed systems “holy grail”
Reliability
Availability
Scalability
Performance
“Let’s just throw hardware at the problem”
6. THE PARADIGM SHIFT
It’s all about partitions not physical location
Data is highly available
Idempotent operations
Run your code where the data lives
Event driven architectures
Lock free parallel algorithms vs. procedural processing
Know the rules of physics for your system
7. THE TRADEOFF: CAP THEOREM
In a distributed system, you can only have two of the following guarantees* across consecutive read and write operations:
Consistency - a read is guaranteed to return the most recent write for a given client
Availability - a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout)
Partition Tolerance - the system will continue to function when network partitions occur
*Or what I call the “you can’t have your distributed cake and eat it too” theorem
8. NETWORKS ARE UNRELIABLE
A fallacy of distributed computing is that networks are reliable - they aren’t!
In the real world, there are only two choices:
CP (Consistency/Partition Tolerance) - wait for a response from the partitioned node, which could result in a timeout error
AP (Availability/Partition Tolerance) - return the most recent version of the data the node has, which could be stale
10. WHAT COULD POSSIBLY GO WRONG
Node GC
Node deadlock
Loss of a node
Loss of a machine
Network failures
Network saturation
Over-provisioned systems
Loss of a data center
11. DETERMINISTIC TESTING: A FRAMEWORK
Challenge: develop a multi-process test suite where a single test can start and stop multiple nodes deterministically (when a predicate is satisfied)
Bonus points for running the same local test on multiple remote hosts
Bonus points for running a single unit test as a load test!
Bonus points for throwing chaos at the system
[Diagram: a local JUnit test driving three grid nodes; on a server farm/EC2, a JMeter driver running the same JUnit test against six grid nodes]
12. EVENTUAL ASSERTIONS (AN ATOMIC BUILDING BLOCK)
In a distributed system answers aren’t always readily available
Test a condition for a period of time before failing
Exponentially back off the test so as not to overtax the system
I want to…
Assert that my cluster is balanced
Assert that my process is running
Assert that my service is deployed
…
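A minimal sketch of the idea in plain Java (the `Eventually` class and its names are illustrative, not from any particular library): retry a predicate, backing off exponentially between checks, until it holds or a deadline passes.

```java
import java.util.function.BooleanSupplier;

public class Eventually {
    // Retry a predicate with exponential backoff until it passes or timeoutMillis elapses.
    public static void eventually(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long backoff = 10; // initial wait between checks, in ms
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(Math.min(backoff, deadline - System.currentTimeMillis()));
            backoff = Math.min(backoff * 2, 1_000); // back off so we don't overtax the system
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Simulate an answer that only becomes available after ~200 ms,
        // e.g. "assert that my cluster is balanced".
        eventually(() -> System.currentTimeMillis() - start > 200, 5_000);
        System.out.println("condition met");
    }
}
```

The backoff cap keeps the polling interval bounded so a slow condition is still checked regularly near the deadline.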
13. EVENTUAL ASSERTION PROJECTS
Roll your own
Awaitility - https://github.com/jayway/awaitility
Oracle Bedrock Eventually - https://github.com/coherence-community/oracle-bedrock/blob/master/bedrock-testing-support/src/main/java/com/oracle/bedrock/deferred/Eventually.java
14. PROCESS MANAGEMENT (AN ATOMIC BUILDING BLOCK)
We prefer multi-JVM tests
Deterministically and programmatically manage process lifecycle
Local host processes
Remote host processes
Extensibility through configuration over code
Container Support
AWS
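A hedged sketch of the local-process case using only `ProcessBuilder` (the `NodeLifecycle` class is illustrative, not any framework's API; `sleep 30` stands in for launching a grid-node JVM and assumes a Unix-like host): start a process, block on a readiness predicate so the test proceeds deterministically, then stop it.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative sketch: manage an external "node" process deterministically --
// start it, wait for a readiness predicate, then stop it.
public class NodeLifecycle {
    private Process process;

    public void start(String... command) throws IOException {
        process = new ProcessBuilder(command).start();
    }

    // Block until the predicate holds, polling, so the test only proceeds once ready.
    public void awaitReady(BooleanSupplier ready, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!ready.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline)
                throw new AssertionError("node not ready in time");
            Thread.sleep(50);
        }
    }

    public void stop() throws InterruptedException {
        process.destroy();                       // ask the node to shut down
        if (!process.waitFor(5, TimeUnit.SECONDS))
            process.destroyForcibly();           // escalate if it hangs
    }

    public static void main(String[] args) throws Exception {
        NodeLifecycle node = new NodeLifecycle();
        node.start("sleep", "30");               // stand-in for launching a grid-node JVM
        node.awaitReady(() -> node.process.isAlive(), 2_000);
        System.out.println("node started");
        node.stop();
        System.out.println("node stopped, alive=" + node.process.isAlive());
    }
}
```

In a real framework the readiness predicate would be something meaningful, such as "the node has joined the cluster", and the command would be built from configuration rather than hard-coded.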
[Diagram: a local JUnit test driving three grid nodes; on a server farm/EC2, a JMeter driver running the same JUnit test against six grid nodes]
15. PROCESS MANAGEMENT PROJECTS
Java Process Management Projects
Rolled our own
Ignite Project currently uses GridAbstractTest.startGrid(…) and GridAbstractTest.stopGrid(…)
Oracle Bedrock https://github.com/coherence-community/oracle-bedrock/tree/master/bedrock-runtime/src/main/java/com/oracle/bedrock/runtime
16. TESTING WITH CHAOS
The Process Monkey:
Thread 1: Perform some deterministic operation against the system
Thread 2: Validate that the operation is successful
Thread 3: Throw chaos at the system by randomly killing nodes
Example: Aggregation Test
Thread 1: Insert monotonically increasing values into a cache
Thread 2: Calculate a checksum by getting all values and ensuring their sum is equal to the highest value inserted
Thread 3: Randomly kill a node in the system
We want a network monkey too
Inspired by Netflix’s Chaos Monkey
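The three-thread pattern could be skeletonized like this (illustrative only: a local `ConcurrentHashMap` stands in for the distributed cache, and the monkey's kill action is a placeholder print rather than an actual process kill):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Skeleton of the three-thread chaos test: a writer, a validator, and a monkey.
public class AggregationChaosTest {
    public static void main(String[] args) throws Exception {
        ConcurrentMap<Long, Long> cache = new ConcurrentHashMap<>();
        AtomicLong highestInserted = new AtomicLong(0);
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Thread 1: insert monotonically increasing values into the cache.
        pool.submit(() -> {
            for (long i = 1; i <= 1_000; i++) {
                cache.put(i, i);
                highestInserted.set(i); // publish only after the put succeeds
            }
        });

        // Thread 2: validate the checksum invariant: sum(1..n) == n(n+1)/2.
        Future<Boolean> validation = pool.submit(() -> {
            while (highestInserted.get() < 1_000) {
                long n = highestInserted.get();
                long sum = 0;
                for (long i = 1; i <= n; i++) sum += cache.get(i);
                if (sum != n * (n + 1) / 2) return false; // data was lost
            }
            return true;
        });

        // Thread 3: the process monkey -- placeholder for killing a random grid node.
        pool.submit(() -> {
            while (highestInserted.get() < 1_000) {
                System.out.println("monkey: would kill a random grid node here");
                try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            }
        });

        System.out.println("invariant held: " + validation.get());
        pool.shutdownNow();
    }
}
```

In the real test the cache is a replicated grid, so the invariant should survive the monkey actually killing node processes; that is precisely what the test verifies.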
17. PERFORMANCE TESTING
JMeterJUnitTestRunner test – structure a JUnit test to run as a JMeter test
Modify @BeforeClass and @AfterClass to run at the beginning and end of the full test run
Run the same @Test N iterations across M threads
Write the results out over Graphite to InfluxDB
Track system telemetry while tracking test performance
Graphs for everything
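The "run the same test N iterations across M threads" idea can be sketched without JMeter (the `MiniLoadRunner` class and the trivial test body are illustrative): execute one test body repeatedly from a thread pool and aggregate latency.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of running one unit-test body as a load test:
// the same test body runs N times on each of M threads, with latency recorded.
public class MiniLoadRunner {
    public static void main(String[] args) throws Exception {
        int threads = 4, iterationsPerThread = 250;
        Runnable testBody = () -> {               // stand-in for a JUnit @Test body
            long x = 0;
            for (int i = 0; i < 10_000; i++) x += i;
        };

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterationsPerThread; i++) {
                    long start = System.nanoTime();
                    testBody.run();
                    totalNanos.addAndGet(System.nanoTime() - start);
                }
                done.countDown();
            });
        }
        done.await();

        int total = threads * iterationsPerThread;
        System.out.println("iterations=" + total);
        System.out.println("avg latency (us)=" + totalNanos.get() / total / 1_000);
        // In the framework described above, these numbers would be shipped
        // over Graphite to InfluxDB instead of printed.
        pool.shutdown();
    }
}
```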
18. WHERE WE ARE NOW
Internal test framework run on every Ignite/GridGain upgrade and every platform improvement
Working on giving back to the community – expect a pull request in the near future to improve multi-process JVM testing
Network Monkey – can we leverage Netflix’s Simian Army to create the same process-level problems on the network in AWS? Can we do it in our own data centers as well?
The Fabric team – putting the easy button on distributed computing (we’re hiring!)
19. RECOMMENDED READING
Kyle Kingsbury’s blog - https://aphyr.com
Brewer’s Conjecture - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
Why you can’t sacrifice the P in CAP - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
The Fallacies of distributed computing - http://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
Reliability – the ability of a component to perform its function for a desired period of time without failure
Availability – the probability that a system will respond, given that it is not in a failed state
Scalability – how well the service can grow to meet increasing performance demands
Performance – the service’s latency and throughput characteristics