This document discusses testing distributed systems. It begins by outlining some of the promises and challenges of distributed systems, including reliability, availability, scalability, and the CAP theorem. It then discusses some common things that can go wrong in distributed systems like node failures. The rest of the document proposes a framework for deterministic testing of distributed systems, including eventual assertions, process management, testing with chaos, and performance testing. It concludes by discussing some work the speaker is currently doing and recommends further reading on distributed systems.
1. TESTING DISTRIBUTED SYSTEMS IN ANGER
NOAH ARLISS
SENIOR DEVELOPMENT MANAGER WORKDAY
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
3. WHO AM I?
Workday Senior Development Manager
16+ years of experience in software development, architecture, design, and management
Distributed systems domain expert
Decentralized Security (WebLogic Enterprise Security)
Data Fabrics (Oracle Coherence)
Fabric Team (Workday)
Passionate about distributed computing and building teams to deliver complex technologies with quality and reliability
Noah Arliss
5. THE PROMISE
The distributed systems “holy grail”
Reliability
Availability
Scalability
Performance
“Let’s just throw hardware at the problem”
6. THE PARADIGM SHIFT
It’s all about partitions not physical location
Data is highly available
Idempotent operations
Run your code where the data lives
Event driven architectures
Lock free parallel algorithms vs. procedural processing
Know the rules of physics for your system
7. THE TRADEOFF: CAP THEOREM
In a distributed system, you can only have two of the following guarantees* across consecutive read and write operations:
Consistency - a read is guaranteed to return the most recent write for a given client
Availability - a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout)
Partition Tolerance - the system will continue to function when network partitions occur
*Or what I call the “you can’t have your distributed cake and eat it too” theorem
8. NETWORKS ARE UNRELIABLE
A fallacy of distributed computing is that networks are reliable - they aren’t!
In the real world, there are only two choices:
CP (Consistency/Partition Tolerance) - wait for a response from the partitioned node, which could result in a timeout error
AP (Availability/Partition Tolerance) - return the most recent version of the data the node has, which could be stale
10. WHAT COULD POSSIBLY GO WRONG
Node GC
Node deadlock
Loss of a node
Loss of a machine
Network failures
Network saturation
Over-provisioned systems
Loss of a data center
11. DETERMINISTIC TESTING: A FRAMEWORK
Challenge: develop a multi-process test suite where a single test can start and stop multiple nodes deterministically (when a predicate is satisfied)
Bonus points for running the same local test on multiple remote hosts
Bonus points for running a single unit test as a load test!
Bonus points for throwing chaos at the system
[Diagram: a local JUnit test driving three grid nodes; on a server farm/EC2, a JMeter driver running the same JUnit test against six grid nodes]
12. EVENTUAL ASSERTIONS (AN ATOMIC BUILDING BLOCK)
In a distributed system answers aren’t always readily available
Test a condition for a period of time before failing
Exponentially back off the test so as not to overtax the system
I want to…
Assert that my cluster is balanced
Assert that my process is running
Assert that my service is deployed
…
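A minimal sketch of the idea in plain Java (the `Eventually` class and its names are illustrative, not from any particular library): retry a predicate, backing off exponentially between checks, until it holds or a deadline passes.

```java
import java.util.function.BooleanSupplier;

public class Eventually {
    // Retry a predicate with exponential backoff until it passes or timeoutMillis elapses.
    public static void eventually(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long backoff = 10; // initial wait between checks, in ms
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(Math.min(backoff, deadline - System.currentTimeMillis()));
            backoff = Math.min(backoff * 2, 1_000); // back off so we don't overtax the system
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Simulate an answer that only becomes available after ~200 ms,
        // e.g. "assert that my cluster is balanced".
        eventually(() -> System.currentTimeMillis() - start > 200, 5_000);
        System.out.println("condition met");
    }
}
```

The backoff cap keeps the polling interval bounded so a slow condition is still checked regularly near the deadline.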
13. EVENTUAL ASSERTION PROJECTS
Roll your own
Awaitility - https://github.com/jayway/awaitility
Oracle Bedrock Eventually - https://github.com/coherence-community/oracle-bedrock/blob/master/bedrock-testing-support/src/main/java/com/oracle/bedrock/deferred/Eventually.java
14. PROCESS MANAGEMENT (AN ATOMIC BUILDING BLOCK)
We prefer multi-JVM tests
Deterministically and programmatically manage process lifecycle
Local host processes
Remote host processes
Extensibility through configuration over code
Container Support
AWS
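A hedged sketch of the local-process case using only `ProcessBuilder` (the `NodeLifecycle` class is illustrative, not any framework's API; `sleep 30` stands in for launching a grid-node JVM and assumes a Unix-like host): start a process, block on a readiness predicate so the test proceeds deterministically, then stop it.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative sketch: manage an external "node" process deterministically --
// start it, wait for a readiness predicate, then stop it.
public class NodeLifecycle {
    private Process process;

    public void start(String... command) throws IOException {
        process = new ProcessBuilder(command).start();
    }

    // Block until the predicate holds, polling, so the test only proceeds once ready.
    public void awaitReady(BooleanSupplier ready, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!ready.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline)
                throw new AssertionError("node not ready in time");
            Thread.sleep(50);
        }
    }

    public void stop() throws InterruptedException {
        process.destroy();                       // ask the node to shut down
        if (!process.waitFor(5, TimeUnit.SECONDS))
            process.destroyForcibly();           // escalate if it hangs
    }

    public static void main(String[] args) throws Exception {
        NodeLifecycle node = new NodeLifecycle();
        node.start("sleep", "30");               // stand-in for launching a grid-node JVM
        node.awaitReady(() -> node.process.isAlive(), 2_000);
        System.out.println("node started");
        node.stop();
        System.out.println("node stopped, alive=" + node.process.isAlive());
    }
}
```

In a real framework the readiness predicate would be something meaningful, such as "the node has joined the cluster", and the command would be built from configuration rather than hard-coded.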
[Diagram: a local JUnit test driving three grid nodes; on a server farm/EC2, a JMeter driver running the same JUnit test against six grid nodes]
15. PROCESS MANAGEMENT PROJECTS
Java Process Management Projects
Rolled our own
Ignite Project currently uses GridAbstractTest.startGrid(…) and GridAbstractTest.stopGrid(…)
Oracle Bedrock https://github.com/coherence-community/oracle-bedrock/tree/master/bedrock-runtime/src/main/java/com/oracle/bedrock/runtime
16. TESTING WITH CHAOS
The Process Monkey:
Thread 1: Perform some deterministic operation against the system
Thread 2: Validate that the operation is successful
Thread 3: Throw chaos at the system by randomly killing nodes
Example: Aggregation Test
Thread 1: Insert monotonically increasing values into a cache
Thread 2: Calculate a checksum by getting all values and ensuring their sum is equal to the highest value inserted
Thread 3: Randomly kill a node in the system
We want a network monkey too
Inspired by Netflix’s Chaos Monkey
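The three-thread pattern could be skeletonized like this (illustrative only: a local `ConcurrentHashMap` stands in for the distributed cache, and the monkey's kill action is a placeholder print rather than an actual process kill):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Skeleton of the three-thread chaos test: a writer, a validator, and a monkey.
public class AggregationChaosTest {
    public static void main(String[] args) throws Exception {
        ConcurrentMap<Long, Long> cache = new ConcurrentHashMap<>();
        AtomicLong highestInserted = new AtomicLong(0);
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Thread 1: insert monotonically increasing values into the cache.
        pool.submit(() -> {
            for (long i = 1; i <= 1_000; i++) {
                cache.put(i, i);
                highestInserted.set(i); // publish only after the put succeeds
            }
        });

        // Thread 2: validate the checksum invariant: sum(1..n) == n(n+1)/2.
        Future<Boolean> validation = pool.submit(() -> {
            while (highestInserted.get() < 1_000) {
                long n = highestInserted.get();
                long sum = 0;
                for (long i = 1; i <= n; i++) sum += cache.get(i);
                if (sum != n * (n + 1) / 2) return false; // data was lost
            }
            return true;
        });

        // Thread 3: the process monkey -- placeholder for killing a random grid node.
        pool.submit(() -> {
            while (highestInserted.get() < 1_000) {
                System.out.println("monkey: would kill a random grid node here");
                try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            }
        });

        System.out.println("invariant held: " + validation.get());
        pool.shutdownNow();
    }
}
```

In the real test the cache is a replicated grid, so the invariant should survive the monkey actually killing node processes; that is precisely what the test verifies.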
17. PERFORMANCE TESTING
JMeterJUnitTestRunner test – structure a JUnit test to run as a JMeter test
Modify @BeforeClass and @AfterClass to run at the beginning and end of the full test run
Run the same @Test N iterations across M threads
Write the results out over Graphite to InfluxDB
Track system telemetry while tracking test performance
Graphs for everything
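The "run the same test N iterations across M threads" idea can be sketched without JMeter (the `MiniLoadRunner` class and the trivial test body are illustrative): execute one test body repeatedly from a thread pool and aggregate latency.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of running one unit-test body as a load test:
// the same test body runs N times on each of M threads, with latency recorded.
public class MiniLoadRunner {
    public static void main(String[] args) throws Exception {
        int threads = 4, iterationsPerThread = 250;
        Runnable testBody = () -> {               // stand-in for a JUnit @Test body
            long x = 0;
            for (int i = 0; i < 10_000; i++) x += i;
        };

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterationsPerThread; i++) {
                    long start = System.nanoTime();
                    testBody.run();
                    totalNanos.addAndGet(System.nanoTime() - start);
                }
                done.countDown();
            });
        }
        done.await();

        int total = threads * iterationsPerThread;
        System.out.println("iterations=" + total);
        System.out.println("avg latency (us)=" + totalNanos.get() / total / 1_000);
        // In the framework described above, these numbers would be shipped
        // over Graphite to InfluxDB instead of printed.
        pool.shutdown();
    }
}
```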
18. WHERE WE ARE NOW
Internal test framework run on every Ignite/GridGain upgrade and every platform improvement
Working on giving back to the community – expect a pull request in the near future to improve multi-process JVM testing
Network Monkey – can we leverage Netflix’s Simian Army to create the same process-level problems on the network in AWS? Can we do it in our own data centers as well?
The Fabric team – putting the easy button on distributed computing (we’re hiring!)
19. RECOMMENDED READING
Kyle Kingsbury’s blog - https://aphyr.com
Brewer’s Conjecture - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
Why you can’t sacrifice the P in CAP - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
The Fallacies of distributed computing - http://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
Reliability – the ability of a component to perform its function for a desired period of time without failure
Availability – the probability that a system will respond, given that it is not in a failed state
Scalability – how well the service can grow to meet increasing performance demands
Performance – the service’s latency and throughput characteristics