Monkeys & Lemurs and Locusts Oh My - Anti-Fragile Platforms (Sean Keery, Pivotal) - Is the idea of a midnight meltdown keeping you up at night? Are the four levels of HA built into Cloud Foundry enough to put you at ease? Sean Keery will examine how leveraging a combination of exploratory testing practices, in concert with regular load and performance experiments, can simultaneously increase uptime and decrease release cycle times. He will demonstrate how operators can reduce platform risk by regularly injecting failure scenarios into BOSH deployed systems. Demonstrations of the Simian Army, Chaos Lemur and Locust.io tools will be presented. Sean will go beyond reliability, stability and availability to help your platform operations team build a continuous process improvement program which will prepare your production systems for the unexpected.
Who is a platform operator?
Who’s an application developer?
Who's familiar with Agile?
Who's heard of DevOps?
Test driven development?
Why do we do these things?
Minimize risk
Deliver continuously
Here are the types of products you should start with to have your team develop side-by-side with our developers in our office.
Strategies - bad chaos (failure) vs good chaos (users)
* The benefits of all this breaking
* What's not being broken enough
* If a black swan is a random event, how do you prepare for or simulate random events?
The CAP theorem says you can have only two of the three: consistency, availability, and partition tolerance. So you always get at least one of the consequences below.
Inconsistent user behavior
Availability that is out of your hands
Communications which are best effort
Anti-fragile systems get stronger when they are injured.
Dragons are mesmerizing, ambitious, and throw themselves into their projects with a zeal that motivates others. Dragons do not care who gets hurt in their pursuit of their ambitions.
More: http://www.gotohoroscope.com/zodiac-signs-compatibility/chinese-horoscope/dragon-monkey.html
Anybody know why this dragon is so special?
It possesses many heads ("more than the vase-painters could paint") and, each time one is lost, it is replaced by two more. It has poisonous breath and blood so virulent that even its scent is deadly. https://en.wikipedia.org/wiki/Lernaean_Hydra
Anti-fragile: you cut off its head, two more grow back.
Anybody know any examples from nature? Tamarisk, etc.
Thus exemplifying anti-fragility
Xiangliu is the name of a Chinese creature that is like the Greek Hydra. It is a nine-headed, serpent-like creature.
The hydra in this story represents Cloud Foundry
Shut down instances and availability zones, introduce lag & jitter. AWS only; can someone get to work on abstracting the IaaS in a similar manner to BOSH?
IaaS
Bad = Simian Army
Good = autoscale based on traffic to 100 cells
Black swan example - AWS (EC2 limits) & vSphere (SAN limits)
[js]
* I'm sure you'll talk about it, I'd like to see specifics on
* what do monkeys break and how they break them
* what do lemurs break and how they break them
* what do locusts break and how they break them
Demo1
Git fork: I didn't do it for the demo to save time, but you should. Also create a branch and commit often.
git clone git://github.com/Netflix/SimianArmy.git
cd SimianArmy
./gradlew build
Make sure you have installed the cf CLI - http://docs.run.pivotal.io/devguide/installcf/install-go-cli.html
wget 'https://cli.run.pivotal.io/stable?release=linux64-binary&version=6.14.0&source=github-rel'
cf api api.app.srao.layasinchana.com --skip-ssl-validation
cf login
cf target -o seanChaos -s ChaosIaaS
vi chaos.properties
cf push simians -p build/libs/simianarmy-2.5.0-SNAPSHOT.war -d app.srao.layasinchana.com
cf logs simians
Watch AWS CLI - resurrector
watch -n 10 'aws ec2 describe-instances --filters "Name=instance-state-name,Values=pending,shutting-down,stopping" "Name=key-name,Values=sraonew" | jq ".Reservations[].Instances[] | [ .InstanceId,.StateTransitionReason,.Tags[].Value,.State.Name ]"'
Mention the jq utility for filtering JSON
watch simians endpoint
watch -n 2 'curl -s -k https://simians.app.srao.layasinchana.com/api/v1/chaos'
Curl simians terminate & ssl endpoints
curl -k -X POST -H "Content-Type: application/json" -d '{"monkeyType":"CHAOS","eventType":"CHAOS_TERMINATION","eventTime":1343344105651,"region":"us-east-1","groupType":"ASG","groupName":"monkey-target","chaosType":"shutdowninstance"}' https://simians.app.srao.layasinchana.com/api/v1/chaos
curl -k -X POST -H "Content-Type: application/json" -d '{"monkeyType":"CHAOS","eventType":"CHAOS_TERMINATION","eventTime":1343344105651,"region":"us-east-1","groupType":"ASG","groupName":"monkey-target","chaosType":"BlockAllNetworkTraffic"}' https://simians.app.srao.layasinchana.com/api/v1/chaos
curl -k -X POST -H "Content-Type: application/json" -d '{"monkeyType":"CHAOS","eventType":"CHAOS_TERMINATION","eventTime":1343344105651,"region":"us-east-1","groupType":"ASG","groupName":"monkey-target","chaosType":"burncpu"}' https://simians.app.srao.layasinchana.com/api/v1/chaos
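The three curl calls above differ only in chaosType, so the JSON body can be generated instead of retyped. A minimal sketch, assuming the payload fields shown above: chaos_payload, trigger_chaos and the SIMIANS_URL variable are hypothetical names for illustration and are not part of SimianArmy.

```shell
# chaos_payload: hypothetical helper (not part of SimianArmy) that prints the
# JSON body used in the curl commands above for a given chaosType.
#   $1 = chaosType, $2 = ASG name (optional), $3 = region (optional)
chaos_payload() {
  local chaos_type="$1"
  local group_name="${2:-monkey-target}"   # ASG name used in the demo
  local region="${3:-us-east-1}"
  # eventTime is the current time in epoch milliseconds (seconds + "000").
  printf '{"monkeyType":"CHAOS","eventType":"CHAOS_TERMINATION","eventTime":%s,"region":"%s","groupType":"ASG","groupName":"%s","chaosType":"%s"}\n' \
    "$(date +%s)000" "$region" "$group_name" "$chaos_type"
}

# trigger_chaos: POST the generated payload to the chaos endpoint.
# SIMIANS_URL is an assumed environment variable holding the .../api/v1/chaos URL.
trigger_chaos() {
  curl -k -X POST -H "Content-Type: application/json" \
    -d "$(chaos_payload "$@")" "${SIMIANS_URL:?set SIMIANS_URL first}"
}
```

Usage, once SIMIANS_URL is exported: trigger_chaos shutdowninstance, trigger_chaos BlockAllNetworkTraffic, trigger_chaos burncpu.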
Talk to failure
Leave it up to participants to fix
Pros
Mature
Scheduled
All kinds of additional chaos
Docs
Cons
AWS specific
ASG needed
Java properties file
Isolation
Who's got a pipeline for platform deployment?
So let's add some tests to that
Monkey is a great place to get started in your sandbox
CATS & BATS before promotion
Demo2
Skip the git and cf login steps; they are the same as in Demo1
Touch on environment variables, as in a cloud native app
White & black lists
cf push
Targeted to deployment, job, AZ, etc.
cf logs chaos-lemur --recent, or use the PCF log viewer
Watch AWS CLI - resurrector, or use the AWS EC2 console
Watch the Chaos Lemur endpoint
curl -k https://chaosDemo:chaosDemo@chaos-lemur.app.srao.layasinchana.com/state
Curl chaos lemur terminate
curl -k -X POST -H "Content-Type: application/json" -d '{ "event": "DESTROY" }' https://chaosDemo:chaosDemo@chaos-lemur.apps.seankeery.com/chaos
Watch tasks
https://chaos-lemur.app.srao.layasinchana.com/task/1
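Rather than refreshing the task URL by hand, the check can be retried until it passes. A minimal sketch: poll_until and task_complete are hypothetical helpers, and the "status":"COMPLETE" match is an assumption about the task response that should be checked against what your Chaos Lemur actually returns.

```shell
# poll_until: hypothetical helper that retries a command until it succeeds.
#   $1 = seconds between attempts, $2 = maximum attempts, rest = the command
poll_until() {
  local interval="$1"
  local max_attempts="$2"
  shift 2
  local attempt=1
  while :; do
    if "$@"; then
      echo "succeeded after $attempt attempt(s)"
      return 0
    fi
    # Stop without sleeping once the last allowed attempt has failed.
    if [ "$attempt" -ge "$max_attempts" ]; then
      break
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "gave up after $max_attempts attempt(s)" >&2
  return 1
}

# task_complete: succeeds when the JSON at URL $1 reports a finished task.
# ASSUMPTION: the response carries "status":"COMPLETE" when the task is done.
task_complete() {
  curl -s -k "$1" | grep -q '"status" *: *"COMPLETE"'
}
```

Usage: poll_until 5 60 task_complete https://chaosDemo:chaosDemo@chaos-lemur.app.srao.layasinchana.com/task/1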
Pros
IaaS independent, leverages cpi
Multi-deployment - large scale chaos
Cons
Not as many types of chaos
Still very broad strokes
Also try -> Turbulence https://github.com/cppforlife/turbulence-release
VM termination on BOSH supported IaaSes
impose CPU/RAM/IO load
network partitioning
packet loss and delay
Anybody know the retry interval for BOSH ‘scan and fix’ ?
Try it on your deployment and let the community know.
Demo3
Skipped git & cf login
Review manifest
cf push
cf logs app1
Watch cf app app1 for metrics
Chrome: kick off swarm
Pros
Simple load testing
Easily customizable
Cons
One at a time
Here are the types of products you should start with to have your team develop side-by-side with your application developers.
Load your app or the CF components
Apps
Locust
Jmeter
Netem
PATS log replay
TDO
Explore opportunities for new tools.
We just tested a docker image.
I'm not familiar with any container level tools. So I'm building my own.
Opportunities for new tools
We can use tools together
Test driven operations – Identify Gaps
Functional
Behavioural
Stateful
Demos
Monitoring
Alerting
BIG Gap = Containers
We just tested a docker image -> no container level tools around -> pirate monkey
We gotta do the same at the container level
Correlating - as your process matures, begin using tools together.
Get with your security team. Add their tests. Don't forget the network guys.
Commonalities = metrics, alerting & logging.
Maybe we need to use teamwork?
Opportunities – Putting it all together to test corner & edge cases
Correlate to model complex behaviors
CATS (CF) & BATS (BOSH) acceptance tests
BOSH
Triggers on a user-defined schedule, selecting 0 or more VMs to destroy at random during each run.
PaaS
Chaos lemur
Turbulence
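"Selecting 0 or more VMs to destroy at random" can be pictured as an independent coin flip per VM. A minimal sketch under that assumption: select_victims is a hypothetical name, the per-VM probability is an illustrative parameter, and this is not Chaos Lemur's actual code.

```shell
# select_victims: hypothetical sketch of random victim selection.
# Reads VM names on stdin (one per line) and prints each one with
# probability $1, so a run may pick zero, some, or all of the VMs.
#   $1 = probability between 0 and 1
#   $2 = optional random seed (defaults to the current epoch second)
select_victims() {
  awk -v p="$1" -v seed="${2:-$(date +%s)}" \
    'BEGIN { srand(seed) } rand() < p { print }'
}
```

Usage: printf 'cell-0\ncell-1\nrouter-0\n' | select_victims 0.2 prints the VMs chosen for this run. Calling it from cron or another scheduler would give the user-defined schedule; the VM names here are placeholders.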
Single threaded BOSH = locks
Baselines for alerts
More Feedback loops
Target BOSH APIs https://bosh.io/docs/director-api-v1.html#post-deployment