Chaos is a ladder !

FULLSTACK TECH RADAR DAY
CHAOS is a Ladder
Haggai Philip Zagury (hagzag) | DevOps Group
& Tech Lead @ Tikal Knowledge

Haggai Philip Zagury
DevOps Group & Tech Lead -> 10+ years @ Tikal
My open thinking and open techniques ideology is driven by Open Source technologies and the
collaborative manner defining my M.O.
My solution-driven approach is strongly based on hands-on experience and a deep understanding of operating
systems, application stacks and software languages, networking, cloud in general and, today, more
and more cloud native solutions.
@hagzag

What is Chaos Engineering ?
The philosophy behind Chaos Engineering

http://bit.ly/2VQGCup
Chaos means many different
things to different people…

In 1 Sentence
‣ Chaos Engineering is the discipline of
experimenting on a distributed system in
order to build confidence in the system’s
capability to withstand turbulent
conditions in production.
Building Trust

Building Resilient Trust in systems is hard !
Backend DevOps Frontend & Mobile

Building confidence in computer systems is hard !
● Systems fail (Some “Design to Fail”)
● “Best Effort” Infra
● *aaS
● Cloud
● Cloud native
● Hybrid Cloud
● …

Experiment in Production !

Additional to “Traditional Testing”
● Chaos Engineering goes beyond
traditional (failure) testing in that it's not
only about verifying assumptions. It also
helps us explore the many unpredictable
things that could happen and discover
new properties of our inherently chaotic
systems.

Hypothesis-Driven Experiments
● Hypothesis - define your steady state
● Experiment by challenging it
● Analyse your findings - spread the word
● Action items should be noted
● Perhaps run another round with other limits / variables
● Immunize your system (eventually)
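The cycle above can be sketched in a few lines of Python. This is a minimal illustration only, not a real chaos tool: the `run_experiment` helper, the `ExperimentResult` type and the toy `system` dictionary are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds only if the system was steady before
        # the fault was injected and still steady while it was active.
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Hypothesis -> experiment -> analysis, with a guaranteed rollback."""
    if not steady_state():
        # Never start an experiment on a system that is already unhealthy.
        return ExperimentResult(steady_before=False, steady_after=False)
    try:
        inject_fault()
        after = steady_state()
    finally:
        rollback()
    return ExperimentResult(steady_before=True, steady_after=after)


# Hypothetical toy system: 3 replicas, "steady" means at least 2 are up.
system: Dict[str, int] = {"replicas_up": 3}

result = run_experiment(
    steady_state=lambda: system["replicas_up"] >= 2,
    inject_fault=lambda: system.update(replicas_up=system["replicas_up"] - 1),
    rollback=lambda: system.update(replicas_up=3),
)
print("hypothesis held:", result.hypothesis_held)
```

Running another round with other limits then only means changing the lambdas, for example removing two replicas instead of one.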

Chaos engineering is:
● Like injecting a vaccine to immunize yourself.
● Increasing system resilience by discovering vulnerabilities.
● Identifying failures before they become outages.
● Better defining your steady state (iteratively) and constantly challenging it.

Chaos engineering isn’t:
● Breaking production on purpose.
● A (new) blame mechanism.
● Surprise partial outages.
● Taking down the whole system at once.

Chaos Engineering Origins?
How did we get here ?

DevOps
2010 2011
FaaS

DevOps
2010 2011 1998
How Complex Systems Fail (Being a Short
Treatise on the Nature of Failure;
How Failure is Evaluated; How Failure is Attributed to
Proximate Cause; and the Resulting New
25 years Resilience practitioner
http://erikhollnagel.com/ideas/resilience-engineering.html
A system is resilient if it can adjust its
functioning prior to, during, or following
events (changes, disturbances, and
opportunities), and thereby sustain
required operations under both expected and
Resilience Engineering

DevOps
2010 2011 2014
Chaos Engineer
Role Announced
gremlin.com
Failure as a service
Unleash the Army
2015

DevOps
1998 2010 2011 2014 2015 2016 2017
Chaos Engineer Role Announced
Building trust in Chaos Engineering
http://erikhollnagel.com/ideas/resilience-engineering.html

Where we meet Chaos
How did we get here ?

Where we meet Chaos
Chaos starts here

In 1 Sentence
‣ Chaos Engineering is the discipline of experimenting on a
distributed system in order to build confidence in the
system’s capability to withstand turbulent
conditions in production.
‣ Preparing for the unknown …
Building Trust

Turbulent condition - failing node in a cluster
[Diagram: cluster "default" with instances of services a and b spread across its nodes]
● 2 services in a 3 node cluster
● What’s my application going to suffer from ?
● Is this OK ?
● Back to Normal
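The node-failure scenario can be explored with a toy model before touching a real cluster. The sketch below is an assumption-heavy illustration: the node names, the initial placement and the "least-loaded node" rescheduling rule are all hypothetical and are not how any particular scheduler works.

```python
from collections import Counter
from typing import Dict, List

Placement = Dict[str, List[str]]  # node name -> service instances on it


def replicas(placement: Placement) -> Counter:
    """Count running instances per service across all nodes."""
    return Counter(svc for pods in placement.values() for svc in pods)


def lose_node(placement: Placement, failed: str) -> Placement:
    """Placement right after a node fails, before anything is rescheduled."""
    return {node: list(pods) for node, pods in placement.items() if node != failed}


def reschedule(placement: Placement, failed: str) -> Placement:
    """Move the failed node's instances to the least-loaded surviving nodes."""
    survivors = lose_node(placement, failed)
    if not survivors:
        raise RuntimeError("no surviving nodes to reschedule onto")
    for svc in placement[failed]:
        # Tie-break on node name so the result is deterministic.
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(svc)
    return survivors


# Hypothetical layout: 2 services (a, b) on a 3 node cluster.
cluster: Placement = {
    "node-1": ["a", "b"],
    "node-2": ["b", "a"],
    "node-3": ["a", "a"],
}

during = replicas(lose_node(cluster, "node-1"))
after = replicas(reschedule(cluster, "node-1"))
print("before:", dict(replicas(cluster)))
print("during:", dict(during))
print("after: ", dict(after))
```

In this made-up layout service b is down to a single instance while the node is gone, which is exactly the kind of finding the "Is this OK ?" question is meant to surface.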

Turbulence

How to practice Chaos Engineering ?
Prerequisites + Tools of Chaos Engineering

Practice & Collaborate
● You should have:
● GameDays
● ChaosDays
● Controlled & scheduled drills / experiments

It’s slowly becoming a culture
https://github.com/dastergon/awesome-chaos-engineering

Automation is key !

Monitoring (ROI)
Observability
DevOps

Not just graphs and logs (that too)
● RCAs - record them and make sure they are easy to find !
● Document, Document, Document - there are great resources on how to do that.
● We don’t Chaos everything …
● Only what makes sense / repeats
● Game / Chaos Days -> keep the experiment definitions for each GameDay / ChaosDay

SLA … is innovation driven - how fast did you move without failing ?
https://cloudplatformonline.com/rs/248-TPC-286/images/DORA-State%20of%20DevOps.pdf

Experiment !

Application
Caching
Database
Hardware
Network
What layer ? - All !

The ultimate chaos “Butterfly Effect” / “Domino Effect”
● How will my application do
● without cache ?
● without a certain API available ?
● with n sessions ?
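The "without cache" question can be turned into a small experiment. The sketch below is a hypothetical in-memory model (the `Cache` class, `fetch` and `run` are invented for illustration and stand in for a real cache and database): it checks that answers stay correct when the cache disappears and measures how much extra load reaches the backend.

```python
from typing import Dict, List, Optional


class Cache:
    """In-memory cache that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.available = True
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise ConnectionError("cache unavailable")
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise ConnectionError("cache unavailable")
        self._data[key] = value


def fetch(key: str, cache: Cache, backend_calls: List[str]) -> str:
    """Read through the cache, falling back to the backend if it is down."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except ConnectionError:
        pass  # degrade gracefully: treat an outage like a cache miss
    backend_calls.append(key)  # record the load that reaches the backend
    value = key.upper()        # stand-in for a slow database query
    try:
        cache.set(key, value)
    except ConnectionError:
        pass
    return value


def run(requests: int, cache: Cache) -> int:
    """Send identical requests and return how many hit the backend."""
    calls: List[str] = []
    answers = {fetch("user-42", cache, calls) for _ in range(requests)}
    assert answers == {"USER-42"}  # steady state: answers stay correct
    return len(calls)


cache = Cache()
with_cache = run(5, cache)
cache.available = False  # the experiment: take the cache away
without_cache = run(5, cache)
print("backend calls with cache:", with_cache)
print("backend calls without cache:", without_cache)
```

The application keeps answering, but every request now reaches the backend, so the next experiment would be whether the backend survives that load with n sessions.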

Applying Chaos Engineering practices
Log | Measure
Monitor
Break Things & Auto Recover 
Experiment
Full Cycle - Chaos 
Immune
Application
Caching
Database
Hardware
Network
Security

Where is Chaos going ?
"the discipline of experimenting on
a distributed system in order to
build confidence in the system's
capability to withstand turbulent
conditions in production."

Toolz

Game-day resources
https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/
Planning your GameDay ?
Feel Free to contact me directly -  
we’d be happy to help -> hagzag@tikalk.com

Hypothesis - steady state
{
  "name": "all-our-microservices-should-be-healthy",
  "type": "probe",
  "tolerance": true,
  "provider": {
    "type": "python",
    "module": "chaosk8s.probes",
    "func": "microservice_available_and_healthy",
    "arguments": {
      "name": "myapp",
      "ns": "default"
    }
  }
}

Experiment - Terminate a pod !
● What to do
● When to do it
{
  "type": "action",
  "name": "terminate-db-pod",
  "provider": {
    "type": "python",
    "module": "chaosk8s.pod.actions",
    "func": "terminate_pods",
    "arguments": {
      "label_selector": "app=my-app",
      "name_pattern": "my-app-[0-9]$",
      "rand": true,
      "ns": "default"
    }
  },
  "pauses": {
    "after": 5
  }
}
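The probe and the action shown on these slides are fragments; a tool such as Chaos Toolkit expects them inside one experiment document. The sketch below assembles them in Python, assuming the Chaos Toolkit experiment layout (`version`, `title`, `description`, `steady-state-hypothesis`, `method`, `rollbacks`). The title and description strings are placeholders invented for this example.

```python
import json

# Steady-state probe, as on the "Hypothesis - steady state" slide.
steady_state_probe = {
    "name": "all-our-microservices-should-be-healthy",
    "type": "probe",
    "tolerance": True,
    "provider": {
        "type": "python",
        "module": "chaosk8s.probes",
        "func": "microservice_available_and_healthy",
        "arguments": {"name": "myapp", "ns": "default"},
    },
}

# Fault injection, as on the "Terminate a pod" slide.
terminate_pod_action = {
    "type": "action",
    "name": "terminate-db-pod",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
            "label_selector": "app=my-app",
            "name_pattern": "my-app-[0-9]$",
            "rand": True,
            "ns": "default",
        },
    },
    "pauses": {"after": 5},
}

experiment = {
    "version": "1.0.0",
    "title": "Losing one pod should not break the service",  # placeholder
    "description": "Terminate a random pod and re-check the steady state.",
    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [steady_state_probe],
    },
    "method": [terminate_pod_action],
    "rollbacks": [],
}

experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved to a file (for example `experiment.json`, a name chosen here), a document like this is what would typically be handed to `chaos run`, assuming Chaos Toolkit and its Kubernetes extension are installed.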

If you're just peeping / evaluating

Chaoskube
● chaoskube is a “chaos-monkey lite”: it terminates pods on a schedule
to test your resilience (with some tweaks available via
configuration)
● use --dry-run
https://github.com/linki/chaoskube
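To make the idea of a dry run concrete, here is a toy sketch of one "pick a random pod" round. It is not chaoskube's implementation and does not talk to Kubernetes: the `Pod` type, the `chaos_round` function and the pod names are hypothetical, and the selector syntax is reduced to a single `key=value` pair.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    labels: Dict[str, str]


def chaos_round(
    pods: List[Pod],
    selector: str,
    rng: random.Random,
    dry_run: bool = True,
) -> Optional[Pod]:
    """Pick one random pod matching `key=value`; only remove it if not a dry run."""
    key, _, value = selector.partition("=")
    candidates = [p for p in pods if p.labels.get(key) == value]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    if dry_run:
        print(f"[dry-run] would terminate {victim.name}")
    else:
        pods.remove(victim)
        print(f"terminated {victim.name}")
    return victim


# Hypothetical pods; a seeded generator keeps the example repeatable.
pods = [
    Pod("my-app-0", {"app": "my-app"}),
    Pod("my-app-1", {"app": "my-app"}),
    Pod("db-0", {"app": "db"}),
]
rng = random.Random(7)

chaos_round(pods, "app=my-app", rng, dry_run=True)   # nothing is removed
chaos_round(pods, "app=my-app", rng, dry_run=False)  # one pod is removed
print([p.name for p in pods])
```

Starting with the dry run lets you review which pods would be hit before any of them is actually terminated.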

kube-bench
Find vulnerabilities, configuration flags, define your own policies.

kube-hunter (Security)
1. Remote scanning - to specify remote machines for hunting, select option 1 or use
the --remote option. Example: ./kube-hunter.py --remote some.node.com
2. Internal scanning - to specify internal scanning, use the --internal option
(this will scan all of the machine's network interfaces). Example: ./kube-hunter.py --internal
3. Network scanning - to specify a specific CIDR to scan, use the --cidr option.
Example: ./kube-hunter.py --cidr 192.168.0.0/24

Many many more ….
● Stay tuned for more stuff about Chaos Engineering
● https://www.tikalk.com/community

Thank you for joining us
Haggai Philip Zagury
DevOps Group & Tech Lead @ Tikal
