Using Security to drive
Chaos Engineering
Dinis Cruz, CISO, Photobox
April 2018
https://pbx-group-security.com
I’m a CISO focused on
securing our clients’ magic moments
by creating secure environments
that enable and accelerate the business
and contribute to the
top and bottom line
Here are my challenges

How to make rational risk based decisions
How to create high performance teams
How to scale Security knowledge
How to drive and enable change
How to map data as graphs
We are also hiring :)


Head of AppSec

Head of Cloud Security
What is chaos
engineering
Success story
Netflix recommendations
Chaos Engineering is
Evolution of Testing
https://www.slideshare.net/NoraJones1/choose-your-own-adventure-qcon-2017-1
Let’s look at a number of
Chaos Engineering
definitions
Chaos Engineering
Building Confidence
in System Behaviour
through experiments
Chaos Engineering is
about trying controlled
changes to observe system
availability deviation
Chaos Engineering is
carefully injecting harm into
our systems
to test the system’s ability to
respond to it.
Chaos Engineering
is the discipline of experimenting on a
distributed system
in order to build confidence
in the system’s capability to withstand
turbulent conditions in production
(most business friendly)
Chaos Engineering is
limited scope,
continuous,
disaster recovery
PrinciplesOfChaos.org
1. Start by defining ‘steady state’ as some measurable output of a system that
indicates normal behaviour.
2. Hypothesise that this steady state will continue in both the control group and
the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard
drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state
between the control group and the experimental group.
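
The four steps above can be sketched in a few lines of Python. This is a minimal simulation, not a real chaos tool: `handle_request`, the 1% base failure rate, the `fault_rate` parameter and the 2% tolerance are all assumptions made for illustration.

```python
import random

def handle_request(rng, fault_rate=0.0):
    """Hypothetical service call: returns True on success.

    fault_rate simulates the injected real-world event (e.g. a crashed
    server) as an extra probability that the request fails.
    """
    base_failure_rate = 0.01  # assumed normal error rate
    return rng.random() >= base_failure_rate + fault_rate

def success_rate(rng, requests, fault_rate=0.0):
    """Step 1: the steady-state metric is the fraction of successful requests."""
    ok = sum(handle_request(rng, fault_rate) for _ in range(requests))
    return ok / requests

def run_experiment(fault_rate, tolerance=0.02, requests=10_000, seed=1):
    rng = random.Random(seed)
    # Step 2: hypothesise both groups stay within `tolerance` of each other.
    control = success_rate(rng, requests)
    # Step 3: introduce the variable in the experimental group only.
    experiment = success_rate(rng, requests, fault_rate)
    # Step 4: the hypothesis is disproved if the steady state deviates.
    deviation = abs(control - experiment)
    return {
        "control": control,
        "experiment": experiment,
        "hypothesis_holds": deviation <= tolerance,
    }

if __name__ == "__main__":
    print(run_experiment(fault_rate=0.0))
    print(run_experiment(fault_rate=0.2))
```

In a real system the metric would come from production monitoring rather than a random number generator; the shape of the experiment stays the same.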
Chaos in practice - 4 experiments
http://principlesofchaos.org/
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
5. Minimise Blast Radius
Advanced Principles
http://principlesofchaos.org/
The idea that “Chaos Engineering is
not testing”
is caused by
the failure to make TDD (Test-Driven
Development) scale
TDD Demo


1) Real-time

Test Execution

2) Real-time

code coverage

Chaos Engineering
is testing
the differences are the test abstractions
and the extra random layer
Security as chaos
generator
Cyber Security is a ‘change
generation factory’
Developers and Techops are
chaos creators
Security testing (and users)
are chaos creators
I have been called 

Director of chaos :)
The myth of the
single point of failure
(i.e. attackers only need to run
code and find a weak spot)
Do you understand what is
going on in your network?
The biggest threat is not the issue itself,
but not having visibility
When do you know about
security incidents? 

(or changes)
You need to know what the
attackers are 

doing on your system
(and users)
If you don’t know what is on
the pentest report …
you have a bigger problem
(i.e. your SOC should be able to tell you)
Best Security model is one
based on
the attacker making a mistake
(i.e. a change)
Use risks to understand reality
and to make the business
owners responsible for their
decisions
Use Threat Models to
understand how your system
works and to document it
Use tests to replicate
known behaviours, attacks
and simulate changes 

(with and without random events)
Which can also be called
Security tests
(which pass on vulnerable
state and on regression test)
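
One way to read "pass on vulnerable state and on regression test" is as a pair of checks: one that passes while the vulnerability exists, and its inverse that passes once the fix is in. The sketch below assumes a hypothetical comment renderer with an XSS issue; the function names and payload are illustrative only.

```python
import html

PAYLOAD = "<script>alert(1)</script>"

def render_comment_vulnerable(comment):
    """Hypothetical vulnerable renderer: echoes user input unescaped."""
    return "<p>" + comment + "</p>"

def render_comment_fixed(comment):
    """Hypothetical fixed renderer: HTML-escapes user input."""
    return "<p>" + html.escape(comment) + "</p>"

def exploit_succeeds(render):
    """Passes (True) while the vulnerability exists: documents the vulnerable state."""
    return PAYLOAD in render(PAYLOAD)

def regression_holds(render):
    """Passes (True) once the fix is in place: guards against reintroduction."""
    return PAYLOAD not in render(PAYLOAD)

if __name__ == "__main__":
    print(exploit_succeeds(render_comment_vulnerable))  # vulnerable state is reproduced
    print(regression_holds(render_comment_fixed))       # fix is locked in
```

When the fix lands, the first check starts failing, which is the signal to switch it over to the regression form.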
Some scenarios
If a server on your cloud is
mining bitcoins
Would you know about it?
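
A minimal sketch of how such a server might be noticed, assuming CPU usage samples are already being collected: flag a sustained deviation from the normal baseline. The threshold, run length and sample values are assumptions, not recommendations.

```python
def sustained_high_cpu(samples, baseline, threshold=2.0, min_run=3):
    """Return True if at least `min_run` consecutive samples exceed
    `threshold` times the baseline CPU usage."""
    run = 0
    for sample in samples:
        run = run + 1 if sample > baseline * threshold else 0
        if run >= min_run:
            return True
    return False

if __name__ == "__main__":
    normal = [20, 25, 95, 22, 21]   # hypothetical CPU %, one short spike
    mining = [20, 90, 95, 97, 96]   # hypothetical CPU %, sustained load
    print(sustained_high_cpu(normal, baseline=20))
    print(sustained_high_cpu(mining, baseline=20))
```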
If a server or app misbehaves?
When do you know about it?
If your servers start running
30% slower?
What happens to your apps?
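
The 30% question can be explored on paper before it is tried on real servers. The sketch below assumes a list of recorded request latencies and a fixed client timeout, both hypothetical, and counts how many requests would start timing out.

```python
def timeouts_after_slowdown(latencies_ms, timeout_ms, slowdown=1.3):
    """Count requests that would exceed the timeout if every call ran
    `slowdown` times slower (1.3 means 30% slower)."""
    slowed = [latency * slowdown for latency in latencies_ms]
    return sum(1 for latency in slowed if latency > timeout_ms)

# Hypothetical baseline latencies in milliseconds.
baseline = [120, 180, 240, 400, 410, 450]

before = timeouts_after_slowdown(baseline, timeout_ms=500, slowdown=1.0)
after = timeouts_after_slowdown(baseline, timeout_ms=500, slowdown=1.3)

if __name__ == "__main__":
    print(f"timeouts before: {before}, after a 30% slowdown: {after}")
```

With these made-up numbers, no request times out at normal speed, but every request that was already above roughly 385 ms crosses the 500 ms timeout once slowed down.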
If your servers fail to reboot
after a patch
What happens to your system?
When (not if) you have
malicious or API-breaking
dependencies?
How do you know about them?
If 3rd parties are using your
APIs (official or not) to dump
your users’ data (aka Facebook)
Would you know about it?
Properties of resilient
and secure systems
Availability
Plan for Failure
Ability to sustain failures
Validate and Sanitise
all requests
Authenticate and Authorise
all requests
Reduce capabilities and
features gracefully
Hostile to
insecure traffic
and
insecure code
Have error budgets 

(from Google SRE)
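
An error budget is the amount of unreliability an availability target leaves room for: one minus the target, over a given window. A small worked example, assuming a 99.9% target over 30 days (both numbers are illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability target `slo`
    (e.g. 0.999) over a window of `window_days`."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of budget left after the downtime already spent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

if __name__ == "__main__":
    total = error_budget_minutes(0.999)                    # about 43.2 minutes
    left = budget_remaining(0.999, downtime_minutes=30)    # about 13.2 minutes
    print(f"budget: {total:.1f} min, remaining: {left:.1f} min")
```

While budget remains, there is room to run experiments and push changes; once it is spent, the focus shifts to stability.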
Are easy to change
Are easy to refactor
(make changes with confidence)
Pushes to production
happen in minutes
(fully tested, and 100x a day if needed)
The bigger they
get the faster they go


(it is smooth and safe to make changes)
Have 99% change coverage
Change coverage
It is not about
test code coverage
What matters is
change coverage
If you make changes
and they are not detected
you are just making
random changes
Basically
you are an
agent of chaos
Every change you make
has to have a respective
test change 

(much better pair programming model)
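
One possible way to put a number on change coverage, sketched under stated assumptions: apply each change to the code, run the tests, and count how many changes at least one test notices. The pricing function, the three changes and the tiny test suite below are all hypothetical.

```python
def price_with_tax(pence):
    """Original behaviour (hypothetical): add 20% tax, rounding down."""
    return pence * 120 // 100

# Hypothetical changes to the behaviour above, expressed as variants.
CHANGES = {
    "tax rate changed": lambda pence: pence * 125 // 100,
    "rounding direction changed": lambda pence: -((-pence * 120) // 100),
    "negative amounts clamped": lambda pence: max(pence, 0) * 120 // 100,
}

def suite_passes(fn):
    """Tiny stand-in test suite; returns True if all checks pass."""
    return fn(10_000) == 12_000 and fn(0) == 0

def change_coverage(changes, suite):
    """Fraction of changes detected by the suite, plus their names."""
    detected = [name for name, variant in changes.items() if not suite(variant)]
    return len(detected) / len(changes), detected

if __name__ == "__main__":
    coverage, detected = change_coverage(CHANGES, suite_passes)
    print(f"change coverage: {coverage:.0%}, detected: {detected}")
```

Here only the tax rate change is detected; the other two pass silently, so two of the three changes would reach production unnoticed.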
Chaos Engineering
is modern change
management
Chaos Engineering
is the programmatic
introduction of changes
one more thing
Following from 2017 edition
https://owaspsummit.org
Collaboration @ 16x per day for 5x days
Open Security Summit 2018 - London
https://open-security-summit.org/
4th - 8th of June
Any questions
@DinisCruz
