OWASP AppSec Global 2019 Security & Chaos Engineering

@aaronrinehart @verica_io #chaosengineering
Security Chaos
Engineering

Areas Covered
● Combating Complexity in Software
● Chaos Engineering
● Resilience Engineering & Security
● Security Chaos Engineering

Aaron Rinehart, CTO, Founder
● Former Chief Security Architect
@UnitedHealth responsible for security
engineering strategy
● Led the DevOps and Open Source
Transformation at UnitedHealth Group
● Formerly at DOD, NASA, DHS, CollegeBoard
● Frequent speaker and author on Chaos
Engineering & Security
● Pioneer behind Security Chaos Engineering
● Led ChaoSlingr team at UnitedHealth
Verica

Incidents, Outages, &
Breaches are Costly

Why do they
seem to be
happening more
often?

Combating
Complexity in
Software

“The growth of complexity
in society has got ahead
of our understanding of
how complex systems
work and fail”
-Sidney Dekker

Our systems have evolved beyond human
ability to mentally model their behavior.

Our systems have evolved beyond human
ability to mentally model their behavior.
everyone
else

Circuit Breaker Patterns
Continuous
Delivery
Distributed
Systems
Blue/Green
Deployments
Cloud
Computing
Service Mesh
Containers
Immutable
Infrastructure
Infracode
Continuous
Integration
Microservice
Architectures
API Auto Canaries
CI/CD
DevOps
Automation Pipelines
Complex?

Mostly
Monolithic
Requires
Domain
Knowledge
Prevention
focused
Poorly
Aligned
Defense
in Depth
Stateful in
nature
DevSecOps
not widely
adopted
Security?
Expert
Systems
Adversary
Focused

Software has
officially
taken over

Software Only Increases in Complexity

Accidental Essential
Software Complexity

Woods' Theorem:
“As the complexity of a system
increases, the accuracy of any single
agent’s own model of that system
decreases”
- Dr. David Woods

How well do you
really understand
how your system
works?

Systems
Engineering is
Messy
In Reality…….

In the
beginning...we
think it looks like

After a few
months….
Hard Coded Passwords
Identity Conflicts
Lead Software
Engineering finds a new
job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
Cloud Provider API
Outage
WAF Outage -> Disabled
Scalability Issues
Network is Unreliable
Autoscaling Keeps
Breaking
Large Customer
Outage
Delayed Features
DNS Resolution
Errors
Expired Certificate
Regulatory
Audit
Rolling Sev1
Outage on Portal
Code Freeze

Years?….
Identity Conflicts
Lead Software Engineering
finds a new job at Google
New Security Tool
Refactor Pricing
Cloud Provider API Outage
Firewall Outage -> Disabled
Scalability Issues
Autoscaling Keeps
Breaking
Large Customer
Outage
Delayed Features
DNS Resolution
Errors
Expired Certificate
Regulatory
Audit
Rolling Sev1 Outages on
Portal
Code Freeze
Identity Conflicts
Lead Software Engineering
finds a new job at Google
New Security Tool
Refactor Pricing
Cloud Provider API Outage
WAF Outage -> Disabled
Scalability Issues
Autoscaling Keeps
Breaking
Large Customer
Outage
Delayed Features
DNS Resolution
Errors
Expired Certificate
Regulatory
Audit
Rolling Sev1 Outage on
Portal
Merger with
competitor
Misconfigured FW Rule Outage
Database Outage
Portal Retry Storm
Outage
Orphaned Documentation
Corporate Reorg
Budget Freeze
Outsource overseas
development
Exposed Secrets on
GitHub
Code Freeze
Migration to New
CSP
Upgrade to Java
SE 12

Our systems become
more complex and
messy than we
remember them

Avoid Running in the Dark

So what does all of
this $&%* have to
do with Security?

The
Normal
Condition
is to
FAIL

We need failure
to Learn & Grow

“things that have never
happened before happen all
the time”
–Scott Sagan “The Limits of Safety”

What happens when
our Security fails?

How do we typically
discover when our
security measures
fail?

Security
Incidents
Typically we don't find out our security is
failing until there is a security incident.

Vanishing
Traces
All we typically ever see is the
Footsteps in the Sand
-Allspaw
Logs, Stack Traces,
Alerts

Security incidents are
not effective measures of
detection
because at that point
it's already too late

What typically causes
our security to fail?

‘Human-Error’, Root Cause, &
Blame Culture

No System is inherently Secure by
Default; it's Humans that make them
that way.

People Operate Differently
when they expect things to
fail

Chaos
Engineering

“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s ability to withstand turbulent
conditions”
Chaos
Engineering

“[Chaos Engineering is] empirical
rather than formal. We don’t use
models to understand what the
system should do. We run
experiments to learn what it does.”
- Michael T. Nygard

Properties of a
Chaos Experiment
Game Days allow you to perform
experiments with maximum visibility
and coverage from component
owners, support teams and product.
● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
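
The six properties above can be sketched as a small data structure. This is only an illustrative sketch: the `ChaosExperiment` class, its field names, and the in-memory `system` dictionary are all hypothetical and do not come from the talk or from any particular chaos tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """Hypothetical container for the properties of a chaos experiment."""
    steady_state: Callable[[], bool]  # returns True while the system looks healthy
    hypothesis: str                   # what we expect to remain true
    inject: Callable[[], None]        # methodology: how the condition is introduced
    rollback: Callable[[], None]      # keeps the experiment readily abortable
    blast_radius: List[str]           # the only targets the experiment may touch
    observations: List[str] = field(default_factory=list)  # observability trail

    def run(self) -> str:
        # Never inject into a system that is already outside steady state.
        if not self.steady_state():
            self.observations.append("steady state not met; nothing injected")
            return "skipped"
        try:
            self.inject()
            self.observations.append("condition injected")
            outcome = "hypothesis held" if self.steady_state() else "hypothesis disproved"
        finally:
            # Roll back whether or not the check above raised.
            self.rollback()
            self.observations.append("rolled back")
        return outcome


# Usage with an in-memory stand-in for a real system (an assumption).
system = {"healthy": True, "fault": False}
exp = ChaosExperiment(
    steady_state=lambda: system["healthy"],
    hypothesis="The portal stays healthy when one dependency is degraded",
    inject=lambda: system.update(fault=True),
    rollback=lambda: system.update(fault=False),
    blast_radius=["staging-portal"],
)
result = exp.run()
```

Keeping `rollback` in a `finally` block is one way to honor "readily abortable": the injected condition is removed even if the steady-state check fails unexpectedly.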

Developing a
Learning Culture
around Failure
● Safety as part of security
● Building safety margin
into systems
● Replace blame culture with
learning culture
● Telemetry, experimentation,
and instrumentation

Chaos Engineering
Maturity
Despite what has been popularized on online
tech blogs, you do not start off performing Chaos
Engineering on live production systems. There is
a maturity ramp to getting there.
● Validate Chaos Tools in
Lower Environment
● Develop Competency &
Confidence in Tooling
● Dry-run experiments
Warning: Still be careful in Non-Prod environments as you will be surprised what
hazards lie in Non-Prod. (Kafka Story)

Chaos Monkey
Story
● During Business Hours
● Born out of Netflix Cloud
Transformation
● Put well-defined problems
in front of engineers.
● Terminate VMs on
Random VPC Instances
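
The selection rule described in these bullets (a random instance, only during business hours) might look like the following. This is a minimal sketch, not Netflix's implementation: the function names, the Monday-to-Friday 09:00-17:00 window, and the instance names are assumptions, and the actual termination call is deliberately left out.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime) -> bool:
    """Monday to Friday, 09:00-17:00 local time (an assumed window)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances: List[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one randomly chosen instance, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


# Hypothetical fleet; terminating the chosen instance is out of scope here.
fleet = ["vm-a", "vm-b", "vm-c"]
victim = pick_victim(fleet, datetime(2019, 5, 29, 10, 0), random.Random())
```

Restricting runs to business hours means engineers are at their desks to observe and respond when an instance disappears.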

Chaos Engineering Pro-Tips
● Don’t perform an experiment
when you expect it to fail
● Auto Remediation of
Experiments will end in a
fiery Hell!
● Transparency is a Must
● Webcast & Record
GameDays
● The process of creating the
experiment and sharing the
learnings is the
highest-value part of Chaos
Engineering
● Chaos Engineering Goal:
Sharing Team Mental Models
is of High Importance

Chaos Pitfalls: Auto-Remediation
“…an operator will only be able to generate successful new
strategies for unusual situations if he has an adequate
knowledge of the process.”
“ Long term knowledge develops only through use and
feedback about its effectiveness.”
— Lisanne Bainbridge, The Ironies of Automation (1983)
Bring context or chase down
vulnerabilities for the service
owner instead of automating
fixes as this leads to a Fiery
Hell!
Reference: Nora Jones 8 Traps of Chaos Engineering

Chaos Pitfalls: Breaking things on Purpose
“I'm pretty sure
I won’t have a job
very long if I
break things on
purpose all day.”
-Casey Rosenthal
The purpose of Chaos Engineering is NOT
to “Break Things on Purpose”.
If anything we are trying to “Fix them on
Purpose”!
Reference: Nora Jones 8 Traps of Chaos Engineering

Chaos Engineering
Operational Models
● Organization-Wide Chaos Engineering
Team
● Provide a Chaos Engineering Solution for
Teams to Consume
● Central Team runs periodic Chaos
Experiments as a Service
● Provide SREs with Chaos Toolsets
“At Netflix Chaos Engineering
was always meant to be a
tools practice for SREs”
- Casey Rosenthal

GameDay Exercises
● 2-4 hrs in Length
● Diverse Cross Functional Group of
Engineers
● Focused on Increasing Resilience
● Used for Manual Chaos
Engineering
● Great Introduction to Chaos
Engineering
Recommendations
● Use GameDays for New Chaos
Experiments
● Use GameDays for Initial
Experiment Deployment on New
Targets
● Use GameDays for Proving New
Chaos Engineering Tools
● Get Everyone in the Same Location

● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
Experiment Lifecycle
1. Perform a GameDay Exercise
Plan, schedule, and run a GameDay exercise for new experiments.
2. Validate Experiment Hypothesis
Goal: Validate that the experiment ran successfully and that the results are credible.
3. Remediate Findings & Repeat Experiment
If the hypothesis failed, develop a list of findings and remediate them. Once remediated, repeat the experiment.
4. Once Successful: Automate Experiment
Once the experiment has been proven to run successfully and validates your hypothesis, you can automate it to run periodically.
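
The lifecycle above (run, validate, remediate and repeat, then automate) can be sketched as a simple loop. The `run_lifecycle` function, its parameters, and the sample finding are hypothetical names chosen for illustration; a real GameDay involves people and judgment that a loop does not capture.

```python
from typing import Callable, List, Tuple


def run_lifecycle(run_experiment: Callable[[], List[str]],
                  remediate: Callable[[List[str]], None],
                  max_rounds: int = 5) -> Tuple[bool, int]:
    """Repeat an experiment until it produces no findings.

    Returns (ready_to_automate, rounds_used).
    """
    for round_number in range(1, max_rounds + 1):
        findings = run_experiment()      # GameDay run plus hypothesis validation
        if not findings:
            return True, round_number    # hypothesis validated: safe to automate
        remediate(findings)              # fix the findings, then repeat
    return False, max_rounds             # still failing: keep it manual


# Usage: one open finding that is fixed after the first round.
open_findings = ["firewall policy out of sync"]
ready, rounds = run_lifecycle(
    run_experiment=lambda: list(open_findings),
    remediate=lambda found: open_findings.clear(),
)
```

The `max_rounds` cap is a design choice for the sketch: an experiment that keeps failing stays a manual GameDay exercise rather than being scheduled automatically.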

GameDays: The Basics
● Plan & Organize GameDay Exercise
● Execute Live GameDay Operations
● Automate & Evangelize Results & Take Action
● Develop & Evaluate Chaos Experiment
● Conduct Pre-Incident Review

Security
Chaos
Engineering

Security Chaos Engineering is...
“The discipline of instrumentation, identification,
and remediation of failure within security controls
through proactive experimentation to build
confidence in the system's ability to defend
against malicious conditions in production.”

Continuous
Security
Verification

Reduce Uncertainty by
Building Confidence

Build Confidence
in
What Actually Works

Security Chaos
Engineering
Use Cases

Security Incidents
are Subjective in
Nature

We really don't know very much
Where? Why? Who?
What? How?

“Response” is the
problem with Incident
Response

Let's face it, when outages
happen…..
Teams spend too much time
reacting to outages instead
of building more resilient
systems.

Post Mortem = Preparation
Let's Flip the Model

Solution
Architecture
“More men (people) die from
their remedies not their
illnesses”
- Jean-Baptiste Poquelin

Solutions Architecture
needs reinvention
Patterns never worked
Ivory Tower Architecture

ChaoSlingr Product Features
• ChatOps Integration
• Configuration-as-Code
• Example Code & Open Framework
• Serverless App in AWS
• 100% Native AWS
• Configurable Operational Mode & Frequency
• Opt-In | Opt-Out Model

Hypothesis: If someone accidentally or
maliciously introduced a misconfigured
port then we would immediately detect,
block, and alert on the event.
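
A hypothesis like this one can be exercised against a stand-in before it is tried on real infrastructure. The sketch below is not ChaoSlingr code: `SimulatedEnvironment`, `run_port_experiment`, and the port numbers are hypothetical, and the "firewall" is an in-memory flag rather than a real control.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SimulatedEnvironment:
    """In-memory stand-in for a firewall policy and an alert feed."""
    allowed_ports: Set[int]
    open_ports: Set[int] = field(default_factory=set)
    firewall_enforcing: bool = True   # False models a policy that is out of sync
    alerts: List[str] = field(default_factory=list)

    def open_port(self, port: int) -> None:
        if self.firewall_enforcing and port not in self.allowed_ports:
            self.alerts.append(f"blocked unauthorized port {port}")
            return
        self.open_ports.add(port)


def run_port_experiment(env: SimulatedEnvironment, port: int) -> Dict[str, bool]:
    """Inject an unauthorized port, record what happened, then roll back."""
    alerts_before = len(env.alerts)
    env.open_port(port)
    result = {
        "blocked": port not in env.open_ports,
        "alerted": len(env.alerts) > alerts_before,
    }
    env.open_ports.discard(port)  # always undo the injected change
    return result


in_sync = SimulatedEnvironment(allowed_ports={443})
drifted = SimulatedEnvironment(allowed_ports={443}, firewall_enforcing=False)
held = run_port_experiment(in_sync, 2222)
disproved = run_port_experiment(drifted, 2222)
```

In this sketch `held` reports the change as blocked and alerted, while `disproved` reports neither, which is the kind of gap the experiment is meant to surface.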
[Diagram labels: Misconfigured Port Injection; Firewall?; Config Mgmt?; Log data?; Alert SOC?; IR Triage; Wait...]

Result: Hypothesis disproved. The firewall did not detect
or block the change on all instances. The Standard Port
AAA security policy was out of sync on the Portal Team
instances. The port change did not trigger an alert, and
log data indicated a successful change audit.
However, we unexpectedly learned that the configuration
management tool caught the change and alerted the SOC.

Stop looking for better
answers and start asking
better questions.
- John Allspaw

What is the system actually doing?
Has it done this before?
Why is it behaving that way?
What is it supposed to do next?
How did it get into this state?

How does My Security
Really Work?

What evidence do I
have to prove it?

Cloud Security
Readiness
● Verify SaaS Security
Controls
● Verify Cloud Native
Controls
● Verify Security
Configuration

Security
Observability
Monitoring Logging
Tracing Visualization

Security Log
Pipelines
Monitoring
Logging
Tracing
Visualization

Improve Value of
Security Log Data
● How valuable is your log
data?
● When do we ever assess
this?
● We don't know our logs
are shit until we
absolutely need them
● Proactively determine
quality of log data
around experiments
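
One way to assess log quality around an experiment is to check whether the events it produced carry the fields an investigator would need. This is a rough sketch under stated assumptions: the `REQUIRED_FIELDS` schema, the function names, and the sample events are all hypothetical and should be replaced with whatever your own incident responders actually rely on.

```python
from typing import Dict, Iterable, List

# Assumed minimum schema for a useful security log event.
REQUIRED_FIELDS = ("timestamp", "source", "actor", "action", "target")


def missing_fields(event: Dict[str, str]) -> List[str]:
    """Names of required fields that are absent or empty in one event."""
    return [name for name in REQUIRED_FIELDS if not event.get(name)]


def log_quality(events: Iterable[Dict[str, str]]) -> float:
    """Fraction of events with every required field populated (0.0 if none)."""
    events = list(events)
    if not events:
        return 0.0
    complete = sum(1 for event in events if not missing_fields(event))
    return complete / len(events)


# Hypothetical events captured during a port-change experiment.
captured = [
    {"timestamp": "2019-05-29T10:00:00Z", "source": "firewall",
     "actor": "chaos-experiment", "action": "open_port", "target": "portal-1"},
    {"timestamp": "2019-05-29T10:00:01Z", "source": "config-mgmt",
     "actor": "", "action": "revert", "target": "portal-1"},
]
score = log_quality(captured)
```

Here the second event lacks an `actor`, so only half of the captured events would answer "who did this?" during a real investigation.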

Create Objective Feedback
Loops about Security
Effectiveness

How does Security Chaos Engineering
differ from Red Teaming, Purple
Teaming, or Pen Testing?
Security
Crayons

● Distributed Systems Focus
● Goal: Experimentation
● Human Factors focused
● Small Isolated Scope
● Focus on Cascading Events
● Performed by Mixed Engineering Teams
in Gameday
● During business hours
Differences in Scope, Focus, and Method

Q&A
@aaronrinehart aaron@verica.io
