VMWare Tech Talk: "The Road from Rugged DevOps to Security Chaos Engineering"

@aaronrinehart @verica_io #chaosengineering
Road from
“Rugged to Chaos”

● Rugged DevOps Journey at United Health
Group
● Combating Complexity in Software
● Chaos Engineering
● Resilience Engineering & Security
● Security Chaos Engineering
Areas Covered

Aaron Rinehart, CTO, Founder, Verica
● Former Chief Security Architect
@UnitedHealth responsible for security
engineering strategy
● Led the DevOps and Open Source
Transformation at UnitedHealth Group
● Former: DOD, NASA, DHS, College Board
● Frequent speaker and author on Chaos
Engineering & Security
● Pioneer behind Security Chaos Engineering
● Led ChaoSlingr team at UnitedHealth

My knowledge of DevOps &
Agile when I started.

What is DevOps?
“DevOps, a movement of people who care about developing
and operating reliable, secure, high performance systems at
scale, has always — intentionally — lacked a definition or
manifesto.”
– Jez Humble, author “The DevOps Handbook”

The Phoenix Project
A Novel about IT, DevOps, and
Helping Your Business Win
•by Gene Kim, Kevin Behr and George
Spafford
Our path begins…
The DevOps
Handbook
How to create world-class agility, reliability,
and security in technology organizations
•by Gene Kim, Jez Humble, Patrick Debois
and John Willis (foreword by John
Allspaw)

Our Journey:
Developer
Enablement
Develop the Tools,
Techniques and
Processes needed to
deliver security
services in a world of
Continuous Delivery.

●Drive Security as a Function of Quality
●Building a Better Model: Continuous Delivery is
Better Security
○ Focus on Delivering Value
○ Continuous Security Model
○ Enable DevOps Strategy and Automation
A New Paradigm: Bold Steps

●Teams across Silos & Disciplines
○ 60 Developers, Operations Engineers, and Security Leaders from across the entire
company.
●Began with Six Core DevOps Security Problem Sets
○ Security Baseline + Configuration Validation w/ Chef & InSpec
○ Gauntlt Rugged Attack Framework
○ Static Code Analysis (SAST): Automating Fortify with Jenkins via API
○ Application Vulnerability Scans (DAST): Automating WebInspect with Jenkins via API
○ DevOps Self-Governance & Operationalization Framework: How does this world look from
an operational support perspective?
○ Clair Container Image Scanning: Building Image Scanning into Jenkins
A Grass Roots Beginning

Chef + InSpec:
Automated Security
Configuration &
Validation at Speed
Case Study: State
Health Exchange

●Enable Deployment & Compliance at Speed and
Scale
●Allow developers to leverage “Security”- Approved
Chef server compliance cookbooks
●Compliance is built into the initial server standup
process and immediately confirmed prior to
release for use
○ No longer a late “add on”
○ It is “just another cookbook” that can be automatically applied
Shift Left
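
The "just another cookbook" idea, where compliance is applied at server standup and confirmed before release, can be sketched in a few lines of Python. This is not Chef or InSpec code; the baseline settings, the `Finding` type, and the `release_gate` function are hypothetical names chosen for illustration only.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical security baseline: setting name -> required value.
BASELINE = {
    "ssh_root_login": "no",
    "password_min_length": 14,
    "telnet_enabled": False,
}

@dataclass
class Finding:
    setting: str
    expected: object
    actual: object

def validate(server_config: dict) -> list[Finding]:
    """Compare a server's settings to the baseline and return every deviation."""
    findings = []
    for setting, expected in BASELINE.items():
        actual = server_config.get(setting)  # a missing setting shows up as None
        if actual != expected:
            findings.append(Finding(setting, expected, actual))
    return findings

def release_gate(server_config: dict) -> bool:
    """A server is released for use only when it has no baseline deviations."""
    return not validate(server_config)
```

Because the check runs as part of standup rather than weeks later, a deviation blocks the release immediately instead of turning into a compliance ticket and rework.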

Initial Approach: 15 weeks+
● Stood up 300+ servers from service catalog over 1 weekend
● Waited weeks for extra build services beyond the catalog
● Allowed app and middleware teams to configure in parallel
● After 2+ months, we were able to apply compliance rules using
Security Blanket
○ Required 2+ weeks just to run; resulted in compliance tickets, remediation, and rework
Alternatively: “Orders of Magnitude Differential”
○ Run Time: 300 servers ⭢ 6 mins each ⭢ 30 hours total
○ Setup time: 40-100 hours ***
DevOps & State Health Exchange Migration
Timeline: March, April, May, June, July

Gauntlt: “Be Mean
to Your Code”
Case Study: Driving
Security Testing into the
Pipeline: Automated
Vulnerability Scanning

Security as a Function of Quality: Gauntlt
○ An open source application vulnerability scanner engine that enables a self-service
vulnerability resolution solution
○ Automates use of multiple vulnerability security scanning tools
○ Provides packages allowing developers to easily run self-service security checks
against their applications
○ Scans begin immediately and take only minutes to complete

Lessons Learned in
DevOps Transformation
Takeaways that will
fundamentally change the
entire strategy.

Automation & Tools
are Important but
“Don’t be Distracted
by it”
Emphasize ….
Simplification &
Standardization
….over Automation

Start Small & Focus
Shift Left……One capability at a time…

Embrace Failure as
a Friend
Plan for and expect failure as a positive
outcome. Encourage teams to fail quickly
and learn from their failures.

Seek the Input &
Passion of Others
In the end, it has
been the folks
most passionate
about each
problem that
achieved success.

Voice of the
Customer
Define, understand, and listen
to your customer as part of
your journey. You will be
surprised how eager they are to
help you.

DevSecOps over the next 5 years (written 3 years ago)
1. The Next Generation of Security Professionals will be Chosen from DevOps Teams
2. A Big Data Problem: The challenge becomes more about the data outputs than the toolsets.
3. Shared Responsibility becomes more of a reality.
4. Security is seen as an integral part of the value stream.
5. There will be a new breed of security capabilities created by Inner Source efforts, e.g. Netflix Security Monkey.

• Fail small, fail fast
• It's a culture shift, not just about automation
• Drive out complexity: Complex things don’t scale
• Avoid Analysis Paralysis: DevOps is a culture and a
living organism
• DevOps is not a fad, it is the future
• Automation: Focus on where the human adds value.
Automate everything else.
Key Takeaways

Incidents, Outages, &
Breaches are Costly

Why do they
seem to be
happening more
often?

Combating
Complexity in
Software

“The growth of complexity
in society has got ahead
of our understanding of
how complex systems
work and fail”
- Sidney Dekker

Our systems have evolved beyond human
ability to mentally model their behavior.

Circuit Breaker Patterns
Continuous Delivery
Distributed Systems
Blue/Green Deployments
Cloud Computing
Service Mesh
Containers
Immutable Infrastructure
Infracode
Continuous Integration
Microservice Architectures
API Auto Canaries
CI/CD
DevOps
Automation Pipelines
Complex?

Mostly Monolithic
Requires Domain Knowledge
Prevention focused
Poorly Aligned
Defense in Depth
Stateful in nature
DevSecOps not widely adopted
Expert Systems
Adversary Focused
Security?

Software has
officially
taken over

Software Only Increases in Complexity

Accidental Essential
Software Complexity

“As the complexity of a system
increases, the accuracy of any single
agent’s own model of that system
decreases”
- Dr. David Woods
Woods Theorem:

How well do you
really understand
how your system
works?

Systems
Engineering is
Messy
In Reality…….

In the
beginning...we
think it looks like

After a few months….
Hard Coded Passwords
Identity Conflicts
Lead Software Engineer finds a new job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
Cloud Provider API Outage
WAF Outage -> Disabled
Scalability Issues
Network is Unreliable
Autoscaling Keeps Breaking
Large Customer Outage
Delayed Features
DNS Resolution Errors
Expired Certificate
Regulatory Audit
Rolling Sev1 Outage on Portal
Code Freeze

Years?….
Identity Conflicts
Lead Software Engineer finds a new job at Google
New Security Tool
Refactor Pricing
Cloud Provider API Outage
Firewall Outage -> Disabled
WAF Outage -> Disabled
Scalability Issues
Autoscaling Keeps Breaking
Large Customer Outage
Delayed Features
DNS Resolution Errors
Expired Certificate
Regulatory Audit
Rolling Sev1 Outages on Portal
Code Freeze
Merger with competitor
Misconfigured FW Rule Outage
Database Outage
Portal Retry Storm Outage
Orphaned Documentation
Corporate Reorg
Budget Freeze
Outsource overseas development
Exposed Secrets on GitHub
Migration to New CSP
Upgrade to Java SE 12

Our systems become
more complex and
messy than we
remember them

Avoid Running in the Dark

So what does all of
this $&%* have to
do with Security?

The
Normal
Condition
is to
FAIL

We need failure
to Learn & Grow

“things that have never
happened before happen all
the time”
–Scott Sagan “The Limits of Safety”

What happens when
our Security fails?

How do we typically
discover when our
security measures
fail?

Security
Incidents
Typically we don't find out our security is
failing until there is a security incident.

Vanishing
Traces
All we typically ever see is the
Footsteps in the Sand
-Allspaw
Logs, Stack Traces,
Alerts

Security incidents are
not effective measures of
detection
because at that point
it's already too late

What typically causes
our security to fail?

‘Human-Error’, Root Cause, &
Blame Culture

No System is inherently Secure by
Default; it's Humans that make them
that way.

People Operate Differently
when they expect things to
fail

Chaos
Engineering

“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s ability to withstand turbulent
conditions”
Chaos
Engineering

“[Chaos Engineering is] empirical
rather than formal. We don’t use
models to understand what the
system should do. We run
experiments to learn what it does.”
- Michael T. Nygard

Chaos Engineering
Maturity
Despite what has been popularized on online
tech blogs, you do not start off performing Chaos
Engineering on live production systems. There is
a maturity ramp to getting there.
● Validate Chaos Tools in
Lower Environment
● Develop Competency &
Confidence in Tooling
● Dry-run experiments
Warning: Still be careful in Non-Prod environments as you will be surprised what
hazards lie in Non-Prod. (Kafka Story)

Chaos Monkey
Story
● During Business Hours
● Born out of Netflix Cloud
Transformation
● Put well defined problems
in front of engineers.
● Terminate VMs on
Random VPC Instances
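
The two mechanics on this slide, random termination and running only during business hours, can be illustrated with a short Python sketch. This is not Netflix's Chaos Monkey code. The business-hours window, the `run_once` function, and the `terminate` callback (which stands in for whatever cloud API would actually stop an instance) are assumptions made for illustration.

```python
import random
from datetime import datetime
from typing import Callable, Optional, Sequence

def in_business_hours(now: datetime) -> bool:
    """Monday to Friday, 09:00 to 16:59 local time (an assumed definition)."""
    return now.weekday() < 5 and 9 <= now.hour < 17

def run_once(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate one randomly chosen instance, but only during business hours.

    Returns the id of the terminated instance, or None if nothing was done.
    """
    if not instances or not in_business_hours(now):
        return None
    if rng is None:
        rng = random.Random()
    victim = rng.choice(list(instances))
    terminate(victim)  # hypothetical hook; a real tool would call a cloud API here
    return victim
```

Restricting runs to business hours means the failure lands while engineers are present to observe and respond, which is what puts a well defined problem in front of them.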

Chaos Pitfalls: Auto-Remediation
“…an operator will only be able to generate successful new
strategies for unusual situations if he has an adequate
knowledge of the process.”
“Long term knowledge develops only through use and
feedback about its effectiveness.”
— Lisanne Bainbridge, The Ironies of Automation (1983)
Bring context or chase down
vulnerabilities for the service
owner instead of automating
fixes as this leads to a Fiery
Hell!
Reference: Nora Jones 8 Traps of Chaos Engineering

Chaos Pitfalls: Breaking things on Purpose
“I'm pretty sure
I won’t have a job
very long if I
break things on
purpose all day.”
-Casey Rosenthal
The purpose of Chaos Engineering is NOT
to “Break Things on Purpose”.
If anything we are trying to “Fix them on
Purpose”!
Reference: Nora Jones 8 Traps of Chaos Engineering

GameDay Exercises
● 2-4 hrs in Length
● Diverse Cross Functional Group of
Engineers
● Focused on Increasing Resilience
● Used for Manual Chaos
Engineering
● Great Introduction to Chaos
Engineering
Recommendations
● Use GameDays for New Chaos
Experiments
● Use GameDays for Initial
Experiment Deployment on New
Targets
● Use GameDays for Proving New
Chaos Engineering Tools
● Get Everyone in the Same Location

● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
Experiment Lifecycle
1. Perform a GameDay Exercise
Plan, schedule, and run a GameDay exercise for new experiments.
2. Validate Experiment Hypothesis
Goal: Validate that the experiment ran successfully and that the results are credible.
3. Remediate Findings & Repeat Experiment
If the hypothesis failed, develop and remediate a list of findings. Once remediated, repeat the experiment.
4. Once Successful: Automate Experiment
Once the experiment has been proven to run successfully and validate your hypothesis, automate it to run periodically.
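
The experiment properties listed on this slide (steady state, hypothesis, blast radius, readily abortable) can be made concrete with a minimal Python sketch. The `Experiment` class and its field names are hypothetical and not taken from any particular chaos tool. The callbacks stand in for real health checks and fault injection.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Experiment:
    hypothesis: str
    steady_state: Callable[[], bool]   # True when the system looks healthy
    inject: Callable[[str], None]      # introduce the condition on one target
    rollback: Callable[[str], None]    # undo the condition on one target
    blast_radius: List[str]            # the only targets the experiment may touch
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Return True only if the steady state held through every injection."""
        if not self.steady_state():
            self.log.append("aborted: system not in steady state before start")
            return False
        touched: List[str] = []
        try:
            for target in self.blast_radius:
                self.inject(target)
                touched.append(target)
                self.log.append(f"injected: {target}")
                if not self.steady_state():
                    # Abort as soon as the steady state is lost.
                    self.log.append(f"steady state lost after {target}; aborting")
                    return False
            return True
        finally:
            # Always undo what was injected, in reverse order.
            for target in reversed(touched):
                self.rollback(target)
                self.log.append(f"rolled back: {target}")
```

The log plays the role of observability in this sketch: it records what was injected, where the steady state was lost, and what was rolled back, which is the evidence needed to judge whether the results are credible.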

GameDays: The Basics
Plan & Organize GameDay Exercise
Develop & Evaluate Chaos Experiment
Execute Live GameDay Operations
Conduct Pre-Incident Review
Automate & Evangelize Results & Take Action

Security
Chaos
Engineering

“The discipline of instrumentation, identification,
and remediation of failure within security controls
through proactive experimentation to build
confidence in the system's ability to defend
against malicious conditions in production.”
Security Chaos Engineering is...

Continuous
Security
Verification

Reduce Uncertainty by
Building Confidence

Build Confidence
in
What Actually Works

Security Chaos
Engineering
Use Cases

Security Incidents
are Subjective in
Nature

We really don't know
Where? Why? Who?
What?How?
very much

“Response” is the
problem with Incident
Response

Let's face it, when outages
happen…..
Teams spend too much time
reacting to outages instead
of building more resilient
systems.

Post Mortem = Preparation
Let's Flip the Model

Solution
Architecture
“More men (people) die from
their remedies, not their
illnesses”
- Jean-Baptiste Poquelin

Solutions Architecture
needs reinvention
Patterns never worked
Ivory Tower Architecture

• ChatOps Integration
• Configuration-as-Code
• Example Code & Open Framework
ChaoSlingr Product Features
• Serverless App in AWS
• 100% Native AWS
• Configurable Operational Mode &
Frequency
• Opt-In | Opt-Out Model
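
The Opt-In | Opt-Out model can be read as a rule for deciding which resources an experiment is allowed to target. The Python sketch below shows one way such a rule could work. The tag key, the tag values, and the `eligible_targets` function are hypothetical and do not reflect ChaoSlingr's actual configuration format.

```python
from typing import Dict, List

# Hypothetical tag key and values; the real tool's configuration may differ.
TAG_KEY = "chaos"

def eligible_targets(resources: List[Dict], mode: str) -> List[str]:
    """Select resource ids that experiments may touch.

    mode="opt-in":  only resources explicitly tagged chaos=in are eligible.
    mode="opt-out": every resource is eligible unless tagged chaos=out.
    """
    if mode not in ("opt-in", "opt-out"):
        raise ValueError(f"unknown mode: {mode!r}")
    selected = []
    for res in resources:
        tag = res.get("tags", {}).get(TAG_KEY)
        if mode == "opt-in" and tag == "in":
            selected.append(res["id"])
        elif mode == "opt-out" and tag != "out":
            selected.append(res["id"])
    return selected
```

Starting in opt-in mode keeps the blast radius limited to teams that have volunteered. Switching to opt-out later widens coverage without changing the experiments themselves.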

Hypothesis: If someone accidentally or
maliciously introduced a misconfigured
port then we would immediately detect,
block, and alert on the event.
Alert
SOC?
Config
Mgmt?
Misconfigured
Port Injection
IR
Triage
Log
data?
Wait...
Firewall?

Result: Hypothesis disproved. Firewall did not detect
or block the change on all instances. Standard Port
AAA security policy out of sync on the Portal Team
instances. Port change did not trigger an alert and
log data indicated successful change audit.
However, we unexpectedly learned that the configuration
mgmt tool caught the change and alerted the SOC.
Alert
SOC?
Config
Mgmt?
Misconfigured
Port Injection
IR
Triage
Log
data?
Wait...
Firewall?
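
The misconfigured-port experiment compares what the hypothesis said each control would do with what was actually observed. A minimal Python sketch of that comparison follows. The control names and the `evaluate` function are hypothetical. The sample values simply restate the outcome described on the slide and are not real log data.

```python
from typing import Dict

def evaluate(expected: Dict[str, bool], observed: Dict[str, bool]) -> Dict[str, str]:
    """Label each control as 'as expected', a 'gap', or a 'surprise'."""
    report = {}
    for control, should_fire in expected.items():
        fired = observed.get(control, False)
        if fired == should_fire:
            report[control] = "as expected"
        elif fired:
            report[control] = "surprise"
        else:
            report[control] = "gap"
    # Controls that fired but were not part of the hypothesis are also surprises.
    for control, fired in observed.items():
        if control not in expected and fired:
            report[control] = "surprise"
    return report

def hypothesis_held(report: Dict[str, str]) -> bool:
    """The hypothesis holds only if no expected control left a gap."""
    return "gap" not in report.values()

# Illustrative values mirroring the slide: the firewall and the port-change
# alert did not fire, while the configuration management tool did.
expected = {"firewall_detect_and_block": True, "port_change_alert": True}
observed = {
    "firewall_detect_and_block": False,
    "port_change_alert": False,
    "config_mgmt_alert_to_soc": True,
}
report = evaluate(expected, observed)
```

Recording surprises alongside gaps matters here: the unexpected alert from configuration management is as much a finding as the firewall's failure to block.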

Stop looking for better
answers and start asking
better questions.
- John Allspaw

Q&A
@aaronrinehart aaron@verica.io
