The speed and scale of complex system operations within cloud-driven architectures make them extremely difficult for humans to mentally model their behavior. This often results in unpredictable and catastrophic outcomes that become costly when unexpected security incidents occur. There is a need to realign the actual state of operational security measures in order to maintain an acceptable level of confidence that our security actually works when we need it to.
As an alternative to simply reacting to failures, the security industry has been overlooking valuable chances to further understand and nurture ‘accidents’ or ‘mistakes’ as opportunities to proactively strengthen system resilience. Chaos Engineering allows us to proactively expose the failures, build resilient systems, and develop an "Applied Security" model to minimize the impact of failures.
Chaos Engineering allows for security teams to proactively experiment and derive new information about underlying factors that were previously unknown. This is done by developing live fire exercises that can be measured, managed, and automated. Contrary to Red/Purple Team exercises, chaos engineering does not use threat actor or adversarial tactics, techniques and procedures. As far as we know it Chaos Engineering is the only proactive mechanism for detecting availability and security incidents before they happen. We proactively introduce turbulent conditions, faults, and failures into our systems to determine the conditions by which our security will fail before it actually does.
In this session we will introduce a new concept known as Security Chaos Engineering and how it can be applied to create highly secure, performant, and resilient distributed systems.
24. After a few
months….
Hard Coded Passwords
Identity Conflicts
Lead Software
Engineering finds a new
job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
Cloud Provider API
Outage
WAF Outage -> DisabledScalability Issues
Network is Unreliable
Autoscaling Keeps
Breaking
Large CustomerDelayed Features
DNS Resolution
ErrorsExpired Certificate
Regulatory
Audit
Rolling Sev1
Outage on Portal
Code Freeze
25. Years?….
Hard Coded Passwords
Identity Conflicts
Lead Software Engineering
finds a new job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 4000 Microservices
Cloud Provider API Outage
Firewall Outage -> Disabled
Scalability Issues
Network is Unreliable
Autoscaling Keeps
Breaking
Large Customer
Outage
Delayed Features
DNS Resolution
Errors
Expired Certificate
Regulatory
Audit
Rolling Sev1 Outages on
Portal
Code Freeze
Hard Coded Passwords
Identity Conflicts
Lead Software Engineering
finds a new job at Google
New Security Tool
Refactor Pricing
300 Microservices Δ-> 850 Microservices
Cloud Provider API Outage
WAF Outage -> DisabledScalability Issues
Network is Unreliable
Autoscaling Keeps
Breaking
Large CustomerDelayed Features
DNS Resolution
ErrorsExpired Certificate
Regulatory
Audit
Rolling Sev1 Outage on
Portal
Merger with
competitor
Misconfigured FW Rule Outage
Database Outage
Portal Retry Storm
Outage
Orphaned Documentation
Corporate Reorg
Budget Freeze
Outsource overseas
development
Exposed Secrets on
GithuCode Freeze
b
Migration to New
CSP
Upgrade to Java
SE 12
42. “Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s ability to withstand turbulent
conditions”
Chaos Engineering
48. ● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
Chaos Monkey
Story
● During Business Hours
● Born out of Netflix Cloud
Transformation
● Put well defined problems
in front of engineers.
● Terminate VMs on
Random VPC Instances
49. ● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
Chaos Pitfalls:Breaking things on Purpose
“I'm pretty sure
I won’t have a job
very long if I
break things on
purpose all day.”
-Casey Rosenthal
The purpose of Chaos Engineering is NOT
to “Break Things on Purpose”.
If anything we are trying to “Fix them on
Purpose”!
Reference: Nora Jones 8 Traps of Chaos Engineering
63. • ChatOps Integration
• Configuration-as-Code
• Example Code & Open Framework
ChaoSlingr Product Features
• Serverless App in AWS
• 100% Native AWS
• Configurable Operational Mode &
Frequency
• Opt-In | Opt-Out Model
64. Hypothesis: If someone accidentally or
maliciously introduced a misconfigured
port then we would immediately detect,
block, and alert on the event.
Alert
SOC?
Config
Mgmt?
Misconfigured
Port Injection
IR
Triage
Log
data?
Wait...
Firewall?
65. Result: Hypothesis disproved. Firewall did not detect
or block the change on all instances. Standard Port
AAA security policy out of sync on the Portal Team
instances. Port change did not trigger an alert and
log data indicated successful change audit.
However we unexpectedly learned the configuration
mgmt tool caught change and alerted the SoC.
Alert
SOC?
Config
Mgmt?
Misconfigured
Port Injection
IR
Triage
Log
data?
Wait...
Firewall?
66. Stop looking for better
answers and start asking
better questions.
- John Allspaw