SOC Lessons from DevOps and SRE by Dr Anton Chuvakin - RSA 2023 Google Cloud sideshow presentation focused on using select DevOps and SRE lessons to make your SOC better
3. First Things First: A SOC is Still … a SOC :-)
“A security operations center provides centralized and consolidated cybersecurity incident prevention, detection and response capabilities.” (Gartner)
SOC is first a team. Next a process. And it uses technology too.
4. 2003 or 2023? Sec Ops is Ripe for Transformation
● We can’t store and analyze all data, resulting in blindspots
● It’s cost prohibitive to ingest all the data we need
● It takes too long to investigate alerts
● We struggle to build effective detection and have too many false positives/negatives
● Our processes are too manual, we are too slow to respond to and remediate threats
● We don’t have enough skilled engineers to make everything work
6. BARD SAYS: What is the essence of the DevOps revolution in IT?
DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high quality. DevOps is a culture, movement, or practice that emphasizes the collaboration and communication of both software developers and other information technology professionals while automating the process of software delivery and infrastructure changes.
The DevOps revolution in IT is about breaking down the silos between development and operations teams and creating a more collaborative and efficient way of working. It’s about automating tasks, reducing manual errors, and improving communication and collaboration.
DevOps can help organizations deliver software faster, more reliably, and with higher quality. It can also help organizations reduce costs and improve efficiency.
7. Google vs Enterprise “SecOps”
Problem: Efficiency
What does Google do? Automation/SRE is a mindset: part of the hiring process, part of OKRs, and performance reviews
What do most enterprises do? Experimenting with SOAR, full adoption is tough due to minimal automation culture
Problem: Employee Shortage
What does Google do? Requires coding interviews, high pay, attracts the best, invests in growth
What do most enterprises do? Hires traditional roles, no coding, rarely outsources, less pay, less growth, more stress
Problem: Employee Burnout
What does Google do? 40/40/20 between eng, operations, and learning
What do most enterprises do? Utilization is almost always >100%
Problem: Expensive
What does Google do? Investment in efficiency solves for human costs
What do most enterprises do? Cost-prohibitive data ingestion, oftentimes paying SIEM + DIY, increasing $ from complexity
Problem: Efficacy
What does Google do? TI strongly embedded in D&R, mostly utilized towards proactive work, strong collaboration across Alphabet & benefits from developer hygiene
What do most enterprises do? CTI team produces great reports, SOC doing fire drills, >90% false positive rate, uneven distribution of skill (Tier 3)
9. Let’s focus on 5 key areas today
Eliminate Toil
Use SLOs
Evolve Automation
Practice Release Engineering
Strive for Simplicity
10. Eliminate Toil
01
“...manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Causes of Toil
1. Too much technical debt
2. Priorities or goals are not aligned
3. Lack of training or support
4. Lack of collaboration
5. The business value to fix is too hard to realize
Less Gathering, More Analysis: basics to automate
● Gathering machine information
● Gathering user information
● Process executions
● All context needed to help get to final (human) judgement
Key Activities To Implement
● Train your team on toil & automation
● Create an Automation Queue
● Implement Blameless Postmortems
● Conduct Weekly Incident Reviews
● Implement SOAR
● Hire Automation Engineer(s)
● Implement CD/CR pipelines with metrics
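The "Less Gathering, More Analysis" items on this slide can be sketched as a small enrichment step that runs before an alert reaches an analyst. This is only an illustrative sketch: the `Alert` class, the `enrich` function, and the in-memory `MACHINES`, `USERS`, and `PROCESS_EXECUTIONS` lookups are hypothetical stand-ins, not anything named in the talk. In practice they would be calls to whatever asset inventory, identity, and endpoint tooling a SOC already has.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory lookups standing in for real asset, identity, and endpoint sources.
MACHINES = {"host-17": {"owner": "alice", "os": "linux", "criticality": "high"}}
USERS = {"alice": {"department": "finance", "privileged": False}}
PROCESS_EXECUTIONS = {"host-17": ["sshd", "curl", "python3"]}


@dataclass
class Alert:
    alert_id: str
    host: str
    user: str
    context: dict = field(default_factory=dict)


def enrich(alert: Alert) -> Alert:
    """Attach machine, user, and process context so the analyst starts at analysis."""
    alert.context["machine"] = MACHINES.get(alert.host, {})
    alert.context["user"] = USERS.get(alert.user, {})
    alert.context["processes"] = PROCESS_EXECUTIONS.get(alert.host, [])
    # Record what could not be gathered so gaps stay visible to the human reviewer.
    alert.context["missing"] = [
        key for key in ("machine", "user", "processes") if not alert.context[key]
    ]
    return alert


if __name__ == "__main__":
    enriched = enrich(Alert("a-1", "host-17", "alice"))
    print(enriched.context)
```

The final judgement stays with a human; the code only removes the gathering work that would otherwise be repeated by hand for every alert.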
11. Evolve Automation
02
10X is an Underestimate!
● Analyst utilization gets optimized
● More creative work, less toil
● Time back to do more proactive work
● Deeper operationalization of intel
● SecOps can scale with the business!
[Chart: unit costs per event]
12. Use SLOs
Tips, gotchas, and core metrics to consider
03
Core metrics
● Event volume
● Event source counts
● Pipeline latency
● Triage time median
● Triage time at 95%
● Incident resolution times
● Common metrics (false positives, # of incidents, etc)
Key tips
1. Optimize metrics for optimal value
2. Manage with indicators + objectives
3. Metrics matter (in context)
4. Defeating attackers beats SLOs
5. Choose metrics that actually matter
6. Make your SLOs open (within your company)
Gotchas
● Fast =/= better: don’t incentivize speed, incentivize thoroughness
● More =/= better: solving 5000 cases manually is not better than automating of that; #NoHeroes
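As a rough illustration of "manage with indicators + objectives", the two triage metrics from this slide (median and 95th percentile) can be computed from raw triage durations and compared with targets. The function names, sample durations, and objective values below are assumptions made for the sketch; the talk does not give concrete targets.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def triage_sli(triage_minutes):
    """Return the two triage indicators from the slide: median and 95th percentile."""
    return {
        "median": statistics.median(triage_minutes),
        "p95": percentile(triage_minutes, 95),
    }


def check_slo(sli, objectives):
    """Compare each indicator with its objective; True means the objective is met."""
    return {name: sli[name] <= target for name, target in objectives.items()}


if __name__ == "__main__":
    # Hypothetical triage durations in minutes for one week of alerts.
    samples = [4, 6, 7, 9, 12, 15, 18, 22, 35, 240]
    objectives = {"median": 15, "p95": 60}  # assumed targets, not from the talk
    sli = triage_sli(samples)
    print(sli, check_slo(sli, objectives))
```

Tracking the 95th percentile next to the median makes a long tail of slow cases visible even when the typical case looks healthy. In line with the gotchas above, a missed objective is a prompt to look at the cases, not a reason to push analysts toward speed.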
13. Practice Release Engineering
04
Ad Hoc
Visibility
● Onboard log sources as part of major tech transformation
● Review logs for new sources when a problem comes up
Security Analytics
● Significant Development effort up front to implement use cases
● Add/Update detections in response to major new threat
Response Orchestration
● Significant Development effort up front to implement playbooks
● Review playbooks when a major problem occurs
Periodic
Visibility
● Onboard log sources annual or quarterly planned schedule
● Review data monthly for new log sources and to identify issues/outages
Security Analytics
● Quarterly review of detection efficacy
● Update/Deprecate ineffective detections
● Add/Update detections in response to major threat
Response Orchestration
● Quarterly review of playbook performance and effectiveness
● Dev Sprint to update playbooks
Continuous
Visibility
● Onboard new log sources as they are ready.
● Real-time identification of new log sources or log drops
● Automatic creation of alerts for handling
Security Analytics
● Real-time alerts for detection efficacy drift
● Update/Deprecate ineffective detections at point of discovery
● Active Threat Monitoring to proactively identify new threats to build detections for
Response Orchestration
● Live Dashboards showing performance and accuracy metrics for playbooks
● Update/Deprecate ineffective playbooks at point of discovery
● Daily Review of SecOps work queues to identify automation opportunities
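One way to picture "real-time alerts for detection efficacy drift" is a rolling window over analyst dispositions for each detection, flagging the detection for review as soon as its share of true positives falls below a floor. The `EfficacyTracker` class, the window size, and the thresholds are all hypothetical choices for this sketch rather than anything prescribed in the talk.

```python
from collections import deque


class EfficacyTracker:
    """Track the share of true positives over the most recent dispositions of one detection."""

    def __init__(self, window=50, min_precision=0.10, min_samples=20):
        # True = true positive, False = false positive; old entries fall out of the window.
        self.outcomes = deque(maxlen=window)
        self.min_precision = min_precision
        self.min_samples = min_samples

    def record(self, true_positive):
        self.outcomes.append(bool(true_positive))

    def precision(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """Flag the detection once enough dispositions exist and precision is below the floor."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.precision() < self.min_precision


if __name__ == "__main__":
    tracker = EfficacyTracker()
    for i in range(40):
        tracker.record(i % 20 == 0)  # hypothetical stream: 2 true positives in 40 alerts
    print(tracker.precision(), tracker.needs_review())
```

Evaluating the flag on every recorded disposition, rather than in a quarterly review, is what moves a detection from the Periodic level to the Continuous level described on this slide.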
14. Strive for Simplicity
05
One consequence of not striving for simplicity
Human expertise in complex systems is constantly changing
"Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave. In every case, training and refinement of skill and expertise is one part of the function of the system itself. At any moment, therefore, a given complex system will contain practitioners and trainees with varying degrees of expertise.
Critical issues related to expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or demanding production needs and (2) the need to develop expertise for future use."
https://how.complexsystems.fail/
15. Actions
● Reduce toil in your SOC: shift toil to machines
● Use SLOs / metrics to drive change
● Evolve automation in SIEM, SOAR, threat intel, etc
● Practice release engineering for consistent improvement
● Strive for simplicity with processes, technology stack, etc
16. The Power of Continuous Improvement
● Exponential growth happens faster when compounded more frequently
● Organizing your people and processes around continuous improvement means more agility and fewer resources
● Periodic improvement strategies leave capability gaps between sprints
[Chart: improvement over time]
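The claim that growth compounds faster when applied more frequently follows from the standard compounding formula (1 + r/n)^(n·t). The sketch below applies it to improvement cadence; the 100% nominal yearly rate and the chosen cadences are assumptions for illustration only, not figures from the talk.

```python
def compounded_growth(rate_per_year, periods_per_year, years=1.0):
    """Multiplier after `years` when a nominal yearly rate is applied in equal, more frequent steps."""
    if periods_per_year <= 0:
        raise ValueError("periods_per_year must be positive")
    return (1 + rate_per_year / periods_per_year) ** (periods_per_year * years)


if __name__ == "__main__":
    # Assumed nominal improvement of 100% per year; the cadences are illustrative.
    for label, n in [("annual", 1), ("quarterly", 4), ("weekly", 52), ("daily", 365)]:
        print(f"{label:>9}: x{compounded_growth(1.0, n):.3f}")
```

With the same nominal rate, each step toward a more frequent cadence yields a larger multiplier over the year, although the gain flattens as the cadence approaches continuous.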
17. Resources
● “Achieving Autonomic Security Operations: Reducing toil”
● “Achieving Autonomic Security Operations: Why metrics matter (but not how you think)”
● “More SRE Lessons for SOC: Release Engineering Ideas”
● “Achieving Autonomic Security Operations: Automation as a Force Multiplier”
● “More SRE Lessons for SOC: Simplicity Helps Security”
Security operations is definitely ripe for transformation: many of the challenges that SecOps teams face have been around for years.
We went through a significant modernization effort ourselves, especially in the years after the Aurora attack.
In 2015, we had minimal automation in place, and there was a high unit cost to managing D&R events.
Over the years, Alphabet’s estate grew exponentially, but we were able to achieve a 90% level of efficiency, thanks to our program grounded in SRE-based approaches.
Through the years, this radical focus on automation freed up time to allow our engineers to focus on higher-order events:
● More creative work, less toil-based work
● More proactive work and threat hunting
● Better consumption, creation, and operationalization of threat intelligence across our workflows
Most importantly, our engineers have significant influence with upstream development teams, where entire classes of threats can be mitigated before they hit the D&R workflow.
We’ve taken these learnings and paired them with our commercial capabilities to help our customers transform their SOC.