SOC Lessons from DevOps and SRE by Anton Chuvakin

Dr. Anton Chuvakin, Office of the CISO, Google Cloud

01 SOC, SecOps, Security Operations Reminder

“A security operations center provides centralized and consolidated cybersecurity incident prevention, detection and response capabilities.” (Gartner)
First Things First: A SOC is Still … a SOC :-)
SOC is first a team. Next a process. And it uses technology too.

2003 or 2023? SecOps is Ripe for Transformation
● We can’t store and analyze all data, resulting in blind spots
● It’s cost-prohibitive to ingest all the data we need
● It takes too long to investigate alerts
● We struggle to build effective detections and have too many false positives/negatives
● Our processes are too manual; we are too slow to respond to and remediate threats
● We don’t have enough skilled engineers to make everything work

BARD SAYS: What is the essence
of the DevOps revolution in IT?
DevOps is a set of practices that combines software development
(Dev) and IT operations (Ops). It aims to shorten the systems
development life cycle and provide continuous delivery with high
quality. DevOps is a culture, movement, or practice
that emphasizes the collaboration and communication of both
software developers and other information technology professionals
while automating the process of software delivery and infrastructure
changes.
The DevOps revolution in IT is about breaking down the silos
between development and operations teams and creating a more
collaborative and efficient way of working. It’s about automating
tasks, reducing manual errors, and improving communication and
collaboration.
DevOps can help organizations deliver software faster, more
reliably, and with higher quality. It can also help
organizations reduce costs and improve efficiency.

Google vs Enterprise “SecOps”

Problem: Efficiency
● What does Google do? Automation/SRE is a mindset: part of the hiring process, part of OKRs, and performance reviews
● What do most enterprises do? Experimenting with SOAR, full adoption is tough due to minimal automation culture

Problem: Employee Shortage
● What does Google do? Requires coding interviews, high pay, attracts the best, invests in growth
● What do most enterprises do? Hires traditional roles, no coding, rarely outsources, less pay, less growth, more stress

Problem: Employee Burnout
● What does Google do? 40/40/20 between eng, operations, and learning
● What do most enterprises do? Utilization is almost always >100%

Problem: Expensive
● What does Google do? Investment in efficiency solves for human costs
● What do most enterprises do? Cost-prohibitive data ingestion, oftentimes paying SIEM + DIY, increasing $ from complexity

Problem: Efficacy
● What does Google do? TI strongly embedded in D&R, mostly utilized towards proactive work, strong collaboration across Alphabet & benefits from developer hygiene
● What do most enterprises do? CTI team produces great reports, SOC doing fire drills, >90% false positive rate, uneven distribution of skill (Tier 3)

Let’s focus on 5 key areas today
● Eliminate Toil
● Use SLOs
● Evolve Automation
● Practice Release Engineering
● Strive for Simplicity

01 Eliminate Toil
“...manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Causes of Toil
1. Too much technical debt
2. Priorities or goals are not aligned
3. Lack of training or support
4. Lack of collaboration
5. The business value to fix is too hard to realize

Less Gathering, More Analysis: basics to automate
● Gathering machine information
● Gathering user information
● Process executions
● All context needed to help get to final (human) judgement

Key Activities To Implement
● Train your team on toil & automation
● Create an Automation Queue
● Implement Blameless Postmortems
● Conduct Weekly Incident Reviews
● Implement SOAR
● Hire Automation Engineer(s)
● Implement CD/CR pipelines with metrics
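
The "less gathering, more analysis" idea can be sketched in a few lines of Python. This is a minimal illustration, not a real SOAR integration: the Alert class and the three lookup functions are hypothetical stand-ins for an asset inventory, a user directory, and an EDR query. The point is only that context gathering runs automatically, so the analyst starts at the judgement step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    alert_id: str
    hostname: str
    username: str
    context: Dict[str, object] = field(default_factory=dict)


# Hypothetical lookups. In practice these would query an asset inventory,
# a user directory, and an EDR tool; here they return placeholder data.
def lookup_machine(alert: Alert) -> dict:
    return {"hostname": alert.hostname, "os": "unknown", "owner": "unknown"}


def lookup_user(alert: Alert) -> dict:
    return {"username": alert.username, "department": "unknown"}


def lookup_process_executions(alert: Alert) -> list:
    return []


ENRICHERS: Dict[str, Callable[[Alert], object]] = {
    "machine": lookup_machine,
    "user": lookup_user,
    "process_executions": lookup_process_executions,
}


def enrich(alert: Alert) -> Alert:
    """Run every enricher and attach its result to the alert context.

    A failing lookup is recorded instead of raised, so one broken data
    source does not block the analyst from seeing the rest.
    """
    for name, enricher in ENRICHERS.items():
        try:
            alert.context[name] = enricher(alert)
        except Exception as exc:
            alert.context[name] = {"error": str(exc)}
    return alert


if __name__ == "__main__":
    enriched = enrich(Alert("A-1", "host-01", "jdoe"))
    for key, value in enriched.context.items():
        print(key, value)
```

Adding a new data source then means adding one entry to ENRICHERS, which is the kind of item an automation queue would track.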

02 Evolve Automation
10X is an Underestimate!
● Analyst utilization gets optimized
● More creative work, less toil
● Time back to do more proactive work
● Deeper operationalization of intel
● SecOps can scale with the business!
[Chart: unit costs per event]
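
One way to reason about unit costs per event is a simple blended-cost calculation. The sketch below is an assumption-laden illustration, not a model from the talk: the function name, the parameters, and every number in the usage example are hypothetical and should be replaced with your own figures.

```python
def unit_cost_per_event(
    total_events: int,
    automated_fraction: float,
    minutes_per_manual_event: float,
    analyst_cost_per_hour: float,
    automation_cost_per_event: float,
) -> float:
    """Blended cost of handling one event when a share of events is automated.

    All inputs are illustrative assumptions supplied by the caller.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 <= automated_fraction <= 1.0:
        raise ValueError("automated_fraction must be between 0 and 1")

    manual_events = total_events * (1.0 - automated_fraction)
    automated_events = total_events * automated_fraction

    manual_cost = manual_events * (minutes_per_manual_event / 60.0) * analyst_cost_per_hour
    automated_cost = automated_events * automation_cost_per_event
    return (manual_cost + automated_cost) / total_events


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the curve.
    for fraction in (0.0, 0.5, 0.9):
        cost = unit_cost_per_event(1000, fraction, 15, 60.0, 0.10)
        print(f"automated={fraction:.0%} unit_cost={cost:.2f}")
```

With these made-up inputs the unit cost falls roughly in proportion to the manual share that remains, which is why analyst time, not tooling, tends to dominate the figure.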

03 Use SLOs
Tips, gotchas, and core metrics to consider

Core metrics
● Event volume
● Event source counts
● Pipeline latency
● Triage time median
● Triage time at 95%
● Incident resolution times
● Common metrics (false positives, # of incidents, etc)

Key tips
1. Optimize metrics for optimal value
2. Manage with indicators + objectives
3. Metrics matter (in context)
4. Defeating attackers beats SLOs
5. Choose metrics that actually matter
6. Make your SLOs open (within your company)

Gotchas
● Fast =/= better: don’t incentivize speed, incentivize thoroughness
● More =/= better: solving 5000 cases manually is not better than automating that work; #NoHeroes
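
The triage-time metrics above (median and 95th percentile) can be computed and compared with objectives in a few lines. This is a sketch under stated assumptions: the sample triage times and the SLO targets are invented for illustration, and the percentile uses the simple nearest-rank method rather than any particular tool's definition.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def triage_indicators(triage_minutes: List[float]) -> Dict[str, float]:
    """Service level indicators for triage time."""
    return {
        "median": statistics.median(triage_minutes),
        "p95": percentile(triage_minutes, 95),
    }


def check_objectives(indicators: Dict[str, float], objectives: Dict[str, float]) -> Dict[str, bool]:
    """True where the indicator is at or under its objective."""
    return {name: indicators[name] <= target for name, target in objectives.items()}


if __name__ == "__main__":
    # Hypothetical triage times in minutes and hypothetical targets.
    sample = [4, 6, 7, 9, 12, 15, 18, 22, 35, 90]
    slis = triage_indicators(sample)
    print(slis)
    print(check_objectives(slis, {"median": 15, "p95": 60}))
```

Tracking the 95th percentile next to the median exposes the slow tail that an average would hide. Per the gotchas above, treat a missed objective as a prompt to investigate, not as a reason to rush analysts.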

04 Practice Release Engineering

Ad Hoc
● Visibility: Onboard log sources as part of major tech transformation. Review logs for new sources when a problem comes up.
● Security Analytics: Significant development effort up front to implement use cases. Add/Update detections in response to major new threat.
● Response Orchestration: Significant development effort up front to implement playbooks. Review playbooks when a major problem occurs.

Periodic
● Visibility: Onboard log sources on an annual or quarterly planned schedule. Review data monthly for new log sources and to identify issues/outages.
● Security Analytics: Quarterly review of detection efficacy. Update/Deprecate ineffective detections. Add/Update detections in response to major threat.
● Response Orchestration: Quarterly review of playbook performance and effectiveness. Dev Sprint to update playbooks.

Continuous
● Visibility: Onboard new log sources as they are ready. Real-time identification of new log sources or log drops. Automatic creation of alerts for handling.
● Security Analytics: Real-time alerts for detection efficacy drift. Update/Deprecate ineffective detections at point of discovery. Active Threat Monitoring to proactively identify new threats to build detections for.
● Response Orchestration: Live Dashboards showing performance and accuracy metrics for playbooks. Update/Deprecate ineffective playbooks at point of discovery. Daily Review of SecOps work queues to identify automation opportunities.
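
As one possible reading of "alerts for detection efficacy drift", the sketch below compares each detection's current precision (share of alerts that analysts marked as true positives) with a stored baseline. The rule names, the verdict format, and the 0.2 drop threshold are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def precision_by_detection(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Precision per detection from (detection_name, is_true_positive) pairs."""
    totals: Counter = Counter()
    true_positives: Counter = Counter()
    for detection, is_true_positive in verdicts:
        totals[detection] += 1
        if is_true_positive:
            true_positives[detection] += 1
    return {name: true_positives[name] / totals[name] for name in totals}


def drifting_detections(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.2,  # assumed tolerance; tune for your environment
) -> List[str]:
    """Detections whose precision fell more than max_drop below baseline."""
    return sorted(
        name
        for name, precision in current.items()
        if name in baseline and baseline[name] - precision > max_drop
    )


if __name__ == "__main__":
    # Hypothetical baseline and recent analyst verdicts.
    baseline = {"rule_a": 0.8, "rule_b": 0.5}
    recent = [
        ("rule_a", True), ("rule_a", False), ("rule_a", False), ("rule_a", False),
        ("rule_b", True), ("rule_b", False),
    ]
    print(drifting_detections(baseline, precision_by_detection(recent)))
```

Running a check like this on every batch of closed alerts, instead of once a quarter, is what moves detection review from the Periodic column to the Continuous one.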

05 Strive for Simplicity
One consequence of not striving for simplicity

Human expertise in complex systems is constantly changing
“Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave. In every case, training and refinement of skill and expertise is one part of the function of the system itself. At any moment, therefore, a given complex system will contain practitioners and trainees with varying degrees of expertise.
Critical issues related to expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or demanding production needs and (2) the need to develop expertise for future use.”
https://how.complexsystems.fail/

Actions
● Reduce toil in your SOC: shift toil to machines
● Use SLOs / metrics to drive change
● Evolve automation in SIEM, SOAR, threat intel, etc
● Practice release engineering for consistent improvement
● Strive for simplicity with processes, technology stack, etc

The Power of Continuous Improvement
● Exponential growth happens faster when compounded more frequently
● Organizing your people and processes around continuous improvement means more agility and fewer resources
● Periodic improvement strategies leave capability gaps between sprints
[Chart: Improvement over Time]
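
The compounding claim can be checked with the standard compound-growth formula, (1 + r/n)^n, where r is the yearly improvement rate and n is how many times per year improvements land. The 50% rate in this sketch is an arbitrary assumption for illustration, not a figure from the talk.

```python
def growth_factor(annual_rate: float, periods_per_year: int, years: float = 1.0) -> float:
    """Capability multiplier after compounding annual_rate over n periods per year."""
    if periods_per_year <= 0:
        raise ValueError("periods_per_year must be positive")
    return (1.0 + annual_rate / periods_per_year) ** (periods_per_year * years)


if __name__ == "__main__":
    # Same nominal yearly improvement rate (assumed 50%), different cadences.
    for label, periods in (("annual", 1), ("quarterly", 4), ("weekly", 52)):
        print(f"{label:9s} {growth_factor(0.5, periods):.4f}")
```

For the same nominal rate, more frequent cadences give a larger multiplier, though the gain levels off as the cadence increases.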

Resources
● “Achieving Autonomic Security Operations: Reducing toil”
● “Achieving Autonomic Security Operations: Why metrics matter (but not how you think)”
● “More SRE Lessons for SOC: Release Engineering Ideas”
● “Achieving Autonomic Security Operations: Automation as a Force Multiplier”
● “More SRE Lessons for SOC: Simplicity Helps Security”
