The Data Dilemma: Bridging Offensive Operations and Machine Learning
Agenda
1. Introduction
2. Red Team Challenges
3. The Offensive Data Landscape
4. Offensive ML Challenges
5. Nemesis
6. Seeing the Forest for the Trees
#whoami
Will Schroeder (@harmj0y), @SpecterOps
Background

1. Introduction
Red Team Operations
https://redteam.guide/docs/Concepts/red-vs-pen-vs-vuln
Offensive Machine Learning
“Application of machine learning to offensive problems.”
- Will Pearce (@moo_hax / @dreadnode)
Examples: automating phishing, sandbox detection, RAG on stolen docs, file share mining, better password guessing, EDR evasion, tradecraft suggestions, evading WAFs, password detection
Adversarial Machine Learning
“Subdiscipline that specifically attacks ML algorithms.”
- Will Pearce (@moo_hax / @dreadnode)
Attack classes: extraction, evasion, inversion, inference, poisoning
2. Red Team Challenges
Red Team Challenges
• Tradecraft is difficult to scale
• Offensive data and tooling is not unified
• File and tool output triage is tedious and inconsistent
Red Team Challenges 1: Manual Triage
• File/data triage is one of the most common tasks in offensive operations, but it’s (usually) been heavily manual
• Automated workflows for this type of task just haven’t existed (until recently); a minimal sketch of such a workflow follows
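To make “automated triage” concrete, a minimal sketch that buckets downloaded loot by magic bytes and flags review candidates. The extension list, magic values, and the ./loot path are illustrative assumptions, not any particular tool’s logic:

```python
# Minimal sketch of automating "level 1" file triage over a directory of
# downloaded loot: identify file types by magic bytes and flag
# likely-interesting files for a human. All constants are illustrative.
from pathlib import Path

INTERESTING_EXT = {".config", ".ps1", ".kdbx", ".pem", ".rdp", ".vhd"}
MAGIC = {b"MZ": "pe", b"PK": "zip/office", b"%PDF": "pdf", b"SQLite": "sqlite"}

def sniff(path: Path) -> str:
    """Identify a file type from its first bytes."""
    with path.open("rb") as f:
        head = f.read(8)
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return "unknown"

def triage(root: str):
    """Yield one triage record per file under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        yield {
            "path": str(path),
            "type": sniff(path),
            "size": path.stat().st_size,
            "flag_for_review": path.suffix.lower() in INTERESTING_EXT,
        }

for record in triage("./loot"):
    if record["flag_for_review"]:
        print(record)
```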
Red Team Story Time
Repeat x 100,000
Red Team Challenges 2: Tooling Issues
• Offensive tools weren’t built to interop
• Tooling to get the data we want might not even exist on the offensive side!
• We (now) often have to fight defensive products to get the data we want…
Red Team Challenges 3: Data Issues
• Offensive data (like tools) is mostly unstructured
• Offensive data (like tools) often also doesn’t interop well
• We also often have heavy data sensitivity/retention issues
Red Team Challenges 4: Scaling Tradecraft
• All operators are not equivalently skilled!
• “Scaling tradecraft” today == writing articles in Confluence/Notion/etc.
• We don’t even have a way to effectively scale tradecraft across teams, much less the industry as a whole
3. The Offensive Data Landscape
A Defensive View on Data
• It is significantly easier to gather large data sets on the defensive side than the offensive side
• Outside of sharing malware, most organizations keep this data to themselves - this produces a lot of asymmetry
• Sidenote: metadata vs full data…
Differences in Scale
• Defense deals with data scales several orders of magnitude larger than offense
• Because of the scale difference + the base rate issue, defense has to be right nearly 100% of the time!
• Offense only has to be mostly right most/some of the time, and is significantly more tolerant of false positives and negatives
The Base Rate Fallacy
https://en.wikipedia.org/wiki/Base_rate_fallacy
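A quick worked example makes the asymmetry concrete; the numbers (1 malicious event per 100,000, 99% detection rate, 1% false-positive rate) are assumptions for illustration:

```python
# Worked base-rate example with assumed, illustrative numbers:
# P(malicious) = 1e-5, detector TPR = 0.99, FPR = 0.01.
p_malicious = 1e-5
tpr, fpr = 0.99, 0.01

# Bayes' rule: P(malicious | alert) = TPR*P(m) / (TPR*P(m) + FPR*P(not m))
p_alert = tpr * p_malicious + fpr * (1 - p_malicious)
p_malicious_given_alert = tpr * p_malicious / p_alert

print(f"{p_malicious_given_alert:.4%}")  # ~0.1%: almost every alert is false
```

Even a detector this good produces roughly a thousand false alerts for every true one at defensive scale; offense rarely faces a denominator like that.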
Human-Focused Tooling

Offensive Data Collection Challenges

Batch vs Incremental Ingestion
• We want to collect data from multiple sources like C2 + raw data + etc.
• This presents a significant modeling challenge, as you can’t know when data is “complete”
• Abstractions built from multiple sources

Information Compression
• Historically, offensive tools have done a lot of processing on the host itself and returned relevant “interesting” information
• Since all data is not returned, the information is essentially “compressed” and data is lost
• Collecting/analyzing the raw data instead supports automation and researching new attack paths
• Example: Windows services are derived from registry keys (see the sketch below)
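To ground that example: Windows builds its service list from values under HKLM\SYSTEM\CurrentControlSet\Services, so collecting the raw keys (rather than an on-host tool’s filtered summary) preserves everything for offline analysis. A minimal, Windows-only sketch using the standard-library winreg module; the “keep everything” policy is the point, not the specific code:

```python
# Minimal sketch: enumerate raw Windows service definitions from the registry
# (HKLM\SYSTEM\CurrentControlSet\Services) instead of returning a pre-filtered
# "interesting services" summary on-host. Windows-only; stdlib winreg.
import winreg

SERVICES_KEY = r"SYSTEM\CurrentControlSet\Services"

def dump_raw_services() -> dict:
    services = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICES_KEY) as root:
        index = 0
        while True:
            try:
                name = winreg.EnumKey(root, index)
            except OSError:  # no more subkeys
                break
            index += 1
            values = {}
            with winreg.OpenKey(root, name) as svc:
                i = 0
                while True:
                    try:
                        value_name, data, _type = winreg.EnumValue(svc, i)
                    except OSError:  # no more values
                        break
                    values[value_name] = data
                    i += 1
            services[name] = values  # keep everything; triage happens offline
    return services

if __name__ == "__main__":
    print(f"collected {len(dump_raw_services())} raw service definitions")
```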
Offensive Data Modeling Challenges
• Offensive-focused data models exist solely in isolation (BloodHound, etc.)
• A unified offensive data model is very, very hard, partially due to tool diversity and lack of interop
• We’re (slowly) trying to work towards a unified model with Nemesis
4. Offensive Machine Learning Challenges
Lack of Dual-Domain Experts
• There are very few true experts versed in both information security and machine learning
• Except these two (and maybe a few others): @dreadnode
Lack of Relevant Data Sets
Existing public security data sets have traditionally been… largely terrible.
The class imbalance problem, the privacy problem, and the defensive close-hold problem all help explain why good data hasn’t been released.
No one releases high-quality, timely, realistic security data!
…but why would they?
The Privacy Problem
Would you trust OpenAI or another provider with a client’s domain admin password?
Based on client contracts/compliance, are you even allowed to if you wanted to?
Revelation: Synthetic Data
Good-quality, labeled data has almost always been the most common factor holding us back offensively.
Large state-of-the-art models can be used to generate high-quality synthetic data that we can use to fine-tune smaller models (one reason local models have become so good!); a sketch of this loop follows below.
However, the distribution of generated synthetic data can, at least in some cases, differ from the distribution of the real-world data it’s mimicking, so this isn’t a silver bullet.
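A minimal sketch of that loop, assuming the OpenAI Python client (>=1.0); the model name, prompt, and label scheme are illustrative placeholders, and (per the privacy problem above) nothing sensitive should go into such prompts:

```python
# Minimal sketch: use a large model to generate labeled synthetic examples,
# written out as JSONL for fine-tuning a smaller local model. Model name,
# prompt, and label set are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Generate one realistic-looking (but entirely fake) Windows service "
    "command line, then label it 'suspicious' or 'benign'. Respond as JSON: "
    '{"text": "...", "label": "..."}'
)

def generate_examples(n: int, out_path: str = "synthetic.jsonl") -> None:
    with open(out_path, "w") as f:
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4o",  # assumption: any capable chat model works
                messages=[{"role": "user", "content": PROMPT}],
                response_format={"type": "json_object"},
                temperature=1.0,  # diversity matters for synthetic data
            )
            record = json.loads(resp.choices[0].message.content)
            f.write(json.dumps(record) + "\n")

generate_examples(100)
```

Deduplicating and spot-checking the generated set helps, but it doesn’t eliminate the distribution-shift caveat above.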
Why We Can’t Release Models
This all comes down to the inversion class of adversarial ML attacks: a model fine-tuned on sensitive operational data can be coaxed into regurgitating that data, as the sketch below illustrates.
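A hedged sketch of the concern, using the Hugging Face transformers API; the model name and prompt are hypothetical placeholders, and the point is only that greedy decoding can replay memorized fine-tuning data:

```python
# Minimal sketch: probing a (hypothetical) fine-tuned causal LM for memorized
# training data, the core worry behind inversion/extraction attacks.
# "our-internal-finetune" is a placeholder; no such public model is implied.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "our-internal-finetune"  # hypothetical fine-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A prefix that plausibly appeared in the fine-tuning data.
prompt = "Domain admin credentials for CORP:"
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding maximizes the chance of replaying memorized sequences.
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```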
Lack of Security-Focused Models
• There are only a handful of cybersecurity-focused models (e.g., cyBERT)
• This is starting to change with local model fine-tunes on Hugging Face…
5. Nemesis

Vision: a centralized data processing platform that ingests, enriches, and performs analytics on offensive security assessment data.
Example Enrichment Flow
Example Enriched File
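The deck shows screenshots at this point; as a stand-in, here is a minimal sketch of what one enrichment stage might look like. The stage names, tags, and data shapes are hypothetical; this is not Nemesis’s actual API:

```python
# Minimal sketch of a file-enrichment pipeline, in the spirit of the flow
# pictured on the slide. Stage names, tags, and data shapes are hypothetical;
# this is not Nemesis's actual API.
import hashlib
import re
from dataclasses import dataclass, field

@dataclass
class FileEvent:
    path: str
    data: bytes
    enrichments: dict = field(default_factory=dict)

CRED_PATTERN = re.compile(rb"(?i)(password|passwd|pwd)\s*[=:]\s*\S+")

def enrich_hashes(event: FileEvent) -> FileEvent:
    event.enrichments["sha256"] = hashlib.sha256(event.data).hexdigest()
    return event

def enrich_credentials(event: FileEvent) -> FileEvent:
    hits = CRED_PATTERN.findall(event.data)
    event.enrichments["possible_credentials"] = len(hits)
    return event

PIPELINE = [enrich_hashes, enrich_credentials]

def process(event: FileEvent) -> FileEvent:
    for stage in PIPELINE:  # each stage adds enrichments to the event
        event = stage(event)
    return event

evt = process(FileEvent("web.config", b"connectionString password=hunter2"))
print(evt.enrichments)
```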
Offensive Analysis
We want to automate away level 1 analysis (the boring/tedious tasks) and perform as much “offline” analysis on raw data as possible.
This approach permits analyzing relationships between (previously) disparate data sources to accomplish things that used to require manual analysis.
A goal is to provide operator feedback and suggestions based on collected/analyzed data (this is where LLM integration can come in; see the sketch below!).
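One way that integration could look, pointed at a locally hosted, OpenAI-compatible endpoint; the base URL, model name, and finding shapes are assumptions, not how Nemesis actually implements it, and a local model sidesteps the privacy problem discussed earlier:

```python
# Minimal sketch: ask a locally hosted LLM for next-step suggestions based on
# already-enriched findings. Endpoint, model name, and finding shape are
# hypothetical; a local model avoids shipping client data to a third party.
import json
from openai import OpenAI

# Assumption: a local vLLM/llama.cpp-style server exposing the OpenAI API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def suggest_next_steps(findings: list[dict]) -> str:
    summary = json.dumps(findings[:20], indent=2)  # cap the context we send
    resp = client.chat.completions.create(
        model="local-model",  # placeholder model identifier
        messages=[
            {"role": "system", "content": "You advise red team operators."},
            {
                "role": "user",
                "content": "Given these enriched findings, suggest "
                           f"prioritized next steps:\n{summary}",
            },
        ],
    )
    return resp.choices[0].message.content

print(suggest_next_steps([{"path": "web.config", "possible_credentials": 1}]))
```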
Demo
6. Seeing the Forest for the Trees
But why does this matter…?

It’s not just what Nemesis does, it’s what it will allow us to do!

This is a (possible) paradigm shift for red teams towards offensive data unification and off-host data processing that offers numerous advantages.
Advantages
• Centrally update operator analysis workflows
• Enrichments/analytics added exist in perpetuity for ALL operators on ALL operations
• Offline processing allows for retroactive analysis of data
• Minimizes footprint of offensive tooling on endpoints
• Collected structured/unstructured data can guide future research and automation
Thank you!
Questions?
https://specterops.io/
http://www.github.com/SpecterOps/Nemesis
