OF SEARCH LIGHTS AND BLIND SPOTS: MACHINE LEARNING IN CYBERSECURITY
SVEN KRASSER, CHIEF SCIENTIST, CROWDSTRIKE
WHO?
§ CrowdStrike
§ Endpoint protection & breach prevention
§ Endpoint sensor connecting to the cloud
§ Processing 3 trillion events per week
§ My team: Data Science
§ Malware and threat research
§ Sandbox and dynamic analysis
§ Data engineering
§ Machine Learning research
§ Machine Learning software development
§ Hybrid-Analysis.com
ML IN CYBERSECURITY
LONG-TIME USE BEHIND THE SCENES
SOMETHING CHANGED ~2013
MECHANICS & ENGINEERS*
* Loosely quoted from an unattributed ML researcher
THE DEMOCRATIZATION OF ML
NEW CHALLENGES
§ “ML as panacea”
§ “ML is inherently safe”
§ ML monoculture
§ ML performance is poorly understood
QUANTIFYING THE PROBLEM
PROJECTIONS THROUGH 2022
Source: Gartner (2019)
§ 75%: Data governance initiatives not adequately considering AI security risks, resulting in financial loss
§ 30%: Cyberattacks leveraging data poisoning, model theft, or adversarial samples
“DO YOU SECURE YOUR ML SYSTEMS TODAY?”
Source: Shankar et al., “Adversarial Machine Learning – Industry Perspectives” (2020)
14%* “Yes”
* ⅓ of organizations polled are in the cybersecurity space
STATIC ANALYSIS
WHY TALK ABOUT THIS FIELD TODAY?
§ Data is plentiful and unencumbered
§ Challenges translate into other domains
§ Static analysis, while limited, is a cheap workhorse
§ Reducing volume of low-effort attacks
§ Saving compute (and hence dollars) for more complex analysis
§ Pre-execution detection
§ Detection on-the-wire (attachment) and at rest (storage)
[Figure: Detection rate over time, annotated with AV updates (roughly one day apart) and the appearance of new malware; the rate drops when new malware appears and recovers with the next update.]
BASE RATE CHALLENGES
§ 125,000 executables on an average hard disk
§ 20,000 process executions per day
100% TPR @ 1% FPR
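To make the base rate point concrete, here is a back-of-the-envelope calculation using the numbers from the previous slide. The 1% FPR comes from the slide title; treating nearly every file as benign is a simplifying assumption for illustration.

```python
# Back-of-the-envelope base rate math using the numbers from the slides.
# Assumption: the overwhelming majority of these files are benign.
executables_on_disk = 125_000   # executables on an average hard disk
executions_per_day = 20_000     # process executions per day
fpr = 0.01                      # a model advertising "only" 1% false positives

false_alarms_on_disk = executables_on_disk * fpr
false_alarms_per_day = executions_per_day * fpr

print(f"False positives across the disk: ~{false_alarms_on_disk:,.0f}")  # ~1,250
print(f"False positives per day:         ~{false_alarms_per_day:,.0f}")  # ~200
# Even a model with perfect 100% TPR is unusable at 1% FPR on these volumes.
```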
HOW THE GAME WAS PLAYED
Manual evasions and corresponding countermeasures
TRADITIONAL ATTACKER ARSENAL
§ Hashbusting
§ Polymorphism
§ Packing
§ Droppers
§ File infectors / hiding in regular files
§ Wrapped scripts
COUNTERING THE ATTACKER
§ Heuristics
§ Static unpacking
§ Deep format inspection
§ Emulation
① Adversaries focus on traditional evasions, which stick out to ML
② Adversaries target ML blind spots
③ Adversaries leverage ML for robust evasions
The panacea “track”
[Figure: Problem space vs. feature space: realizable files on one side, a matrix of numeric feature vectors on the other.]
WORKING IN FEATURE SPACE
§ Choosing a feature space that always produces realizable files (see the sketch after this list)
§ Such as specific binary traits that can be added (but not necessarily removed), e.g. Al-Dujaili et al. (2018)
§ Imported function names, resources, sections, strings, digital signature, etc.
§ Similar to how an adversary would attack the model
§ Use a substitute model with such a feature space to attack a blackbox model
§ E.g. MalGAN, Hu and Tan (2017)
§ Create (likely) unrealizable feature vectors with some utility
§ Not a realizable attack but allows better preparing for one
§ Increasing robustness at training time
§ Creating pseudo variants for test time (“new family” scenario)
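A minimal sketch of the add-only constraint: a greedy feature-space search against a stand-in surrogate model, where binary traits (imports, strings, etc.) may be set but never cleared. The model, data, and threshold below are random placeholders, not any published attack's actual setup.

```python
# Hedged sketch of an add-only feature-space evasion (in the spirit of
# Al-Dujaili et al., 2018). Surrogate model, data, and labels are random
# placeholders; a real attack would target a trained substitute model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))        # binary trait vectors
y = rng.integers(0, 2, size=1000)              # placeholder labels
surrogate = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0].copy()                                # the "malware" sample
for _ in range(50):                            # bounded perturbation budget
    if surrogate.predict_proba([x])[0, 1] < 0.5:
        break                                  # surrogate now says benign
    candidates = np.flatnonzero(x == 0)        # traits can only be added
    if candidates.size == 0:
        break
    # Greedily add the single trait that lowers the malware score the most.
    scores = []
    for j in candidates:
        x_try = x.copy()
        x_try[j] = 1
        scores.append(surrogate.predict_proba([x_try])[0, 1])
    x[candidates[int(np.argmin(scores))]] = 1
print("final surrogate malware score:", surrogate.predict_proba([x])[0, 1])
```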
WORKING IN PROBLEM SPACE
A look at both realizable and real-world attacks
Ashkenazy and Zini (2019)
“CHAFF” ATTACK
§ Attack on a security vendor production model deployed on endpoints
§ Unconstrained sparse string-based features
§ “This string exists somewhere in the file”
§ Likely heavily weighted
§ Non-monotonic model
§ Extracting strings from files on the product’s whitelist
§ How to toggle the corresponding features?
§ Add the string somewhere
§ Appending to the end of a Portable Executable (the “overlay”) generally keeps the executable working (see the sketch after this list)
§ → All realizable
§ Bypass achieved
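A minimal sketch of the mechanics, with hypothetical file names and strings (the published attack drew its strings from the product's whitelist): bytes appended past the end of the PE image land in the overlay, where string extractors see them but the loader ignores them.

```python
# Hedged sketch of the "chaff" mechanics described above. File names and
# strings are illustrative placeholders only.
from pathlib import Path

benign_strings = [b"Copyright (C) Example Corp.", b"WINDOWS\\system32"]  # hypothetical

src = Path("sample.exe")          # placeholder input PE
dst = Path("sample_chaff.exe")    # placeholder output

data = src.read_bytes()
# Bytes appended after the last section land in the overlay: headers,
# sections, and entry point are untouched, so the program still runs,
# but naive "this string exists somewhere in the file" features now fire.
dst.write_bytes(data + b"\x00" + b"\x00".join(benign_strings))
```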
Winning Offensive Solution – Fleshman (2019)
ML STATIC EVASION COMPETITION
§ Modify malware to bypass 3 non-production research models
§ MalConv (DNN, raw bytes)
§ Non-negative MalConv
§ EMBER (engineered features and LightGBM; Anderson and Roth, 2018)
§ Modified files are verified in a sandbox environment
§ The DNN models have only unconstrained features (data anywhere in the file can nudge the score)
§ EMBER has some unconstrained features
§ Byte entropy histogram (continuous features; see the sketch after this list)
§ Strings
§ Data injected in various areas
§ Overlay
§ New sections
§ Empty space at end of sections (alignment)
§ Bypass achieved
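To illustrate why such continuous features are "unconstrained", here is a simplified byte entropy histogram in the spirit of EMBER's feature set; window size, stride, and binning are illustrative, not the exact published parameters. Appending overlay data visibly shifts the resulting vector.

```python
# Simplified sketch of a byte entropy histogram feature. Window, stride,
# and bin counts are illustrative assumptions.
import numpy as np

def byte_entropy_histogram(data: bytes, window=2048, stride=1024, bins=16):
    hist = np.zeros((bins, bins))                         # (entropy bin, byte bin)
    arr = np.frombuffer(data, dtype=np.uint8)
    for start in range(0, max(len(arr) - window, 0) + 1, stride):
        chunk = arr[start:start + window]
        counts = np.bincount(chunk >> 4, minlength=bins)  # coarse byte bins
        probs = counts / counts.sum()
        nz = probs[probs > 0]
        entropy = -np.sum(nz * np.log2(nz))               # 0..4 bits for 16 symbols
        ent_bin = min(int(entropy * bins / 4), bins - 1)
        hist[ent_bin] += counts
    return (hist / max(hist.sum(), 1)).ravel()            # normalized feature vector

base = bytes(range(256)) * 64                  # synthetic stand-in for a binary
padded = base + b"\x00" * 4096                 # simulate appended overlay data
delta = np.abs(byte_entropy_histogram(base) - byte_entropy_histogram(padded)).sum()
print(f"feature shift from appended data: {delta:.3f}")   # nonzero: room to nudge
```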
Anderson et al. (2018)
LEARNING TO EVADE
§ Reinforcement Learning approach to pick the best sequence of modifications to achieve evasiveness
§ Action space (listed below; a schematic of the loop follows the list)
§ Modest evasiveness achieved (but no manual intervention as in the previous two approaches)
§ Add import
§ Change section names
§ Create section
§ Append data to sections
§ New entry point (EP) that jumps to the original EP
§ Remove signer info
§ Change debug info
§ Pack / unpack
§ Break header checksum
§ Add to overlay
§ Etc.
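A schematic of the loop, with a stub environment and a random policy standing in for the learned agent. Everything below is a placeholder sketch, not the published implementation, which wraps functionality-preserving PE modifications in an OpenAI-Gym-style environment and trains an agent on the evasion reward.

```python
# Schematic of the RL evasion loop. Transformations and the target model
# are stubs; a real agent (Anderson et al., 2018) learns which action to
# pick instead of sampling at random.
import random

ACTIONS = ["add_import", "rename_sections", "create_section",
           "append_to_sections", "new_entry_point", "strip_signer_info",
           "modify_debug_info", "pack", "unpack", "break_checksum",
           "append_overlay"]

def apply_action(pe: bytes, action: str) -> bytes:
    """Placeholder: each action would rewrite the PE while preserving behavior."""
    return pe + action.encode()

def model_score(pe: bytes) -> float:
    """Placeholder for the target classifier's malware score in [0, 1]."""
    return random.random()

pe = b"MZ..."                            # stub malware bytes
for step in range(10):                   # bounded modification budget
    action = random.choice(ACTIONS)      # a trained agent would learn this policy
    pe = apply_action(pe, action)
    if model_score(pe) < 0.5:            # reward: classifier flips to benign
        print(f"evaded after {step + 1} actions")
        break
```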
Elkind (2019)
MITIGATING THROUGH REGULARIZATION
§ Premise
§ We know of several perturbation techniques resulting in realizable attacks
§ We want the model to ignore such modifications without constraining the feature space and
reducing expressiveness
§ Pairwise Hidden Regularization (a PyTorch-style sketch follows this list)
§ Penalize differences in hidden representations h(·) in the DNN between original file x and perturbed file x̃
§ min_θ Loss(θ) + λ‖h(x, θ) − h(x̃, θ)‖²
§ Training on perturbed pairs
§ Notionally, perturbed files have a modified overlay (appended data)
§ Other modifications can be implemented accordingly (e.g. adding sections)
§ Models more robust; evasions more expensive
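A minimal PyTorch sketch of the training objective above. The toy architecture, λ, and the synthetic perturbation are illustrative assumptions, not the published setup; in practice x̃ would be the features of a file with a modified overlay.

```python
# Hedged PyTorch sketch of pairwise hidden regularization.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, n_features=256):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        h = self.hidden(x)            # h(x, θ): the regularized representation
        return self.head(h), h

net, lam = Net(), 0.1
opt = torch.optim.Adam(net.parameters())
bce = nn.BCEWithLogitsLoss()

x = torch.rand(32, 256)                   # stand-in feature vectors
x_tilde = x + 0.05 * torch.rand_like(x)   # stand-in perturbed counterparts
y = torch.randint(0, 2, (32, 1)).float()  # stand-in labels

logits, h = net(x)
_, h_tilde = net(x_tilde)
# Loss(θ) + λ‖h(x, θ) − h(x̃, θ)‖²
loss = bce(logits, y) + lam * (h - h_tilde).pow(2).sum(dim=1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```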
CONCLUSIONS
§ Educating decision makers about ML
§ Off-the-shelf guardrails; best practices for safety
§ Cost reduction for the adversary; means to increase it again
§ Opportunity for defenders to achieve higher levels of robustness
§ Detectability; avoid silent failure
sven@crowdstrike.com
@SvenKrasser
