Practical Machine Learning in Information Security
1.
PRACTICAL MACHINE LEARNING
IN INFORMATION SECURITY
DR. SVEN KRASSER CHIEF SCIENTIST
@SVENKRASSER
2.
2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
MACHINE LEARNING AT CROWDSTRIKE
§ ~40 billion events per day
§ ~800 thousand events per second peak
§ ~700 trillion bytes of sample data
§ Local decisions on endpoint and large scale analysis in cloud
§ Static and dynamic analysis techniques, various rich data sources
§ Analysts generating new ground truth 24/7
3.
BRIEF ML EXAMPLE
(Scatter plot: “Buttock Circumference” [mm] vs. Weight [10⁻¹ kg])
• What’s this? http://tinyurl.com/MLprimer
• Two features
• Two classes
4.
MODEL FIT
(Scatter plot with decision boundary: “Buttock Circumference” [mm] vs. Weight [10⁻¹ kg])
• Support Vector Machine
• Real world: more features
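The two-feature toy fit above can be sketched as a minimal linear SVM trained by subgradient descent on the regularized hinge loss. The data points and all parameter values below are made up for illustration; real models use far more features and a proper library.

```python
# Toy linear SVM sketch: plain Python, illustrative data only.
import random

def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=500):
    """Subgradient descent on hinge loss with L2 regularization."""
    w = [0.0, 0.0]
    b = 0.0
    idx = list(range(len(points)))
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            x, y = points[i], labels[i]          # y is +1 or -1
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:                        # inside margin: hinge gradient applies
                w[0] += lr * (y * x[0] - lam * w[0])
                w[1] += lr * (y * x[1] - lam * w[1])
                b += lr * y
            else:                                 # only regularization shrinks w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Two well-separated clusters standing in for the two classes
pts = [(1.0, 1.0), (1.5, 0.8), (1.2, 1.3), (4.0, 4.2), (4.5, 3.8), (3.9, 4.4)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(pts, ys)
print([predict(w, b, p) for p in pts])
```

With only two features the learned boundary is a line; in production the same idea runs in a much higher-dimensional feature space.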
5.
ML IN INFOSEC APPLICATIONS
§ Not a single model solving everything
§ But many models working on the data in scope
§ Endpoint vs cloud
§ Fast response vs long observation
§ Lean vs resource intensive
§ Effectiveness vs interpretability
§ Avoid ML blinders
§ The guy in your store at 2am wielding a crowbar is not a customer
7.
FALSE POSITIVE RATE
§ Most events are associated with clean executions
§ Most files on a given system are clean
§ Therefore, even low FPRs cause large numbers of FPs
§ Industry expectations driven by performance of narrow signatures
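The base-rate effect above is simple arithmetic. The fleet size and FPR below are illustrative numbers, not figures from the slides:

```python
# Base-rate math: when nearly everything scanned is clean, even a
# small false positive rate yields many false alarms in absolute terms.
clean_files = 1_000_000   # files scanned across a fleet, nearly all benign
fpr = 0.001               # a "good-sounding" 0.1% false positive rate
expected_fps = clean_files * fpr
print(int(expected_fps))  # → 1000 false alarms from clean files alone
```

This is why a model with an impressive-looking ROC curve can still be unusable in the field: the denominator is dominated by clean activity.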
8.
TRUE POSITIVE RATE
(Plot: chance of at least one success for the adversary vs. number of attempts at a 99% detection rate; with a 1% miss rate per attempt, the chance of at least one success exceeds 99.3% after 500 attempts)
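The figure's claim is easy to verify: treating each attempt as independent with a 1% miss rate, the chance of at least one adversary success after n attempts is 1 − 0.99ⁿ.

```python
# Verifying the figure: adversary success probability over repeated attempts
detection_rate = 0.99
attempts = 500
p_at_least_one_success = 1 - detection_rate ** attempts
print(round(p_at_least_one_success, 4))  # → 0.9934
```

The asymmetry is stark: a per-attempt detection rate that sounds excellent still guarantees the persistent attacker eventual success.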
9.
UNWIELDY DATA
§ Many outliers
§ Multimodal distributions
§ Sometimes narrow modes far apart
§ Adversary-controlled features
§ Mix of sparse/dense and discrete/continuous features
10.
DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
§ Real-world distribution (customer networks)
§ Evaluations (what customers test)
§ Testing houses (various 3rd party testers with varying methodologies)
§ Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
11.
REPEATABLE SUCCESS
Or: the second model needs to be cheaper
§ Retraining cadence
§ Concept drift
§ Changes in data content (e.g. event field definitions)
§ Changes in data distribution (e.g. event disposition)
§ Data cleansing is expensive (conventional wisdom)
§ Needs automation
§ Labeling can be expensive
§ Ephemeral instances (data content or distribution changed)
§ Lack of sufficient observations
§ Embeddings and intermediate models
§ Keep track of input data
§ Keep track of ground truth budget
13.
CLASSIFYING EVENT DATA
§ Idea: global classification
§ Observe all executions for a file, not just a single one
§ Initially only behavioral event data
§ In later versions also combined with static analysis data
§ Early project, focus on the data already there
§ Events fall into various categories, mainly:
§ Process data (hub)
§ Network data
§ DNS data
§ File system data
§ Capping data at 100 seconds since process start
§ Carving out a smaller problem
§ Ignoring classes of malware that are idle initially
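The 100-second cap described above can be sketched as a simple filter: keep only events observed within 100 seconds of their process's start. Event and field names below are illustrative, not the actual sensor schema.

```python
# Sketch of the 100-second data cap per process
CAP_SECONDS = 100

process_start = {"p1": 1000.0, "p2": 1050.0}   # pid -> process start timestamp
events = [
    {"pid": "p1", "ts": 1010.0},   # 10s after start: keep
    {"pid": "p1", "ts": 1200.0},   # 200s after start: drop
    {"pid": "p2", "ts": 1149.9},   # 99.9s after start: keep
]

capped = [e for e in events
          if e["ts"] - process_start[e["pid"]] <= CAP_SECONDS]
print(len(capped))  # → 2
```

As the slide notes, the trade-off is deliberate: the cap bounds the problem but blinds the model to malware that idles past the window.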
14.
RELEVANT ARCHITECTURE IN A NUTSHELL
Event
collector
Message
bus
Sensor
population
S3
Spark
Hash
DB
Cloud
15.
HIGH-LEVEL JOB FLOW
1. Read in event data
 • Filter by event type
 • Filter unneeded fields
2. Aggregate per-process data
 • Add derived features
 • One-to-one: combine events such as process creation and termination
 • One-to-many: combine events such as DNS requests (many per process) and add the result to the process record
3. Direct children
 • Join to parents and copy parent data into children
 • Aggregate children features by their parent
4. Second-order children
 • Aggregate second-order children by their parent’s parent
 • Aggregate with direct children
5. Process features
 • Combine process data with children data
6. Hash features
 • Roll up all process data by hash
 • Output per-hash statistics as behavioral features
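The core roll-up pattern above can be sketched in plain Python rather than Spark: fold many events into one record per process, then roll per-process records up by file hash. Event shapes and field names are illustrative.

```python
# Minimal sketch of per-process aggregation followed by per-hash roll-up
from collections import defaultdict

events = [
    {"pid": 1, "hash": "abc", "type": "dns"},
    {"pid": 1, "hash": "abc", "type": "dns"},
    {"pid": 2, "hash": "abc", "type": "net"},
    {"pid": 3, "hash": "def", "type": "dns"},
]

# One-to-many step: fold many events into one record per process
per_process = defaultdict(lambda: {"dns": 0, "net": 0})
for ev in events:
    per_process[(ev["pid"], ev["hash"])][ev["type"]] += 1

# Hash-feature step: roll all process records up by hash and emit
# per-hash statistics as behavioral features
per_hash = defaultdict(lambda: {"processes": 0, "dns": 0, "net": 0})
for (pid, h), rec in per_process.items():
    per_hash[h]["processes"] += 1
    per_hash[h]["dns"] += rec["dns"]
    per_hash[h]["net"] += rec["net"]

print(dict(per_hash))
# → {'abc': {'processes': 2, 'dns': 2, 'net': 1},
#    'def': {'processes': 1, 'dns': 1, 'net': 0}}
```

The per-hash record is what makes this "global classification": every execution of a file, across the fleet, contributes to one behavioral feature vector.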
16.
AGGREGATION
(Diagram: individual process records and children records are rolled up into a single per-hash record)
17.
LABELS
(Diagram: clean and dirty training data come from sandbox deployment; unlabeled data comes from large-scale field deployment)
§ Field data contains too little malware
§ Extra malware executions in sandbox
§ Need to consider bias introduced by sandbox
 § Parent process
 § Execution time
 § Location of file
18.
SPARK JOB DAG
19.
PROCESSING WITH SPARK
Challenges & Lessons Learned
§ Issues due to data size
§ Lots of cycles sunk into tuning memory parameters to address job failures
§ Job structure and recovery considerations (reprocessing not always viable)
§ Issues due to input data model
§ Highly referential event data, spreading information across many real-time events
§ Flattened tree/graph-based data
§ Complex to handle in Spark’s RDD model (see DAG)
§ Abstractions such as GraphX may help
§ Processing overhead
§ Job based on PySpark RDDs – most time spent on serialization/deserialization
§ Initial investment in migrating to Scala would have paid off in deployment
§ Life is now better with Dataframe API
§ Development velocity with Spark
§ Trivial to set up a local dev environment
§ Trivial to add unit tests
20.
EVOLUTION
(Spectrum from smaller to larger scale)
§ Smaller: operating on fewer events, rich event data, very fast decisions
§ Larger: moving event correlation into a graph database, operating on large event volumes
22.
VIRUSTOTAL INTEGRATION
23.
FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway or cloud
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .Net
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
24.
ENGINEERED FEATURES
32/64-bit executable, GUI subsystem, command line subsystem, file size, timestamp, debug information present, packer type, file entropy, number of sections, number of writable/readable/executable sections, distribution of section entropy, imported DLL names, imported function names, compiler artifacts, linker artifacts, resource data, protocol strings, IPs/domains, paths, product metadata, digital signature, icon content, …
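File entropy, one of the engineered features listed above, is straightforward to sketch: Shannon entropy in bits per byte over the file's raw content. A minimal plain-Python version (the production feature extractor is of course richer, e.g. per-section entropy distributions):

```python
# Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).
# High entropy is a classic hint at packed or encrypted content.
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(byte_entropy(b"\x00" * 64))                 # → 0.0 (constant data)
print(round(byte_entropy(bytes(range(256))), 1))  # → 8.0 (maximal entropy)
```

Packed or encrypted sections push entropy toward 8 bits per byte, which is why packer type and section entropy distribution appear alongside it in the feature list.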
25.
LEARNED FEATURES
• Unstructured file content
• Translated into embeddings
• Vastly larger corpus (no labels needed)
26.
COMBINING FEATURES
(Scatter plot: string-based feature vs. executable-section-size-based feature)
27.
COMBINING FEATURES
(Scatter plot: subspace projection A vs. subspace projection B)
28.
PRODUCTION FLOW
(Diagram: sample data and labels feed the cloud FX engine, which produces the model)
29.
PRODUCTION FLOW
(Diagram: sample data and labels, now combined with learned features and embeddings, feed the cloud FX engine and the sensor FX engine; supporting components include feed processing, re-processing, a Docker-based μService FX worker, endpoints, and feature rankings)
30.
STATIC ANALYSIS
Challenges & Lessons Learned
§ Performance
§ Acceptable results can be achieved quickly
§ State-of-the-art results require a bit more tweaking and feature engineering
§ Staying current requires a maintainable data pipeline
§ Hostile data
§ Wild outliers, e.g. a PNG's width field is 4 bytes, so a crafted header can claim an absurdly large image
§ All sorts of obfuscations and malformations
§ PE format !(ಠ益ಠ!)
§ What the standard says, what the loader allows…
§ Layers upon layers in an electronic archeological excavation
§ Not everything is documented
§ Tons of subtypes
§ More work
§ More opportunity