2. 2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
MACHINE LEARNING AT CROWDSTRIKE
§ ~40 billion events per day
§ ~800 thousand events per second peak
§ ~700 trillion bytes of sample data
§ Local decisions on endpoint and large scale analysis in cloud
§ Static and dynamic analysis techniques, various rich data sources
§ Analysts generating new ground truth 24/7
3. BRIEF ML EXAMPLE
[Figure: plot with axes “Buttock Circumference” [mm] and Weight [10^-1 kg]]
• What’s this?
• Two features
• Two classes
http://tinyurl.com/MLprimer
2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
5. ML IN INFOSEC APPLICATIONS
§ Not a single model solving everything
§ But many models working on the data in scope
§ Endpoint vs cloud
§ Fast response vs long observation
§ Lean vs resource intensive
§ Effectiveness vs interpretability
§ Avoid ML blinders
§ The guy in your store at 2am wielding a crowbar is not a customer
7. FALSE POSITIVE RATE
§ Most events are associated with clean executions
§ Most files on a given system are clean
§ Therefore, even low FPRs cause large numbers of FPs
§ Industry expectations driven by performance of narrow signatures
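A back-of-the-envelope sketch of the base-rate point above, using the ~40 billion events per day figure from the earlier slide. The false positive rates, and the simplification that a model scores every event and that every event is clean, are illustrative assumptions and not measured values.

```python
# Illustrative arithmetic only: the FPR values below are assumptions,
# not measurements from any real model.
EVENTS_PER_DAY = 40_000_000_000  # ~40 billion events per day (from the deck)


def expected_false_positives(clean_events: int, fpr: float) -> float:
    """Expected number of clean events that get flagged as malicious."""
    return clean_events * fpr


if __name__ == "__main__":
    # Simplification: treat every event as clean and as scored by the model.
    for fpr in (1e-3, 1e-6, 1e-9):
        fps = expected_false_positives(EVENTS_PER_DAY, fpr)
        print(f"FPR {fpr:.0e}: ~{fps:,.0f} false positives per day")
```

Under these assumptions, even a one-in-a-million FPR still yields tens of thousands of false positives per day.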
10. DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
§ Real-world distribution (customer networks)
§ Evaluations (what customers test)
§ Testing houses (various 3rd party testers with varying methodologies)
§ Community resources (e.g. user submissions to the CrowdStrike scanner on VirusTotal)
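One way to see why the differing distributions above matter: the same classifier, with the same true and false positive rates, shows very different precision depending on the class mix it is evaluated on. The rates and class mixes below are hypothetical values chosen for illustration only.

```python
def precision(tpr: float, fpr: float, malicious_fraction: float) -> float:
    """Precision = TP / (TP + FP) for a given share of malicious samples."""
    tp = tpr * malicious_fraction
    fp = fpr * (1.0 - malicious_fraction)
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical class mixes; neither is taken from the deck.
    scenarios = {
        "balanced test set": 0.5,
        "mostly clean network": 0.001,
    }
    for name, malicious_fraction in scenarios.items():
        p = precision(tpr=0.99, fpr=0.01, malicious_fraction=malicious_fraction)
        print(f"{name}: precision = {p:.3f}")
```

A model that looks excellent on a balanced test set can still produce mostly false alarms where nearly everything is clean.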
11. REPEATABLE SUCCESS
Or: the second model needs to be cheaper
§ Retraining cadence
§ Concept drift
§ Changes in data content (e.g. event field definitions)
§ Changes in data distribution (e.g. event disposition)
§ Data cleansing is expensive (conventional wisdom)
§ Needs automation
§ Labeling can be expensive
§ Ephemeral instances (data content or distribution changed)
§ Lack of sufficient observations
§ Embeddings and intermediate models
§ Keep track of input data
§ Keep track of ground truth budget
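A minimal sketch of automating two of the checks listed above: detecting changes in data content (which event fields exist) and changes in data distribution (how a categorical field such as event disposition is distributed). Total variation distance is used here as one simple drift measure; it is a swapped-in choice for illustration, not necessarily what the authors used, and all field names are hypothetical.

```python
from collections import Counter


def field_changes(old_fields, new_fields):
    """Report fields that appeared or disappeared between two data snapshots."""
    old, new = set(old_fields), set(new_fields)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def total_variation_distance(old_values, new_values):
    """0.0 means identical categorical distributions, 1.0 means disjoint ones.

    Both inputs must be non-empty sequences of hashable category labels.
    """
    p, q = Counter(old_values), Counter(new_values)
    n_p, n_q = sum(p.values()), sum(q.values())
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


if __name__ == "__main__":
    # Hypothetical snapshots from two retraining cycles.
    print(field_changes(["pid", "path"], ["pid", "image_path"]))
    last_cycle = ["clean"] * 9 + ["malicious"]
    this_cycle = ["clean"] * 5 + ["malicious"] * 5
    print(total_variation_distance(last_cycle, this_cycle))
```

Checks like these could run before each retraining cycle, with a threshold on the distance deciding when relabeling or data cleansing is worth its cost.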
13. CLASSIFYING EVENT DATA
§ Idea: global classification
§ Observe all executions for a file, not just a single one
§ Initially only behavioral event data
§ In later versions also combined with static analysis data
§ Early project, focus on the data already there
§ Events fall into various categories, mainly:
§ Process data (hub)
§ Network data
§ DNS data
§ File system data
§ Capping data at 100 seconds since process start
§ Carving out a smaller problem
§ Ignoring classes of malware that are idle initially
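The global classification idea above can be sketched as a per-file aggregation over all observed executions, keeping only events within the first 100 seconds of each process. The event record layout (`file_hash`, `process_id`, `category`, `seconds_since_start`) is a hypothetical one made up for this sketch; the real event schema is not given in the deck.

```python
from collections import Counter, defaultdict

CAP_SECONDS = 100  # from the slide: capping data at 100 seconds since process start


def aggregate_by_file(events):
    """Summarize capped event counts per file across all of its executions."""
    counts = defaultdict(Counter)   # file_hash -> event category -> count
    executions = defaultdict(set)   # file_hash -> distinct process ids seen
    for ev in events:
        if ev["seconds_since_start"] > CAP_SECONDS:
            continue  # outside the observation window, ignored
        counts[ev["file_hash"]][ev["category"]] += 1
        executions[ev["file_hash"]].add(ev["process_id"])
    return {
        file_hash: {
            "executions": len(executions[file_hash]),
            "event_counts": dict(counts[file_hash]),
        }
        for file_hash in counts
    }


if __name__ == "__main__":
    # Hypothetical events: one file observed in two executions.
    sample = [
        {"file_hash": "aaa", "process_id": 1, "category": "process", "seconds_since_start": 0},
        {"file_hash": "aaa", "process_id": 1, "category": "dns", "seconds_since_start": 5},
        {"file_hash": "aaa", "process_id": 2, "category": "network", "seconds_since_start": 50},
        {"file_hash": "aaa", "process_id": 2, "category": "network", "seconds_since_start": 250},
    ]
    print(aggregate_by_file(sample))
```

The resulting per-file summaries would then serve as input features for a classifier. Malware that stays idle past the cap contributes nothing, which is the trade-off the slide accepts.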
19. PROCESSING WITH SPARK
Challenges & Lessons Learned
§ Issues due to data size
§ Lots of cycles sunk into tuning memory parameters to address job failures
§ Job structure and recovery considerations (reprocessing not always viable)
§ Issues due to input data model
§ Highly referential event data, spreading information across many real-time events
§ Flattened tree/graph-based data
§ Complex to handle in Spark’s RDD model (see DAG)
§ Abstractions such as GraphX may help
§ Processing overhead
§ Job based on PySpark RDDs – most time spent on serialization/deserialization
§ Initial investment in migrating to Scala would have paid off in deployment
§ Life is now better with the DataFrame API
§ Development velocity with Spark
§ Trivial to set up a local dev environment
§ Trivial to add unit tests
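To illustrate the "highly referential, flattened tree" input model mentioned above, here is a plain-Python sketch (deliberately not Spark code) that rebuilds a process tree from flat events that only reference their parent. The field names `process_id` and `parent_id` are hypothetical. Following such references needs lookups across many records, which is the kind of access pattern that is awkward to express over independently processed records.

```python
from collections import defaultdict


def build_children_index(process_events):
    """Map each parent id to the ids of the processes it started."""
    children = defaultdict(list)
    for ev in process_events:
        children[ev["parent_id"]].append(ev["process_id"])
    return children


def descendants(children, root_id):
    """Collect all processes below root_id, guarding against reference cycles."""
    found, stack, seen = [], [root_id], {root_id}
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


if __name__ == "__main__":
    # Hypothetical flat process events, each pointing at its parent.
    flat = [
        {"process_id": 1, "parent_id": None},
        {"process_id": 2, "parent_id": 1},
        {"process_id": 3, "parent_id": 1},
        {"process_id": 4, "parent_id": 2},
    ]
    index = build_children_index(flat)
    print(sorted(descendants(index, 1)))
```

On a single machine the index fits in a dictionary. At cluster scale the same traversal turns into repeated joins or a graph abstraction, which is where the slide suggests GraphX may help.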
23. FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway or cloud
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .NET
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
25. LEARNED FEATURES
• Unstructured file content
• Translated into embeddings
• Vastly larger corpus (no labels needed)
30. STATIC ANALYSIS
Challenges & Lessons Learned
§ Performance
§ Acceptable results can be achieved quickly
§ State-of-the-art results require a bit more tweaking and feature engineering
§ Staying current requires a maintainable data pipeline
§ Hostile data
§ Wild outliers, e.g. PNG width is encoded in 4 bytes
§ All sorts of obfuscations and malformations
§ PE format !(ಠ益ಠ!)
§ What the standard says, what the loader allows…
§ Layers upon layers in an electronic archeological excavation
§ Not everything is documented
§ Tons of subtypes
§ More work
§ More opportunity
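A small sketch of the "wild outliers" point above: the PNG width is a 4-byte big-endian integer in the IHDR chunk, so a hostile file can claim a width in the billions. One possible defense, assumed here and not taken from the deck, is to clip and log-scale such a value before using it as a feature. The cap of 65536 is an arbitrary illustrative choice.

```python
import math
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_width(data: bytes):
    """Return the declared width from the IHDR chunk, or None if malformed."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte type "IHDR",
    # then 4-byte width and 4-byte height (big-endian, unsigned).
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    (width,) = struct.unpack(">I", data[16:20])
    return width


def squash(value: int, cap: int = 1 << 16) -> float:
    """Clip to `cap`, then log-scale into the range [0, 1]."""
    return math.log2(1 + min(value, cap)) / math.log2(1 + cap)


if __name__ == "__main__":
    # Hand-built header claiming the largest width that fits in 4 bytes.
    hostile = (
        PNG_SIGNATURE
        + struct.pack(">I", 13)
        + b"IHDR"
        + struct.pack(">II", 0xFFFFFFFF, 1)
    )
    width = png_width(hostile)
    print(width, squash(width))
```

Returning None for malformed input, rather than raising, keeps "could not parse" available as a signal of its own, in line with the earlier note that obfuscations are useful when detectable.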