Sponsored workshop at Black Hat USA 2017
https://www.blackhat.com/us-17/business-hall/schedule/#straight-talk-on-machine-learning----what-the-marketing-department-doesnt-want-you-to-know-8203
Straight Talk on Machine Learning -- What the Marketing Department Doesn’t Want You to Know
1. STRAIGHT TALK ON MACHINE LEARNING
WHAT THE MARKETING DEPARTMENT DOESN’T WANT YOU TO KNOW
DR. SVEN KRASSER, CHIEF SCIENTIST
@SVENKRASSER
2. MACHINE LEARNING AT CROWDSTRIKE
• ~50 billion events per day
• ~800 thousand events per second peak
• ~700 trillion bytes of sample data
• Local decisions on endpoint and large scale analysis in cloud
• Static and dynamic analysis techniques, various rich data sources
• Analysts generating new ground truth 24/7
2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
10. [Scatter plot; axes: “Buttock Circumference” [mm] and Weight [10^-1 kg]]
LET’S CLASSIFY
• Get more “blue” right (true positives)
• Get more “red” wrong (false positives)
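The trade-off between true positives and false positives can be illustrated with a toy example. This is only a sketch: the scores and labels below are invented for illustration (they are not data from the talk), and `rates` is a hypothetical helper. Lowering the decision threshold catches more of the positive ("blue") class, but also flags more of the negative ("red") class.

```python
def rates(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at a score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Hypothetical classifier scores (higher = more likely positive) and
# hypothetical ground-truth labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

# A lower threshold raises both rates together.
for threshold in (0.75, 0.50, 0.25):
    tpr, fpr = rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

On this toy data the true positive rate climbs from 0.60 to 1.00 as the threshold drops, while the false positive rate climbs from 0.00 to 0.60.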
19. ML IN INFOSEC APPLICATIONS
• Not a single model solving everything
• But many models working on the data in scope
• Endpoint vs cloud
• Fast response vs long observation
• Lean vs resource intensive
• Effectiveness vs interpretability
• Avoid ML blinders
• The guy in your store at 2am wielding a crowbar is not a customer
20. FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast to detect malware
– Scalable
– No need to detonate (“pre-execution”)
– Platform independent, can be done at endpoint or cloud
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .NET
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
22. ENGINEERED FEATURES FOR EXECUTABLE FILES
• 32/64-BIT EXECUTABLE
• SUBSYSTEM TYPE
• MACHINE INSTRUCTION DISTRIBUTION
• FILE SIZE
• TIMESTAMP
• DEBUG INFORMATION PRESENT
• PACKER TYPE
• FILE ENTROPY
• NUMBER OF SECTIONS
• NUMBER WRITABLE SECTIONS
• NUMBER READABLE SECTIONS
• NUMBER EXECUTABLE SECTIONS
• DISTRIBUTION OF SECTION ENTROPY
• IMPORTED DLL NAMES
• IMPORTED FUNCTION NAMES
• COMPILER ARTIFACTS
• LINKER ARTIFACTS
• RESOURCE DATA
• PROTOCOL STRINGS
• IPS/DOMAINS
• PATHS
• PRODUCT METADATA
• DIGITAL SIGNATURE
• ICON CONTENT
• …
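As one concrete illustration of an engineered feature from this list, file entropy can be computed as the Shannon entropy of the byte distribution. The sketch below is an assumption about how such a feature might be calculated, not the speaker's implementation; `shannon_entropy` is a hypothetical helper name. Values near 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy is informative for packed files.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Constant padding carries no information per byte.
print(shannon_entropy(b"\x00" * 1024))
# Every byte value appearing equally often gives the maximum of 8 bits.
print(shannon_entropy(bytes(range(256)) * 4))
```

The same function could be applied per section to approximate the "distribution of section entropy" feature, assuming the section boundaries have already been parsed from the file.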
23. ENGINEERED FEATURES FOR EXECUTABLE FILES
• Continuous features
• Categorical features
• n-hot encoding
• Embedding
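A categorical feature that can take several values at once, such as the set of imported DLL names, can be turned into a fixed-length vector with n-hot encoding. The following is a minimal sketch under stated assumptions: the vocabulary, the sample import list, and the `n_hot` helper are all hypothetical and chosen only for illustration.

```python
def n_hot(values, vocabulary):
    """Encode a collection of categorical values as a fixed-length 0/1 vector."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for value in values:
        i = index.get(value.lower())
        if i is not None:  # values outside the vocabulary are ignored
            vector[i] = 1
    return vector

# Hypothetical vocabulary of known DLL names (lowercase).
vocabulary = ["kernel32.dll", "user32.dll", "advapi32.dll", "ws2_32.dll"]
# Hypothetical imports extracted from one executable.
imports = ["KERNEL32.dll", "WS2_32.dll", "unknown.dll"]

print(n_hot(imports, vocabulary))  # [1, 0, 0, 1]
```

With a large vocabulary these vectors become long and sparse, which is one motivation for the embedding approach listed above, where each category is mapped to a short dense vector instead.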
24. LEARNED FEATURES
• Unstructured file content
• Algorithm uncovers interesting properties
• Requires a lot more input data
• Unlocks more insight
31. 99% TRUE POSITIVE RATE
[Plot: chance of at least one success for adversary (y-axis) vs. number of attempts (x-axis); labeled values: 1%, >99.3%, 500]
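The numbers on this slide follow from simple arithmetic if each attempt is assumed to be detected independently with probability 0.99: the chance that at least one of n attempts gets through is 1 - 0.99^n. The sketch below works this out; the independence assumption and the function name are mine, not stated in the slides.

```python
def chance_of_at_least_one_success(detection_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries is missed."""
    return 1.0 - detection_rate ** attempts

# With a 99% true positive rate, a single attempt succeeds 1% of the time,
# but the adversary's odds grow quickly with repeated attempts.
for n in (1, 10, 100, 500):
    print(f"{n:>3} attempts: {chance_of_at_least_one_success(0.99, n):.1%}")
```

At 500 attempts the result exceeds 99.3%, matching the value on the slide.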
32. DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
• Real-world distribution (customer networks)
• Evaluations (what customers test)
• Testing houses (various 3rd party testers with varying methodologies)
• Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
33. FIELD DISTRIBUTION
[Chart categories: Clean, Malware Type A, Malware Type B]
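One way to see why the class mix matters: the same classifier, with the same true and false positive rates, produces very different alert quality depending on how rare malware is in the data it sees. The sketch below uses Bayes' rule; the operating point and the two malware fractions are hypothetical numbers picked for illustration, not figures from the talk.

```python
def precision(tpr: float, fpr: float, malware_fraction: float) -> float:
    """Fraction of alerts that are real malware, for a given class mix."""
    true_alerts = tpr * malware_fraction
    false_alerts = fpr * (1.0 - malware_fraction)
    return true_alerts / (true_alerts + false_alerts)

# Hypothetical operating point: 99% true positive rate, 1% false positive rate.
tpr, fpr = 0.99, 0.01

# Hypothetical class mixes: a balanced test set vs. a field population
# in which only 1 in 1,000 files is malicious.
for name, fraction in [("balanced test set", 0.5), ("field (0.1% malware)", 0.001)]:
    print(f"{name}: precision = {precision(tpr, fpr, fraction):.1%}")
```

Under these assumptions, 99% of alerts are correct on the balanced set, but only about 9% are correct in the field, even though the model itself has not changed.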
39. Next-Generation Endpoint Protection
Cloud Delivered. Enriched by Threat Intelligence
MANAGED HUNTING
ENDPOINT DETECTION AND RESPONSE
NEXT-GEN ANTIVIRUS
40. KEY POINTS
• Machine Learning is an effective tool against unknown threats
• Trading off true positives and false positives
• Features matter, but don’t count them
• One of many uses is static analysis
• Detecting 99% of malware means an APT has a near-100% chance of getting malware into your environment, given enough attempts
• For ML, distributions matter
• The majority of intrusions are not malware-based
• Avoid silent failure
• Use a comprehensive array of techniques