SO YOU GOT A MODEL…
DR. SVEN KRASSER, CHIEF SCIENTIST
@SVENKRASSER
A 5-MINUTE RUNDOWN OF THE COMMON AND NOT-SO-COMMON PITFALLS OF APPLYING MACHINE LEARNING IN INFORMATION SECURITY
2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
MACHINE LEARNING AT CROWDSTRIKE
§ ~40 billion events per day
§ ~800 thousand events per second peak
§ ~700 trillion bytes of sample data
§ Local decisions on endpoint and large scale analysis in cloud
§ Static and dynamic analysis techniques, various rich data sources
§ Analysts generating new ground truth 24/7
CHALLENGES FOR APPLIED ML
FALSE POSITIVE RATE
§ Most events are associated with clean executions
§ Most files on a given system are clean
§ Therefore, even low FPRs cause large numbers of FPs
§ Industry expectations driven by performance of narrow signatures
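The arithmetic behind "even low FPRs cause large numbers of FPs" can be sketched as follows. This is an illustrative calculation, not a figure from the talk: the daily volume is the ~40 billion events quoted earlier in the deck, while the FPR values and the simplification that every event is a clean event scored by a model are assumptions made for the example.

```python
def expected_false_positives(clean_events: int, fpr: float) -> float:
    """Expected number of false positives among clean events at a given FPR."""
    return clean_events * fpr


# ~40 billion events per day is the volume quoted earlier in the deck.
# Treating all of them as clean, model-scored events is a simplifying assumption.
EVENTS_PER_DAY = 40_000_000_000

# Hypothetical false positive rates, chosen only for illustration.
for fpr in (1e-3, 1e-6, 1e-9):
    fps = expected_false_positives(EVENTS_PER_DAY, fpr)
    print(f"FPR {fpr:.0e}: ~{fps:,.0f} false positives per day")
```

Under these assumptions, a one-in-a-million FPR still yields tens of thousands of false positives per day, which is why expectations set by narrow signatures are hard to meet.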
TRUE POSITIVE RATE
Repeated independent trials guarantee adversary success
§ Security cannot be solved with a single ML model
§ Need to consider various data sources (pre and post-execution)
§ Augment with non-ML techniques
[Figure: chance of at least one success for adversary (y-axis, from 1% to >99.3%) vs. number of attempts at 99% detection rate (x-axis)]
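The 1% and >99.3% values on this slide are consistent with the standard formula for independent trials: with detection rate d, the chance of at least one adversary success in n attempts is 1 - d^n. The sketch below assumes independence, as the slide does. The attempt counts are illustrative, since the figure's axis values did not survive extraction, and the function names are hypothetical.

```python
import math


def p_at_least_one_success(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries evades detection."""
    return 1.0 - detection_rate ** attempts


def attempts_needed(detection_rate: float, target: float) -> int:
    """Smallest number of attempts whose success chance reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(detection_rate))


# Illustrative attempt counts at the 99% detection rate named on the slide.
for n in (1, 10, 100, 500):
    print(f"{n:>3} attempts: {p_at_least_one_success(0.99, n):.2%}")

# Number of attempts after which the adversary's chance reaches 99.3%.
print(attempts_needed(0.99, 0.993))
```

A single attempt succeeds 1% of the time, but under these assumptions fewer than 500 attempts are enough to pass 99.3%, which is the case for layering data sources and non-ML techniques rather than relying on one model.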
UNWIELDY DATA
§ Many outliers
§ Multimodal distributions
§ Sometimes narrow modes far apart
§ Adversary-controlled features
§ Mix of sparse/dense and discrete/continuous features
DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
§ Real-world distribution (customer networks)
§ Evaluations (what customers test)
§ Testing houses (various third-party testers with varying methodologies)
§ Community resources (e.g., user submissions to CrowdStrike scanner on VirusTotal)
REPEATABLE SUCCESS
Or: the second model needs to be cheaper
§ Retraining cadence
§ Concept drift
§ Changes in data content (e.g. event field definitions)
§ Changes in data distribution (e.g. event disposition)
§ Data cleansing is expensive (conventional wisdom)
§ Needs automation
§ Labeling can be expensive
§ Ephemeral instances (data content or distribution changed)
§ Lack of sufficient observations
§ Embeddings and intermediate models
§ Keep track of input data
§ Keep track of ground truth budget
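One way to automate a check for "changes in data distribution" is to compare a categorical feature's frequencies in a reference window against a current window. The sketch below uses the population stability index (PSI), a common drift measure chosen here for illustration; the talk does not name a specific technique. The function name, the smoothing constant, and the sample disposition labels are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def population_stability_index(
    reference: Sequence[Hashable],
    current: Sequence[Hashable],
    smoothing: float = 1e-6,
) -> float:
    """PSI between two samples of a categorical feature (0.0 means identical)."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # Smoothing keeps the log finite when a category is missing on one side.
        p = ref_counts[category] / len(reference) + smoothing
        q = cur_counts[category] / len(current) + smoothing
        total += (q - p) * math.log(q / p)
    return total


# Hypothetical event dispositions in a training window and a later window.
training_window = ["clean"] * 98 + ["malicious"] * 2
later_window = ["clean"] * 90 + ["malicious"] * 10

print(population_stability_index(training_window, training_window))
print(population_stability_index(training_window, later_window))
```

A score that grows between retraining runs is a cheap signal that the input data should be re-examined before more of the ground truth budget is spent on labeling.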
IJCNN 2017
