
IJCNN 2017

My slides from the Cybersecurity Intelligence panel at the International Joint Conference on Neural Networks 2017

Published in: Technology

  1. SO YOU GOT A MODEL… A 5 MINUTE RUNDOWN OF THE COMMON AND NOT-SO-COMMON PITFALLS OF APPLYING MACHINE LEARNING IN INFORMATION SECURITY
     DR. SVEN KRASSER, CHIEF SCIENTIST, @SVENKRASSER
  2. MACHINE LEARNING AT CROWDSTRIKE
     2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
     • ~40 billion events per day
     • ~800 thousand events per second peak
     • ~700 trillion bytes of sample data
     • Local decisions on endpoint and large scale analysis in cloud
     • Static and dynamic analysis techniques, various rich data sources
     • Analysts generating new ground truth 24/7
  3. CHALLENGES FOR APPLIED ML
  4. FALSE POSITIVE RATE
     • Most events are associated with clean executions
     • Most files on a given system are clean
     • Therefore, even low false positive rates (FPRs) cause large numbers of false positives (FPs)
     • Industry expectations driven by performance of narrow signatures
  5. TRUE POSITIVE RATE
     Repeated independent trials guarantee adversary success
     • Security cannot be solved with a single ML model
     • Need to consider various data sources (pre- and post-execution)
     • Augment with non-ML techniques
     [Figure: chance of at least one success for the adversary vs. number of attempts at a 99% detection rate, rising from 1% to >99.3%]
  6. UNWIELDY DATA
     • Many outliers
     • Multimodal distributions
     • Sometimes narrow modes far apart
     • Adversary-controlled features
     • Mix of sparse/dense and discrete/continuous features
  7. DIFFERENCE IN DISTRIBUTIONS
     Training set distribution generally differs from…
     • Real-world distribution (customer networks)
     • Evaluations (what customers test)
     • Testing houses (various third-party testers with varying methodologies)
     • Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
  8. REPEATABLE SUCCESS
     Or: the second model needs to be cheaper
     • Retraining cadence
     • Concept drift
     • Changes in data content (e.g. event field definitions)
     • Changes in data distribution (e.g. event disposition)
     • Data cleansing is expensive (conventional wisdom)
     • Needs automation
     • Labeling can be expensive
     • Ephemeral instances (data content or distribution changed)
     • Lack of sufficient observations
     • Embeddings and intermediate models
     • Keep track of input data
     • Keep track of ground truth budget
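
The base-rate argument on slide 4 can be made concrete with a rough back-of-the-envelope sketch. It reuses the ~40 billion events per day figure from slide 2 and assumes, purely for illustration, that essentially all events are clean and that every event is scored. The false positive rates are hypothetical values, not figures from the talk.

```python
# Illustrative only: the false positive rates below are hypothetical.
EVENTS_PER_DAY = 40_000_000_000  # ~40 billion events per day (slide 2)


def expected_false_positives(clean_events: int, fpr: float) -> float:
    """Expected number of clean events flagged at a given false positive rate."""
    return clean_events * fpr


if __name__ == "__main__":
    for fpr in (1e-3, 1e-6, 1e-9):
        fps = expected_false_positives(EVENTS_PER_DAY, fpr)
        print(f"FPR {fpr:.0e}: ~{fps:,.0f} false positives per day")
```

Under these assumptions, even a one-in-a-million false positive rate still yields tens of thousands of false alarms per day, which is the gap between model metrics and the expectations set by narrow signatures.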
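
The claim on slide 5 that repeated independent trials guarantee adversary success follows from the formula 1 - d^n, where d is the per-attempt detection rate and n is the number of attempts. The sketch below assumes attempts are independent and equally likely to be detected. The attempt counts are illustrative, since the values on the slide's axis did not survive in the transcript.

```python
import math


def chance_of_at_least_one_success(detection_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries evades detection."""
    return 1.0 - detection_rate ** attempts


def attempts_needed(detection_rate: float, target: float) -> int:
    """Smallest number of independent attempts whose success chance reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(detection_rate))


if __name__ == "__main__":
    for n in (1, 10, 100, 500):  # illustrative attempt counts
        p = chance_of_at_least_one_success(0.99, n)
        print(f"{n:>3} attempts at 99% detection: {p:.1%} chance of at least one success")
    print("attempts to reach 99.3%:", attempts_needed(0.99, 0.993))
```

A single attempt succeeds 1% of the time, matching the low end of the slide's figure, and a few hundred attempts are enough to push the adversary's chance past 99%.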
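
One simple way to watch for the "changes in data distribution (e.g. event disposition)" item on slide 8 is to compare label proportions between the training window and a recent window. The sketch below uses total variation distance as the comparison, which is a choice made here for illustration and not a method named in the talk. The label names, sample counts, and `DRIFT_THRESHOLD` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Share of each label among the observations."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical event dispositions in the training set and in recent data.
training = ["clean"] * 950 + ["malicious"] * 50
recent = ["clean"] * 900 + ["malicious"] * 80 + ["unknown"] * 20

DRIFT_THRESHOLD = 0.03  # hypothetical cut-off for triggering a review or retrain

tvd = total_variation_distance(proportions(training), proportions(recent))
needs_review = tvd > DRIFT_THRESHOLD

if __name__ == "__main__":
    print(f"total variation distance: {tvd:.3f}, needs review: {needs_review}")
```

A check like this is cheap enough to automate per retraining cycle, but it only covers label proportions. Changes in data content, such as redefined event fields, need separate schema-level checks.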
