IJCNN 2017

Description

My slides from the Cybersecurity Intelligence panel at the International Joint Conference on Neural Networks 2017

Transcript

  1. SO YOU GOT A MODEL…
     A 5 MINUTE RUNDOWN OF THE COMMON AND NOT-SO-COMMON PITFALLS OF APPLYING MACHINE LEARNING IN INFORMATION SECURITY
     DR. SVEN KRASSER, CHIEF SCIENTIST, @SVENKRASSER
  2. MACHINE LEARNING AT CROWDSTRIKE (2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.)
     § ~40 billion events per day
     § ~800 thousand events per second peak
     § ~700 trillion bytes of sample data
     § Local decisions on endpoint and large-scale analysis in cloud
     § Static and dynamic analysis techniques, various rich data sources
     § Analysts generating new ground truth 24/7
  3. CHALLENGES FOR APPLIED ML
  4. FALSE POSITIVE RATE
     § Most events are associated with clean executions
     § Most files on a given system are clean
     § Therefore, even low FPRs cause large numbers of FPs
     § Industry expectations driven by performance of narrow signatures
  5. TRUE POSITIVE RATE
     Repeated independent trials guarantee adversary success
     § Security cannot be solved with a single ML model
     § Need to consider various data sources (pre- and post-execution)
     § Augment with non-ML techniques
     [Figure: chance of at least one success for adversary vs. number of attempts at 99% detection rate; labeled values: 1% and >99.3%]
  6. UNWIELDY DATA
     § Many outliers
     § Multimodal distributions
     § Sometimes narrow modes far apart
     § Adversary-controlled features
     § Mix of sparse/dense and discrete/continuous features
  7. DIFFERENCE IN DISTRIBUTIONS
     Training set distribution generally differs from…
     § Real-world distribution (customer networks)
     § Evaluations (what customers test)
     § Testing houses (various 3rd-party testers with varying methodologies)
     § Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
  8. REPEATABLE SUCCESS
     Or: the second model needs to be cheaper
     § Retraining cadence
     § Concept drift
     § Changes in data content (e.g. event field definitions)
     § Changes in data distribution (e.g. event disposition)
     § Data cleansing is expensive (conventional wisdom)
     § Needs automation
     § Labeling can be expensive
     § Ephemeral instances (data content or distribution changed)
     § Lack of sufficient observations
     § Embeddings and intermediate models
     § Keep track of input data
     § Keep track of ground truth budget
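
The false positive rate slide argues that even low FPRs cause large numbers of FPs when almost everything is clean. A minimal sketch of that arithmetic follows. The event volume is the ~40 billion per day quoted on the CrowdStrike slide; the FPR of 0.01% is a purely hypothetical value chosen for illustration, and the sketch simplifies by treating every event as clean and as individually scored by a model, which the slides do not claim.

```python
def expected_false_positives(clean_events: float, false_positive_rate: float) -> float:
    """Expected number of clean events flagged as malicious."""
    return clean_events * false_positive_rate


# ~40 billion events per day is the figure from the slides.
EVENTS_PER_DAY = 40e9

# Hypothetical FPR (0.01%), an assumption for illustration only.
ASSUMED_FPR = 1e-4

daily_fps = expected_false_positives(EVENTS_PER_DAY, ASSUMED_FPR)
print(f"Roughly {daily_fps:,.0f} false positives per day")
```

Under these assumptions, a rate that sounds small still yields on the order of four million false alarms per day.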
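
The true positive rate slide states that repeated independent trials guarantee adversary success, with chart labels of 1% and >99.3% at a 99% detection rate. Assuming each attempt is independent and detected with probability 0.99, the chance of at least one undetected attempt after n tries is 1 - 0.99^n. The sketch below reproduces the two labeled values; the function names are illustrative, and the figure of roughly 500 attempts is inferred from the formula rather than stated in the transcript.

```python
import math


def adversary_success_probability(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries goes undetected."""
    return 1.0 - detection_rate ** attempts


def attempts_needed(detection_rate: float, target: float) -> int:
    """Smallest number of independent attempts that reaches `target` success chance."""
    return math.ceil(math.log(1.0 - target) / math.log(detection_rate))


# A single attempt against a 99% detection rate succeeds about 1% of the time.
print(adversary_success_probability(0.99, 1))

# After 500 attempts (an inferred value) the chance exceeds 99.3%.
print(adversary_success_probability(0.99, 500))

# Number of attempts required to reach a 99.3% chance of at least one success.
print(attempts_needed(0.99, 0.993))
```

This is why the slide concludes that a single model is not enough: an adversary who can retry cheaply will get through eventually, so detection has to be layered across data sources and non-ML techniques.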
