
Practical Machine Learning in Information Security

Slides from the Security Data Science Colloquium hosted by Microsoft on June 7, 2017.



  1. PRACTICAL MACHINE LEARNING IN INFORMATION SECURITY
     Dr. Sven Krasser, Chief Scientist, @svenkrasser
  2. MACHINE LEARNING AT CROWDSTRIKE (2017 CrowdStrike, Inc. All rights reserved.)
     § ~40 billion events per day
     § ~800 thousand events per second at peak
     § ~700 trillion bytes of sample data
     § Local decisions on the endpoint, large-scale analysis in the cloud
     § Static and dynamic analysis techniques, various rich data sources
     § Analysts generating new ground truth 24/7
  3. BRIEF ML EXAMPLE
     [Scatter plot: "Buttock Circumference" [mm] vs. Weight [10^-1 kg]]
     • What's this? http://tinyurl.com/MLprimer
     • Two features
     • Two classes
  4. MODEL FIT
     [Scatter plot with decision boundary: "Buttock Circumference" [mm] vs. Weight [10^-1 kg]]
     • Support Vector Machine
     • Real world: more features
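The two-feature, two-class fit on slides 3 and 4 can be sketched with scikit-learn. The data below is synthetic stand-in data (the anthropometric dataset behind the slides is not reproduced here), so treat this as an illustration of fitting an SVM on two features rather than the actual demo:

```python
# Sketch of a two-feature, two-class SVM fit, as on slides 3-4.
# The samples are synthetic stand-ins for the slide's dataset.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two features per sample (e.g. a circumference and a weight), two classes.
class_a = rng.normal(loc=[950, 60], scale=[40, 5], size=(100, 2))
class_b = rng.normal(loc=[1050, 80], scale=[40, 5], size=(100, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

# Scale features before the RBF kernel, then fit the classifier.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```

In the real-world setting the slides describe, the same fit runs over far more than two features; the two-dimensional case is only used because it can be plotted.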
  5. ML IN INFOSEC APPLICATIONS
     § Not a single model solving everything, but many models working on the data in scope
     § Endpoint vs. cloud
     § Fast response vs. long observation
     § Lean vs. resource intensive
     § Effectiveness vs. interpretability
     § Avoid ML blinders: the guy in your store at 2 a.m. wielding a crowbar is not a customer
  6. CHALLENGES FOR APPLIED ML
  7. FALSE POSITIVE RATE
     § Most events are associated with clean executions
     § Most files on a given system are clean
     § Therefore, even low FPRs cause large numbers of FPs
     § Industry expectations are driven by the performance of narrow signatures
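The base-rate effect behind slide 7 is easy to quantify against the event volume on slide 2. A back-of-the-envelope sketch, where the 99.9% clean fraction and the 0.1% FPR are illustrative assumptions rather than figures from the deck:

```python
# Base-rate arithmetic: when almost all events are clean, even a very
# low false positive rate yields a huge absolute number of FPs.
# clean_fraction and fpr are illustrative assumptions.
events_per_day = 40_000_000_000   # ~40 billion events/day (slide 2)
clean_fraction = 0.999            # assumed: nearly all events are clean
fpr = 0.001                       # a seemingly low 0.1% false positive rate

false_positives = events_per_day * clean_fraction * fpr
print(f"{false_positives:,.0f} false positives per day")
```

At these volumes a "low" 0.1% FPR still produces tens of millions of false positives per day, which is why FPR requirements in this domain are far stricter than in most ML applications.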
  8. TRUE POSITIVE RATE
     [Chart: chance of at least one adversary success vs. number of attempts]
     At a 99% detection rate (1% miss rate), 500 attempts give the adversary a >99.3% chance of at least one success.
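The figure on slide 8 follows from treating each attempt as an independent trial: with miss rate p per attempt, the adversary succeeds at least once in n attempts with probability 1 - (1 - p)^n.

```python
# Slide 8's arithmetic: a 99% per-attempt detection rate leaves a 1%
# miss rate; over many attempts, at least one miss becomes near certain.
miss_rate = 0.01
n_attempts = 500
p_success = 1 - (1 - miss_rate) ** n_attempts
print(f"{p_success:.3%}")  # > 99.3%, matching the slide
```

The takeaway is that per-sample true positive rate understates adversary advantage: a detector that is 99% effective per attempt is almost certain to be bypassed by a persistent attacker.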
  9. UNWIELDY DATA
     § Many outliers
     § Multimodal distributions, sometimes with narrow modes far apart
     § Adversary-controlled features
     § Mix of sparse/dense and discrete/continuous features
  10. DIFFERENCE IN DISTRIBUTIONS
     The training set distribution generally differs from...
     § The real-world distribution (customer networks)
     § Evaluations (what customers test)
     § Testing houses (various 3rd-party testers with varying methodologies)
     § Community resources (e.g., user submissions to the CrowdStrike scanner on VirusTotal)
  11. REPEATABLE SUCCESS
     Or: the second model needs to be cheaper
     § Retraining cadence
     § Concept drift
       § Changes in data content (e.g., event field definitions)
       § Changes in data distribution (e.g., event disposition)
     § Data cleansing is expensive (conventional wisdom); needs automation
     § Labeling can be expensive
       § Ephemeral instances (data content or distribution changed)
       § Lack of sufficient observations
     § Embeddings and intermediate models
     § Keep track of input data
     § Keep track of ground truth budget
  12. GLOBAL BEHAVIORAL ANALYSIS
  13. CLASSIFYING EVENT DATA
     § Idea: global classification
       § Observe all executions for a file, not just a single one
     § Initially only behavioral event data; in later versions also combined with static analysis data
       § Early project, focus on the data already there
     § Events fall into various categories, mainly: process data (hub), network data, DNS data, file system data
     § Capping data at 100 seconds since process start
       § Carving out a smaller problem
       § Ignoring classes of malware that are idle initially
  14. RELEVANT ARCHITECTURE IN A NUTSHELL
     [Diagram: sensor population and cloud components: event collector, message bus, S3, Spark, hash DB]
  15. HIGH-LEVEL JOB FLOW
     § Read in event data: filter by event type; filter unneeded fields
     § Aggregate per-process data: add derived features; one-to-one, combine process creation and termination events; one-to-many, combine events such as DNS requests (many per process) and add the result to the process record
     § Direct children: join to parents and copy parent data into children; aggregate children's features by their parent
     § Second-order children: aggregate second-order children by their parent's parent; aggregate with direct children
     § Process features: combine process data with children data
     § Hash features: roll up all process data by hash; output per-hash statistics as behavioral features
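The final roll-up step of the job flow above can be sketched in plain Python. The field names (`sha256`, `dns_count`, `runtime_s`) and the specific statistics are hypothetical stand-ins; the production job computes its per-hash behavioral features in Spark at scale:

```python
# Sketch of the "hash features" step: roll up per-process records by
# file hash and emit per-hash statistics as behavioral features.
# Field names and statistics are illustrative, not the production schema.
from collections import defaultdict

process_records = [
    {"sha256": "aa" * 32, "dns_count": 3, "runtime_s": 12.0},
    {"sha256": "aa" * 32, "dns_count": 7, "runtime_s": 95.0},
    {"sha256": "bb" * 32, "dns_count": 0, "runtime_s": 2.5},
]

# Group every execution record by the hash of the executed file.
by_hash = defaultdict(list)
for rec in process_records:
    by_hash[rec["sha256"]].append(rec)

# One feature record per hash, summarizing all observed executions.
hash_features = {
    h: {
        "executions": len(recs),
        "mean_dns": sum(r["dns_count"] for r in recs) / len(recs),
        "max_runtime_s": max(r["runtime_s"] for r in recs),
    }
    for h, recs in by_hash.items()
}
print(hash_features)
```

This is the "global classification" idea from slide 13 in miniature: the unit of classification is the file hash, and its features summarize every execution observed for it, not a single run.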
  16. AGGREGATION
     [Diagram: process records and their children records rolled up into a single per-hash record]
  17. LABELS
     [Diagram: clean, dirty, and unlabeled training data drawn from a sandbox deployment and a large-scale field deployment]
     § Field data contains too little malware
     § Extra malware executions in a sandbox
     § Need to consider bias introduced by the sandbox: parent process, execution time, location of file
  18. SPARK JOB DAG
  19. PROCESSING WITH SPARK
     Challenges & Lessons Learned
     § Issues due to data size
       § Lots of cycles sunk into tuning memory parameters to address job failures
       § Job structure and recovery considerations (reprocessing not always viable)
     § Issues due to the input data model
       § Highly referential event data, spreading information across many real-time events
       § Flattened tree/graph-based data, complex to handle in Spark's RDD model (see DAG); abstractions such as GraphX may help
     § Processing overhead
       § Job based on PySpark RDDs; most time spent on serialization/deserialization
       § An initial investment in migrating to Scala would have paid off in deployment
       § Life is now better with the DataFrame API
     § Development velocity with Spark
       § Trivial to set up a local dev environment
       § Trivial to add unit tests
  20. EVOLUTION
     [Chart axis: smaller to larger]
     § Operating on fewer events
     § Rich event data
     § Very fast decisions
     § Moving event correlation into a graph database
     § Operating on large event volumes
  21. ML-BASED ANTI-MALWARE
  22. VIRUSTOTAL INTEGRATION
  23. FILE ANALYSIS (AKA STATIC ANALYSIS)
     • The Good
       – Relatively fast
       – Scalable
       – No need to detonate
       – Platform independent; can be done at the gateway or in the cloud
     • The Bad
       – Limited insight due to a narrow view
       – Different file types require different techniques
       – Different subtypes need special consideration: packed files, .NET, installers, EXEs vs. DLLs
       – Obfuscations (yet good if detectable)
       – Ineffective against exploitation and malware-less attacks
       – Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
  24. ENGINEERED FEATURES
     32/64-bit executable, GUI subsystem, command line subsystem, file size, timestamp, debug information present, packer type, file entropy, number of sections, number writable, number readable, number executable, distribution of section entropy, imported DLL names, imported function names, compiler artifacts, linker artifacts, resource data, protocol strings, IPs/domains, paths, product metadata, digital signature, icon content, ...
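One of the engineered features listed above, file (or per-section) entropy, is simple to compute. A minimal sketch of Shannon entropy over a byte sequence; the thresholds at which entropy suggests packing vary in practice and are not specified by the deck:

```python
# Shannon entropy of a byte sequence, one of the engineered features on
# slide 24. Works for whole-file entropy or per-section entropy.
# High entropy is a common hint of packed or encrypted content.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy(b"\x00" * 1024))         # 0.0: constant bytes
print(shannon_entropy(bytes(range(256)) * 4))  # 8.0: uniform bytes
```

As the "hostile data" slide later notes, such features must tolerate wild outliers and malformed inputs, since the bytes being measured are attacker-controlled.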
  25. LEARNED FEATURES
     • Unstructured file content
     • Translated into embeddings
     • Vastly larger corpus (no labels needed)
  26. COMBINING FEATURES
     [Scatter plot: string-based feature vs. executable-section-size-based feature]
  27. COMBINING FEATURES
     [Scatter plot: subspace projection A vs. subspace projection B]
  28. PRODUCTION FLOW
     [Diagram components: sample data, labels, cloud FX engine, model]
  29. PRODUCTION FLOW
     [Diagram components: sample data, labels, learned features and embeddings, cloud FX engine, sensor FX engine, feed processing, re-processing, FX worker μservice (Docker), endpoints, feature rankings]
  30. STATIC ANALYSIS
     Challenges & Lessons Learned
     § Performance
       § Acceptable results can be achieved quickly
       § State-of-the-art results require a bit more tweaking and feature engineering
       § Staying current requires a maintainable data pipeline
     § Hostile data
       § Wild outliers, e.g., PNG width is encoded in 4 bytes
       § All sorts of obfuscations and malformations
     § PE format !(ಠ益ಠ!)
       § What the standard says vs. what the loader allows...
       § Layers upon layers in an electronic archeological excavation
       § Not everything is documented
     § Tons of subtypes
       § More work
       § More opportunity
