Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

• Machine Learning for Computer Security Experts using Python & scikit-learn Anaël Bonneton

515 views

Published on

PyParis 2017
http://pyparis.org

Published in: Technology
  • Be the first to comment

  • Be the first to like this

• Machine Learning for Computer Security Experts using Python & scikit-learn Anaël Bonneton

  1. 1. SecuML: Machine Learning for Computer Security Experts Anaël Bonneton anael.bonneton@ssi.gouv.fr ANSSI, ENS Paris, INRIA PyParis 2017
  2. 2. Intrusion Detection System Information System Detection System Communications Alerts Anaël Bonneton PyParis 2017 - SecuML 2/30
  3. 3. Traditional Detection Methods Signatures: Precise detection rules built by security experts Low false alert rate Alerts easy to interprete Not robust to attack variations, to new attacks Anaël Bonneton PyParis 2017 - SecuML 3/30
  4. 4. Traditional Detection Methods Signatures: Precise detection rules built by security experts Low false alert rate Alerts easy to interprete Not robust to attack variations, to new attacks Machine Learning ! Anaël Bonneton PyParis 2017 - SecuML 3/30
  5. 5. Supervised Detection Model Training PDF PDF PDF PDFPDF PDF PDF PDF PDF PDFPDF PDF PDFPDF PDF PDF PDF PDF PDF PDF PDF Training Data Learning Algorithm Classifier Predicting PDF Ok Alert Anaël Bonneton PyParis 2017 - SecuML 4/30
  6. 6. Computer Security Specificities Non Machine Learning Experts Machine Learning pipeline Machine Learning jargon Anaël Bonneton PyParis 2017 - SecuML 5/30
  7. 7. Computer Security Specificities Non Machine Learning Experts Machine Learning pipeline Machine Learning jargon Lack of training data Few public labelled datasets No crowdsourcing Anaël Bonneton PyParis 2017 - SecuML 5/30
  8. 8. Computer Security Specificities Non Machine Learning Experts Machine Learning pipeline Machine Learning jargon Lack of training data Few public labelled datasets No crowdsourcing Need for interpretation How does the detection model work ? Why an alert has been raised ? Anaël Bonneton PyParis 2017 - SecuML 5/30
  9. 9. Applying Machine Learning Deep Learning Frameworks TensorFlow Microsoft Cognitive Toolkit Paddle Anaël Bonneton PyParis 2017 - SecuML 6/30
  10. 10. Applying Machine Learning Deep Learning Frameworks TensorFlow Microsoft Cognitive Toolkit Paddle Cloud Solutions Google Cloud ML Microsoft Azure Amazon Machine Learning Anaël Bonneton PyParis 2017 - SecuML 6/30
  11. 11. Applying Machine Learning Deep Learning Frameworks TensorFlow Microsoft Cognitive Toolkit Paddle Cloud Solutions Google Cloud ML Microsoft Azure Amazon Machine Learning Machine Learning Libraries scikit-learn Mahout, Weka, Vowpal Wabbit Anaël Bonneton PyParis 2017 - SecuML 6/30
  12. 12. SecuML: Beyound scikit-learn Scikit-learn Classification, clustering, dimension reduction, etc. Scaling, grid search, cross validation, etc. Anaël Bonneton PyParis 2017 - SecuML 7/30
  13. 13. SecuML: Beyound scikit-learn Scikit-learn Classification, clustering, dimension reduction, etc. Scaling, grid search, cross validation, etc. SecuML Automation of the Machine Learning pipeline Interactive labelling to acquire training data at low cost Graphical User Interface Anaël Bonneton PyParis 2017 - SecuML 7/30
  14. 14. SecuML Machine Learning for Computer Security Experts Algorithms (scikit-learn, metric-learn, active learning) Web user interface (flask server) Any Type of Data PDF PCAP JavaScriptDOC Netflows EXE Anaël Bonneton PyParis 2017 - SecuML 8/30
  15. 15. SecuML - Input Data features.csv id,f0,f1,f2,f3,f4,... 0,1,3,0,0,0,... 1,1,4,5,4,1,... 2,1,4,32,13,0,... 3,1,3,0,0,0,... 4,1,3,0,0,0,... 5,1,3,0,0,0,... 6,1,6,7,6,0,... 7,1,3,0,0,0,... 8,1,3,0,0,0,... ... true_labels.csv id,label,family 0,M,CVE-2017-30-10 1,M,CVE-2017-30-10 2,B,slides 3,M,CVE-2016-0945 4,M,CVE-2016-0945 5,B,user_manual 6,B,user_manual 7,B,technical_report 8,B,technical_report ... Anaël Bonneton PyParis 2017 - SecuML 9/30
  16. 16. SecuML 1 Set up a Detection Model 2 Acquire a Representative Training Dataset at Low Cost Anaël Bonneton PyParis 2017 - SecuML 10/30
  17. 17. Set up a Detection Model Anaël Bonneton PyParis 2017 - SecuML 11/30
  18. 18. Set up a Detection Model Training Data Features and Labels Training Diagnosis Deployment Anaël Bonneton PyParis 2017 - SecuML 12/30
  19. 19. Automation of the Machine Learning Pipeline Training Pipeline Scaling Cross validation to select the hyperparameters Anaël Bonneton PyParis 2017 - SecuML 13/30
  20. 20. Automation of the Machine Learning Pipeline Training Pipeline Scaling Cross validation to select the hyperparameters Validation of a Detection Model 90% Data Training 10% Data Validation Anaël Bonneton PyParis 2017 - SecuML 13/30
  21. 21. Trust in the Detection Model PDF PDF PDF PDFPDF PDF PDF PDF PDF PDFPDF PDF PDFPDF PDF PDF PDF PDF PDF PDF PDF Training Data Learning Algorithm Classifier Understanding the Classifier How does the detection model work ? Why an alert has been raised ? Anaël Bonneton PyParis 2017 - SecuML 14/30
  22. 22. Trust in the Detection Model How does the detection model work ? CreateDate_correct producer_space image_large Font_count creator_upper_case creator_lower_case Filter_count Page_count modDate_correct OpenAction_count keywords_unique_count XFA_count creator_dot author_lower_case -1.5 -1 -0.5 0 0.5 1 1.5 Anaël Bonneton PyParis 2017 - SecuML 15/30
  23. 23. Trust in the Detection Model Why an alert has been raised ? OpenAction_count keywords_unique_count creator_unique_count creator_dot AcroForm_count createDate_correct producer_space creator_lower_case modDate_correct creator_upper_case producer_unique_count image_large Filter_count -1.5 -1 -0.5 0 0.5 1 1.5 Anaël Bonneton PyParis 2017 - SecuML 16/30
  24. 24. Train and Diagnose a Detection Model Demo ./SecuML_classification LogisticRegression PDF contagio Anaël Bonneton PyParis 2017 - SecuML 17/30
  25. 25. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  26. 26. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  27. 27. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  28. 28. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  29. 29. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  30. 30. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  31. 31. Train and Diagnose a Detection Model Anaël Bonneton PyParis 2017 - SecuML 17/30
  32. 32. Acquire a Representative Training Dataset at Low Cost Anaël Bonneton PyParis 2017 - SecuML 18/30
  33. 33. Annotation System Unlabelled Pool Labelled Dataset Annotation queries Security experts = expensive resources Anaël Bonneton PyParis 2017 - SecuML 19/30
  34. 34. Active Learning Strategy Expert Active Learning Strategy Annotation queriesNew labelled instances Which instances should be annotated ? Which instances are the most informative ? Anaël Bonneton PyParis 2017 - SecuML 20/30
  35. 35. Reducing the number of annotations is not enough ! Expert Active Learning Strategy Annotation queriesNew labelled instances Low expert waiting time Feedback: Your annotations are useful ! User interface for annotating Anaël Bonneton PyParis 2017 - SecuML 21/30
  36. 36. ILAB: Interactive LABelling A whole annotation system Active learning strategy Feedback to the expert User interface for annotating Anaël Bonneton PyParis 2017 - SecuML 22/30
  37. 37. ILAB’s Active Learning Strategy Annotations Queries Anaël Bonneton PyParis 2017 - SecuML 23/30
  38. 38. ILAB’s Active Learning Strategy Annotations Queries Anaël Bonneton PyParis 2017 - SecuML 23/30
  39. 39. ILAB’s Active Learning Strategy Annotations Queries Anaël Bonneton PyParis 2017 - SecuML 23/30
  40. 40. ILAB’s Active Learning Strategy Annotations Queries Close to the decision boundary Anaël Bonneton PyParis 2017 - SecuML 23/30
  41. 41. ILAB’s Active Learning Strategy Annotations Queries Close to the decision boundary Clusters = User-defined Families Anaël Bonneton PyParis 2017 - SecuML 23/30
  42. 42. ILAB’s Active Learning Strategy Annotations Queries Close to the decision boundary Center of the clusters Clusters = User-defined Families Anaël Bonneton PyParis 2017 - SecuML 23/30
  43. 43. ILAB’s Active Learning Strategy Annotations Queries Close to the decision boundary Center of the clusters Edge of the clusters Clusters = User-defined Families Will be published at RAID 2017. Anaël Bonneton PyParis 2017 - SecuML 23/30
  44. 44. ILAB Demo ./SecuML_activeLearning ILAB PDF contagio Anaël Bonneton PyParis 2017 - SecuML 24/30
  45. 45. ILAB Anaël Bonneton PyParis 2017 - SecuML 24/30
  46. 46. ILAB Anaël Bonneton PyParis 2017 - SecuML 24/30
  47. 47. ILAB Anaël Bonneton PyParis 2017 - SecuML 24/30
  48. 48. ILAB Anaël Bonneton PyParis 2017 - SecuML 24/30
  49. 49. SecuML Anaël Bonneton PyParis 2017 - SecuML 25/30
  50. 50. SecuML 1 Acquire a Representative Training Dataset at Low Cost 2 Set up a Detection Model Anaël Bonneton PyParis 2017 - SecuML 26/30
  51. 51. What’s next ? GUI to launch the experiments Anaël Bonneton PyParis 2017 - SecuML 27/30
  52. 52. What’s next ? GUI to launch the experiments Interpretation of more complex models CreateDate_correct producer_space image_large Font_count creator_upper_case creator_lower_case Filter_count Page_count modDate_correct OpenAction_count keywords_unique_count XFA_count creator_dot author_lower_case -1.5 -1 -0.5 0 0.5 1 1.5 Anaël Bonneton PyParis 2017 - SecuML 27/30
  53. 53. What’s next ? GUI to launch the experiments Interpretation of more complex models CreateDate_correct producer_space image_large Font_count creator_upper_case creator_lower_case Filter_count Page_count modDate_correct OpenAction_count keywords_unique_count XFA_count creator_dot author_lower_case -1.5 -1 -0.5 0 0.5 1 1.5 Automatic feature extraction PDF PCAP JavaScriptDOC Netflows EXE Anaël Bonneton PyParis 2017 - SecuML 27/30
  54. 54. SecuML Algorithms and Corresponding Interfaces ! Detection Models Logistic regression, SVM, Naive Bayes, ... Performance, interpretation Interactive Machine Learning Active learning, Rare category detection Annotations, feedback Anaël Bonneton PyParis 2017 - SecuML 28/30
  55. 55. SecuML Algorithms and Corresponding Interfaces ! Clustering K-means, Gaussian Mixtures, ... Instances in each cluster Projection PCA, RCA, LDA, LMNN, .. Projection on two components Anaël Bonneton PyParis 2017 - SecuML 29/30
  56. 56. SecuML is available online ! https://github.com/ANSSI-FR/SecuML Only for Computer Security experts ? Model interpretation Interactive labelling Data visualization with projections Clustering display Anaël Bonneton PyParis 2017 - SecuML 30/30
  57. 57. SecuML is available online ! https://github.com/ANSSI-FR/SecuML Only for Computer Security experts ? Model interpretation Interactive labelling Data visualization with projections Clustering display Anaël Bonneton PyParis 2017 - SecuML 30/30

×