Fundamentals of Machine Learning: Perspectives from a Data Scientist (ISC West 2018)

https://www.iscwest.com/en/Sessions/52328/Fundamentals-of-Machine-Learning-Perspectives-from-a-Data-Scientist

Abstract:

As our world grows more connected, organizations are collecting ever-growing amounts of data. Such data almost always contains hidden insights that can lead to better outcomes and more value. One important tool for tapping into these opportunities is Machine Learning (ML), and across all verticals more and more companies are investing in their ML operations. In this talk, we will look at what ML is, what problems it solves, how it is applied, and why companies need a strategy for employing ML.
First, we will explain the relevant fundamental concepts with a focus on supervised learning and geometric models. An intuitive data set with an accessible instance space from the physical world is used to illustrate our ability to classify data. Various models are used and visually represented to explain the underlying algorithms in an accessible fashion.
Next, we will discuss how ML is revolutionizing approaches to cybersecurity, and how the cybersecurity industry has been changing its approach to the data it collects. From there, we explore other applications in the larger domain of security.
Lastly, we will wrap up with an outlook on where this technology is going and some pointers for getting started with applying ML to the data you already collect.
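To make the supervised-learning part of the abstract concrete, here is a minimal sketch of a k-nearest-neighbor classifier, one of the geometric models the talk visualizes. The six training points, the labels, and the function name `knn_predict` are hypothetical and chosen only for illustration; they are not taken from the survey data used in the slides.

```python
import math
from collections import Counter

# Hypothetical toy data: (feature_1, feature_2) -> label.
# These values are made up for illustration and are NOT the survey data.
TRAIN = [
    ((950.0, 700.0), "f"),
    ((980.0, 720.0), "f"),
    ((1000.0, 690.0), "f"),
    ((960.0, 800.0), "m"),
    ((990.0, 830.0), "m"),
    ((1010.0, 850.0), "m"),
]


def knn_predict(train, point, k=3):
    """Return the majority label among the k training points closest to `point`."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], point))
    # Let the k nearest neighbors vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (985.0, 810.0)))
```

With unscaled features like these, the feature with the larger numeric range dominates the distance, which is one reason the choice and scaling of features matters in practice.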

Published in: Technology

  1. 1. Fundamentals of Machine Learning: Perspectives from a Data Scientist Dr. Sven Krasser, Chief Scientist, CrowdStrike, Inc.
  2. 2. REALITY HYPE
  3. 3. MASS PRODUCTION
  4. 4. Unsupervised Learning Clustering 1 2 3
  5. 5. Supervised Learning Classification
  6. 6. Supervised Learning Classification
  7. 7. 1988 Anthropometric Survey of Army Personnel
  8. 8. Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro • Over 4000 soldiers surveyed • Over 100 types of measurements • Reported by gender
  9. 9. FIRST LOOK (axes: Height [mm], Density) • Difference in distribution • Significant overlap
  10. 10. SECOND DIMENSION (axes: Height [mm], Weight [10⁻¹ kg]) • Correlation • Overlap
  11. 11. FEATURE SELECTION (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg]) • Correlation • Gender-specific slope • Reduced overlap • Selection of features matters • How to make a prediction?
  12. 12. k-NEAREST NEIGHBOR (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg]; legend: m, f)
  13. 13. SUPPORT VECTOR MACHINE (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg])
  14. 14. SUPPORT VECTOR MACHINE (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg]) • Overfitting • Classifier does not generalize • Let’s take a closer look…
  15. 15. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg]) • Classifier generalizes • Note some misclassifications • Let’s assume we want to detect males (blue) – i.e., “blue” is our positive class
  16. 16. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg])
  17. 17. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg])
  18. 18. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg])
  19. 19. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg])
  20. 20. LET’S CLASSIFY (axes: “Buttock Circumference” [mm], Weight [10⁻¹ kg]) • Get more “blue” right (true positives) • Get more “red” wrong (false positives)
  21. 21. RECEIVER OPERATING CHARACTERISTICS CURVE (axes: False Positive Rate, True Positive Rate) • Detect more by accepting more false positives
  22. 22. MORE DIMENSIONS • Some 160 dimensions • Projected back to 2-dimensional screen • Perfect separation
  23. 23. [Figure: sparse 0/1 indicator matrix with one column per area code, 400 through 414]
  24. 24. Mission Accomplished We just add more dimensions… right?
  25. 25. If not for the… Curse of Dimensionality
  26. 26. Source: https://commons.wikimedia.org/w/index.php?curid=2257082
  27. 27. Source: https://commons.wikimedia.org/w/index.php?curid=2257082
  28. 28. Dimensionality and Sparseness (axes: Height (mm), Weight [10⁻¹ kg])
  29. 29. Dimensionality and Sparseness (axes: Height (mm), Weight [10⁻¹ kg])
  30. 30. InfoSec Applications File Analysis
  31. 31. Engineered FEATURES for Executable Files 32/64 BIT EXECUTABLE SUBSYSTEM TYPE MACHINE INSTRUCTION DISTRIBUTION FILE SIZE TIMESTAMP DEBUG INFORMATION PRESENT PACKER TYPE FILE ENTROPY NUMBER OF SECTIONS NUMBER WRITABLE SECTIONS NUMBER READABLE SECTIONS NUMBER EXECUTABLE SECTIONS DISTRIBUTION OF SECTION ENTROPY IMPORTED DLL NAMES IMPORTED FUNCTION NAMES COMPILER ARTIFACTS LINKER ARTIFACTS RESOURCE DATA PROTOCOL STRINGS IPS/DOMAINS PATHS PRODUCT METADATA DIGITAL SIGNATURE ICON CONTENT …
  32. 32. • Unstructured file content • Algorithm uncovers interesting properties • Requires a lot more input data • Unlocks more insight • “Deep Learning”
  33. 33. String-based feature Executable section size-based feature
  34. 34. Subspace Projection A Subspace Projection B
  35. 35. Classification Performance
  36. 36. 99% DETECTION RATE 1% FALSE POSITIVES Malware?
  37. 37. Malware 99% DETECTION RATE 1% FALSE POSITIVES
  38. 38. 99% DETECTION RATE 1% FALSE POSITIVES Not Malware
  39. 39. 99% True Positive Rate (axes: Number of attempts, Chance of at least one success for adversary) • 1% for a single attempt • >99.3% after 500 attempts
  40. 40. Why does this matter?
  41. 41. • Large datasets require algorithmic approaches – Many sensors, e.g. IoT – Large input, e.g. video surveillance – Complex relationships, e.g. social graph • Hidden structure • Better accuracy, better response time
  42. 42. Why deploy an ML-based technology? • Making the most out of available data • Less friction, better customer experience • Automation • Empiricism (but be careful of bias in input data)
  43. 43. Why build ML-enabled products? • Increasingly effective and viable technology • Mind the innovator’s dilemma • Replace rule-based systems – ML modeling is repeatable – Maintainability – Measurability
  44. 44. Beyond the Hype: Recognizing Solid ML • True positive/false positive trade-off – ROC curve – Base rate – Overfitting • What is the data? – Does the data intuitively contain signal? – What is the system trained on? • Training data applicable to your use case • Ground truth
  45. 45. • Making defense easier • But: also making attack easier – Adversarial models – Adversarial examples “Adversarial Patch,” Brown et al., https://arxiv.org/abs/1712.09665
  46. 46. Some Adversarial Challenges for the Physical Domain • Autonomous systems – Malicious use of e.g. drones – Manipulating autonomous systems (self-driving cars) • Spoofing – Lyrebird – DeepFake • Adversarial data – Circumvent facial recognition – Road signs etc.
  47. 47. Getting Started with Scikit-Learn

          >>> from sklearn.datasets import load_iris
          >>> from sklearn import tree
          >>> iris = load_iris()
          >>> clf = tree.DecisionTreeClassifier()
          >>> clf = clf.fit(iris.data, iris.target)
          >>> clf.predict(iris.data[:1, :])
          array([0])

      Source: http://scikit-learn.org/stable/modules/tree.html#classification
  48. 48. https://developers.google.com/machine-learning/crash-course/
  49. 49. Questions
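The figures on slide 39 (a 99% true positive rate, yet a better than 99.3% chance that an adversary succeeds at least once in 500 attempts) can be checked with a short calculation. This is a sketch that assumes each attempt is detected independently with the same probability; that assumption and the function name `chance_of_at_least_one_miss` are illustrative and not taken from the slides.

```python
def chance_of_at_least_one_miss(true_positive_rate, attempts):
    """Probability that at least one of `attempts` samples evades detection.

    Assumes every attempt is detected independently with the same
    probability `true_positive_rate` (a simplifying assumption).
    """
    # Probability that all attempts are detected is tpr ** attempts;
    # the complement is the chance that at least one slips through.
    return 1.0 - true_positive_rate ** attempts


for attempts in (1, 10, 100, 500):
    chance = chance_of_at_least_one_miss(0.99, attempts)
    print(f"{attempts:>4} attempts: {chance:.4f}")
```

Under the independence assumption the result is 1 − 0.99ⁿ for n attempts, so even a small per-attempt miss rate compounds quickly when an adversary can retry cheaply.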
