
Using Data Science Techniques to Detect Malicious Behavior


Data science techniques can help organizations solve their security problems — but they aren’t a silver bullet. Working directly with customers, Endgame has been able to match the right science to unsolved customer security challenges to create effective solutions. In this talk, you will experience a small part of that process by learning:
-How machine learning techniques can be used to find security insights in large amounts of data.
-The difference between supervised and unsupervised learning and the different types of security problems they can solve.
-How a lack of labeled data and the high cost of misclassifications present challenges to data scientists in the security industry.
-How Endgame has used an unsupervised clustering technique to group cloud-based infrastructure, a fundamental step in the detection of malicious behavior.



  1. Using Data Science Techniques to Help Detect Malicious Behavior (Phil Roth, Data Scientist)
  2. Key Takeaways • An introduction to key data science concepts • Challenges in applying those concepts to security data • Why focusing on aiding a human security analyst can lead to better machine learning tools • How Endgame’s enterprise product benefits from that focus
  3. Data Science Process
  4. Data Science Process: Gather Raw Data → Process and Clean Data → Explore the Data → Apply a Model → Communicate the Result
  5. Data Science Process: Data can come from many disparate sources. Raw data must be cleaned and features extracted from it. Exploring the data and finding relationships in it provides hints about which features and models will be useful.
  6. Data Science Process: Models exploit features and relationships in the data to make a statement. The output of a data product is useless without effective and actionable communication.
  7. Introduction to Machine Learning Models
  8. Supervised learning: input data is labeled, and an algorithm attempts to reproduce those labels on new, unlabeled data. Input data → label: (−3, −4, 1, 0) → 1; (−4, −3, 1, 1) → 1; (−4, −4, 0, 0) → 1; (+4, +3, 1, 0) → 0; (+3, +4, 0, 1) → 0; (+3, +3, 1, 0) → 0. New data → label: (−3, −4, 1, 1) → ???
  9. Supervised learning example: a Support Vector Machine [1] finds the best separating boundary between two classes in space. [1] http://scikit-learn.org/stable/modules/svm.html
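The supervised learning example on this slide can be sketched with scikit-learn, reusing the toy labeled data from slide 8 (the data and expected behavior are from the slides; the code itself is an illustrative sketch, not Endgame's implementation):

```python
# Minimal supervised learning sketch: fit an SVM on the labeled toy data
# from slide 8, then predict a label for the new, unlabeled example.
from sklearn.svm import SVC

X = [[-3, -4, 1, 0], [-4, -3, 1, 1], [-4, -4, 0, 0],
     [+4, +3, 1, 0], [+3, +4, 0, 1], [+3, +3, 1, 0]]
y = [1, 1, 1, 0, 0, 0]

clf = SVC(kernel="linear")  # find the best separating boundary
clf.fit(X, y)

# Reproduce the labeling on new data
print(clf.predict([[-3, -4, 1, 1]]))  # → [1]
```

The toy classes are cleanly separated by the first two features, so a linear kernel suffices here; real security data is rarely this tidy.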
  10. Unsupervised learning: input data is unlabeled, and an algorithm attempts to find hidden structure in it. Input data: (−3, −4, 1, 0), (−4, −3, 1, 1), (−4, −4, 0, 0) → group 1; (+4, +3, 1, 0), (+3, +4, 0, 1), (+3, +3, 1, 0) → group 2.
  11. Unsupervised learning example: k-means clustering iteratively improves the locations of cluster centers by moving each one closer to the mean of the points assigned to it (step 1: assign each point to its nearest center; step 2: move each center to the mean of its cluster; repeat).
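The two-step loop described on this slide can be written in a few lines of plain Python. This is an illustrative sketch on the 2-D projection of the toy points from slide 10, with made-up starting centers (in practice a library implementation such as scikit-learn's KMeans would be used):

```python
# Minimal k-means sketch: alternate between assigning points to their
# nearest center and moving each center to the mean of its cluster.
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # step 1: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # step 2: move each center to the mean of its assigned points
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

points = [(-3, -4), (-4, -3), (-4, -4), (4, 3), (3, 4), (3, 3)]
final = kmeans(points, [(-1.0, -1.0), (1.0, 1.0)])
print(final)  # centers settle near (-3.67, -3.67) and (3.33, 3.33)
```

With these starting centers the loop converges after one update; k-means is sensitive to initialization, which is why practical implementations run it several times from random starts.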
  12. Challenges with Security Data
  13. Security lacks open datasets, unlike fields such as recommendation systems and character recognition (e.g., the MNIST Database of Handwritten Digits).
  14. Security lacks open datasets: the DARPA Intrusion Detection Evaluation dataset is 15 years old and simulated, and techniques trained on it were never actionable. Sharing data in the security industry will always be a challenge, one that even President Obama is attempting to address.
  15. Security lacks easy labels: labeling is an expensive process that requires expertise. Is this binary malicious? Is this traffic an intrusion? Are these products related?
  16. Security lacks tolerance for errors: false positives lead to expensive analyst investigations and alert fatigue, and false negatives get CEOs fired.
  17. Chess analogy: machine learning in security could benefit from focusing on “human in the loop” products over “the algorithm does it all” products. 1997: IBM’s supercomputer Deep Blue vs. Garry Kasparov. 2005: Team ZackS vs. multiple Grandmasters in Freestyle Chess [2]. Human/machine teams retained an edge over machines for decades. [2] Cowen, Tyler. Average Is Over. Chapter 5. 2013.
  18. Using the Human/Machine Model
  19. Endgame implementation: cloud-deployed virtual machines are clustered based on their behavior. The results are communicated to analysts and used to improve the detection of malicious behavior.
  20. Endgame implementation: package, process, and user information is collected from the machines. DBSCAN, a clustering algorithm, groups the machines based on that information.
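The clustering step described on this slide can be sketched with scikit-learn's DBSCAN. The feature vectors below are hypothetical stand-ins for the package, process, and user information collected from each machine; this is an illustration of the technique, not Endgame's actual pipeline or features:

```python
# Minimal DBSCAN sketch: group machines by behavioral features; points
# that fit no dense group are labeled -1 (noise) and merit analyst review.
import numpy as np
from sklearn.cluster import DBSCAN

# Each row: a made-up feature vector summarizing one virtual machine
machines = np.array([
    [1.0, 0.9], [0.9, 1.1], [1.1, 1.0],  # similar to each other (group A)
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],  # similar to each other (group B)
    [9.0, 0.5],                          # behaves like neither group
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(machines)
print(labels)  # two clusters plus one noise point labeled -1
```

Unlike k-means, DBSCAN does not require the number of clusters up front and flags outliers explicitly, which is useful when anomalous machines are exactly what an analyst wants surfaced.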
  21. Key Takeaways • An introduction to key data science concepts • Existing challenges in applying those concepts to security data • Why focusing on aiding a human security analyst can lead to better machine learning tools • How Endgame’s enterprise product benefits from that focus
  22. For more information, contact: egs-info@endgame.com
