Subverting Machine Learning Detections for fun and profit

Why it is easy to attack Machine learning systems

Published in: Technology

  1. Never-Ending Language Learner
  2. “But Watson couldn’t distinguish between polite language and profanity, which the Urban Dictionary is full of” - Eric Brown (IBM)
  3. Subverting Machine Learning for Fun and Profit
     Ram Shankar Siva Kumar, John Walton
     Email: Ram.Shankar@Microsoft.com; JoWalt@Microsoft.com
  4. Goals
     • This talk:
       • Is a primer on Adversarial Machine Learning
       • Will show, through a sampling, how ML algorithms are vulnerable
       • Illustrates how to defend against such attacks
     • This talk IS NOT:
       • An exhaustive review of all algorithms
     • End goal: Gain an intuitive understanding of ML algorithms and how to attack them
  5. Agenda
     • Motivation to Attack ML systems
     • Practical Attacks and Defenses
     • Best Practices
  6. ML is everywhere…
     “Machine Learning is shifting from an academic discipline to an industrial tool” – John Langford
  7. In Security…!!
     “The only effective approach to defending against today’s ever-increasing volume and diversity of attacks is to shift to fully automated systems capable of discovering and neutralizing attacks instantly.” - Mike Walker (on DARPA Cyber Grand Challenge)
  8. Traditional Programming: Data + Program -> Computer System -> Output
     Machine Learning: Data + Output -> Computer System -> Program
     Source: Lectures by Pedro Domingos
  9. Things to note about
     • For the program to be functional, input data must be functional
     • What does a program/model look like?
       • Literally, a bunch of numbers and data points
     • The output model can be expressed in terms of parameters:
       • Linear regression: Number of Logons = 225 * Time + 875 (R² = 0.574)
       • Non-linear: Number of Logons = 982.23 * e^(0.1305 * Time) (R² = 0.6624)
     [Figure: two plots of Number of Logons against Time, one for each fit]
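The point that a model is "literally a bunch of numbers" can be sketched in a few lines of Python. The two functions below use the coefficients quoted on the slide; the function names and the tampered slope of 25 are illustrative assumptions, not part of the talk.

```python
import math


def linear_logons(time, slope=225.0, intercept=875.0):
    """Linear fit from the slide: Number of Logons = 225 * Time + 875."""
    return slope * time + intercept


def exponential_logons(time, scale=982.23, rate=0.1305):
    """Non-linear fit from the slide: 982.23 * e^(0.1305 * Time)."""
    return scale * math.exp(rate * time)


# The whole "program" is two numbers. Anyone who can overwrite them
# (hypothetical tampered slope below) changes what the model predicts.
honest = linear_logons(4)
tampered = linear_logons(4, slope=25.0)
nonlinear = exponential_logons(4)

print(honest, tampered, round(nonlinear, 1))
```

Nothing else about the code changes between the honest and tampered predictions, which is why the later slides treat parameters as something worth protecting.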
  10. Malicious Mindset
      • Data and parameters define the model
      • By controlling the data or parameters, you can change the model
      • Where do you find them?
        • Data:
          • At the source
          • Collected in a big data store
          • Stored in the cloud (MLaaS)
        • Parameters:
          • Code repository
  11. The mother lode
      Data is collected -> Data is within anomaly detector’s purview -> Anomaly is significant for detector -> Anomaly is surfaced!
      Source: Arun Viswanathan, Kymie Tan, and Clifford Neuman, Deconstructing the Assessment of Anomaly-based Intrusion Detectors, RAID 2013.
  12. Putting it all together
      • Opportunity = ML is/will be everywhere
      • Prevalence = ML is/will be widely used in security
      • Ease = (most) ML algorithms can be easily subverted by controlling data/parameters
      • High rate of return = Once subverted, you can evade or even control the system
      Opportunity * Prevalence * Ease * High Rate of Return =
  13. Agenda
      • Motivation to Attack ML systems
      • Practical Attacks and Defenses
        • Intuitive understanding of the algorithm
        • How does the system look before the attack?
        • How does the system look after the attack?
        • How to defend against these attacks?
        • Takeaway: From evasion to total control of the system
      • Best Practices
  14. About the dataset
      • Used Enron Spam Dataset
      • Came out of the federal investigation of Enron Corporation
      • Real-world corpus of spam and ham messages
      • 619,446 email messages belonging to 158 users. After cleaning it up (removing duplicate messages and discussion threads), you end up with 200,399 messages.
  15. Naïve Bayes Algorithm
      Word        P(Word|Spam)  P(Word|Ham)
      Assets      0/3           2/3
      Assignment  0/3           2/3
      Cialis      3/3           0/3
      Group       0/3           2/3
      Viagra      1/3           0/3
      Vallium     2/3           0/3
      Choose whichever probability is higher:
      P(Spam|M) ∝ P(Spam) * P(W|Spam)
      P(Ham|M) ∝ P(Ham) * P(W|Ham)
      where P(W|·) is multiplied over every word W in the message M. For a three-word message:
      P(Spam|M) ∝ 0.5*(0/3)*(0/3)*(0/3) = 0
      P(Ham|M) ∝ 0.5*(2/3)*(2/3)*(2/3) ≈ 0.15
      Since 0.15 > 0 => Message is more likely to be Ham
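The arithmetic on this slide can be reproduced with a short Python sketch. The likelihood table is copied from the slide; the message "Assets Assignment Group" is an assumption, chosen because those are the three words whose table entries match the 0/3 and 2/3 factors in the worked example. Exact fractions are used so that no rounding is hidden.

```python
from fractions import Fraction

# (P(Word|Spam), P(Word|Ham)) as listed in the slide's table.
LIKELIHOODS = {
    "assets": (Fraction(0, 3), Fraction(2, 3)),
    "assignment": (Fraction(0, 3), Fraction(2, 3)),
    "cialis": (Fraction(3, 3), Fraction(0, 3)),
    "group": (Fraction(0, 3), Fraction(2, 3)),
    "viagra": (Fraction(1, 3), Fraction(0, 3)),
    "vallium": (Fraction(2, 3), Fraction(0, 3)),
}
PRIOR_SPAM = PRIOR_HAM = Fraction(1, 2)


def scores(message):
    """Return unnormalized (spam, ham) scores: prior times per-word likelihoods."""
    spam, ham = PRIOR_SPAM, PRIOR_HAM
    for word in message.lower().split():
        p_spam, p_ham = LIKELIHOODS[word]
        spam *= p_spam
        ham *= p_ham
    return spam, ham


# Assumed example message; the slide does not spell it out.
spam_score, ham_score = scores("Assets Assignment Group")
label = "spam" if spam_score > ham_score else "ham"
print(spam_score, ham_score, label)  # 0 4/27 ham
```

The ham score of 4/27 is about 0.148, which is the value the slide quotes.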
  16. Before Attack
      • Built a vanilla Naïve Bayes classifier on Enron email dataset (with some normalizations)
      • Goal: Given a new subject, can I predict if it is spam or ham?
      • Testing on 20% of data, you get test accuracy of 62%
  17. After the attack
      • Good Word Attack: Introduce innocuous words in the message
        E.g.: Gas Meeting Expense Report Payroll
        -> Test accuracy dropped to 52.8%
      [Figure: False Positive Rate (0 to 100) against Number of Benign words added (0 to 30)]
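A minimal sketch of the good word attack follows, assuming a multinomial Naïve Bayes model with add-one (Laplace) smoothing trained on a tiny made-up corpus. The six training messages are hypothetical and only illustrate the mechanism; they are not the Enron data and do not reproduce the accuracy figures above.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the Enron dataset.
SPAM = ["cialis viagra offer", "viagra cheap cialis", "cheap offer cialis"]
HAM = ["gas meeting report", "expense report payroll", "meeting payroll gas"]


def word_counts(docs):
    return Counter(word for doc in docs for word in doc.split())


spam_counts, ham_counts = word_counts(SPAM), word_counts(HAM)
vocab = set(spam_counts) | set(ham_counts)


def log_score(words, counts, prior=0.5):
    """Log of prior times smoothed per-word likelihoods for one class."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in words:
        score += math.log((counts[word] + 1) / (total + len(vocab)))
    return score


def classify(message):
    words = message.lower().split()
    spam = log_score(words, spam_counts)
    ham = log_score(words, ham_counts)
    return "spam" if spam > ham else "ham"


original = "cialis viagra"
# The attacker pads the same message with the innocuous words from the slide.
padded = original + " gas meeting expense report payroll"

print(classify(original), classify(padded))
```

Each added benign word multiplies the ham score by a larger factor than the spam score, so enough padding flips the decision without removing a single spammy word.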
  18. Takeaway
      • How to use in the real world: Spear phishing
      • By manipulating the input to the algorithm, we can increase the false positive rate
      • Make the system unusable!
  19. Support Vector Machines – The Ferrari of ML
      • Immensely popular
      • Quite fast
      • Deliver solid performance
      • Widely used in classification settings
      • In security settings, beginning to gain popularity in the malware community
      • Goal: Given a piece of code, is it malicious or benign?
  20. Intuition
      Which is the right decision boundary?
  21. SVM Intuition
      Choose the hyperplane that maximizes the margin between the positive and negative examples!
      Those examples on the boundary are called support vectors!
  22. Facts about SVMs
      • Output of SVM = a set of weights + support vectors
      • Once you have the support vectors (special points in the training data), the rest of the training data can be thrown away
      • Takeaway: A good part of the model is determined by the support vectors
      • Intuition: Controlling the support vectors should help us control the model
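The intuition that support vectors determine the boundary can be shown with a deliberately simplified case: separable one-dimensional data, where the maximum-margin boundary is the midpoint between the closest benign and malicious training points. This is a sketch of the idea rather than a general SVM solver, and the feature values and the injected point are made-up assumptions.

```python
def max_margin_1d(benign, malicious):
    """Maximum-margin threshold for separable 1-D data (benign < malicious).

    Returns (boundary, support_vectors). Only the two closest points of
    opposite classes matter; every other training point could be discarded.
    """
    sv_benign, sv_malicious = max(benign), min(malicious)
    if sv_benign >= sv_malicious:
        raise ValueError("classes are not separable in this toy setting")
    return (sv_benign + sv_malicious) / 2, (sv_benign, sv_malicious)


benign = [1.0, 2.0, 3.0]      # hypothetical feature values
malicious = [7.0, 8.0, 9.0]

boundary, support = max_margin_1d(benign, malicious)

# Changing a point that is not a support vector leaves the model untouched.
unchanged, _ = max_margin_1d([0.0, 2.0, 3.0], malicious)

# Poisoning: one injected training point labeled benign becomes the new
# support vector and drags the boundary toward the malicious class.
poisoned, _ = max_margin_1d(benign + [6.0], malicious)

sample = 6.2                  # attacker's sample
flagged_before = sample > boundary
flagged_after = sample > poisoned
print(boundary, unchanged, poisoned, flagged_before, flagged_after)
```

A single well-placed training point is enough here because the boundary depends on nothing except the support vectors.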
  23. Going after support vectors
  24. Takeaway
      • How it can be used in the real world: Fool the malware classifier
      • Changes to support vectors lead to changes in the decision boundary
  25. Clustering
      • Widely used learning algorithm for anomaly detection
  26. Attack Intuition
      [Figure: the cluster center before and after the attack, and the attack point to be included]
      Source: Laskov, Pavel, and Marius Kloft. "A framework for quantitative security analysis of machine learning." Proceedings of the 2nd ACM workshop on Security and artificial intelligence. ACM, 2009.
  27. Takeaway
      • In order to attack the algorithm, we don’t change the parameter (centroid) -> Simply send in data as part of “normal” traffic
      • Increased the false negative rate
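The centroid drift described above can be simulated with a small Python sketch. It assumes a one-dimensional detector that flags anything farther than a fixed radius from the centroid and folds every accepted point into a running mean; the class name, radius, starting data and target value are all hypothetical.

```python
class CentroidDetector:
    """Toy anomaly detector: normal means within `radius` of the centroid."""

    def __init__(self, points, radius):
        self.n = len(points)
        self.center = sum(points) / self.n
        self.radius = radius

    def is_normal(self, x):
        return abs(x - self.center) <= self.radius

    def observe(self, x):
        """Score x and, if it looks normal, absorb it into the centroid."""
        normal = self.is_normal(x)
        if normal:
            self.n += 1
            self.center += (x - self.center) / self.n  # running mean update
        return normal


detector = CentroidDetector([-0.5, 0.0, 0.5] * 5, radius=1.0)
target = 3.0                              # the point the attacker wants accepted
flagged_before = not detector.is_normal(target)

# The attacker never touches the centroid directly. Each point sits just
# inside the boundary, so it is accepted as normal traffic and nudges the mean.
steps = 0
while not detector.is_normal(target) and steps < 10_000:
    detector.observe(detector.center + 0.99 * detector.radius)
    steps += 1

evades_after = detector.is_normal(target)
print(flagged_before, evades_after, steps)
```

Because the running mean gives every past point equal weight, each new point moves the centroid a little less than the last, so the attacker needs patience but no special access.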
  28. Summary of Attacks
      Algorithm           Result of Attack                  What does this mean?
      Naïve Bayes         Increased false positive rate     You can make the system unusable
      K-means clustering  Increased false negative rate     You can evade detection
      SVM                 Control of the decision boundary  You have full control of what gets alerted and what doesn’t
  29. Ensembling – You can’t fool ‘em all
      - Build separate models to detect malicious activity
      - The models are chosen so that they are orthogonal
      - Each model independently assesses maliciousness
      - Results are combined using a separate function
  30. • Used Gaussian Naïve Bayes and a linear SVM in addition to Naïve Bayes
      • Used a simple majority voting method to combine the three outputs
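The combining step can be sketched as a plain majority vote. The three labels below stand in for the outputs of the three models named on the slide; the helper name and the example votes are assumptions for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most models (odd model count, two labels)."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of (Naïve Bayes, Gaussian Naïve Bayes, linear SVM)
# for one sample, before and after an attacker manages to fool one model.
all_agree = ["malicious", "malicious", "malicious"]
one_fooled = ["benign", "malicious", "malicious"]
two_fooled = ["benign", "benign", "malicious"]

print(majority_vote(all_agree), majority_vote(one_fooled), majority_vote(two_fooled))
```

With three voters the ensemble tolerates one subverted model but not two, which is why the previous slide asks for models that are orthogonal to each other.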
  31. Using Robust Learning Methods
      • Intuition: Treat the tainted data points as outliers (presumably because of noise)
      [Figure: a data point marked “Outlier?”]
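The outlier intuition can be illustrated with the simplest robust estimator available, the median. This is a stand-in chosen for brevity, not one of the robust methods the talk lists, and the readings below are made up.

```python
import statistics

# Hypothetical readings: six honest values plus two tainted ones
# injected by an attacker.
clean = [10, 11, 9, 10, 12, 10]
tainted = clean + [100, 120]

mean_clean = statistics.mean(clean)
mean_tainted = statistics.mean(tainted)      # dragged toward the tainted points
median_clean = statistics.median(clean)
median_tainted = statistics.median(tainted)  # barely moves

print(mean_clean, mean_tainted, median_clean, median_tainted)
```

Two tainted points more than triple the mean, while the median shifts by half a unit. Robust learning methods aim for the same kind of insensitivity at the level of a whole model.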
  32. Instead of                  Consider
      Vanilla Naïve Bayes         Multinomial Model (even better than multivariate Bernoulli model)
      SVM                         Robust SVM (feature noise and label noise)
      K-means with finite window  K-means with infinite window
      Logistic Regression         Robust Logistic Regression using Shift Parameters
      Vanilla PCA                 Robust PCA with Laplacian Threshold (Antidote)
  33. Caution!
      • Pros: Well-studied field with a gamut of choices
        • Optimization perspective
        • Game-theoretic perspective
        • Statistical perspective
      • Cons:
        • Some of these algorithms have higher computational complexity than standard algorithms
          • Standard SVM: 10 minutes
          • Robust SVM: 1 hr and 8 mins
          (Single node implementation, 50k data points, 20% test, no kernel)
        • Requires a lot more tuning and babysitting
  34. Agenda
      • Motivation to Attack ML systems
      • Practical Attacks and Defenses
      • Best Practices
  35. Threat Modeling
      • Adversary’s goal: Evasion? Poisoning? Deletion?
      • Adversary’s knowledge: Perfect knowledge? Limited knowledge?
        • Training set or part of it
        • Feature representation of each sample
        • Type of learning algorithm and the form of its decision function
        • Parameters and hyper-parameters of the learned model
        • Feedback from the classifier, e.g., classifier labels for samples chosen by the adversary
      • Attacker’s capability
        • Ability to modify: Complete or partial?
      Source: Biggio, Battista, Blaine Nelson, and Pavel Laskov. "Poisoning attacks against support vector machines." arXiv preprint arXiv:1206.6389 (2012).
  36. Tablestakes
      • Secure log sources
      • Secure your storage space
      • Monitor data quality
      • Treat parameters and features as secrets
      • Don’t use publicly available datasets to train your system
      • When designing the system, avoid interactive feedback
  37. 3 Key Takeaways
      1) Naïve implementations of machine learning algorithms are vulnerable to attacks.
      2) Attackers can evade detections, cause the system to be unusable or even control it.
      3) Trustworthy results depend on trustworthy data.
  38. Thank you!
      - TwC: Tim Burell
      - Azure Security: Ross Snider, Shrikant Adhirkala, Sacha Faust Bourque, Bryan Smith, Marcin Olszewski, Ashish Kurmi, Lars Mohr, Ben Ridgway
      - O365 Security: Dave Hull, Chetan Bhat, Jerry Cochran
      - MSR: Jay Stokes, Gang Wang (intern)
      - LCA: Matt Sommer
      Source: http://www.lecun.org/gallery/libpro/20011121-allyourbayes/dsc01228-02-h.jpg
