Risk-Based Attack Surface Approximation: How Much Data is Enough? [ICSE SEIP 2017]
1. Risk-Based Attack Surface Approximation: How Much Data is Enough?
Chris Theisen, Brendan Murphy, Kim Herzig, Laurie Williams
North Carolina State University
Microsoft Research
3. Introduction
What is the "Attack Surface"? Quoting the Open Web Application Security Project…
• All paths for data and commands in a software system
• The data that travels these paths
• The code that implements and protects both
A concept used for security effort prioritization.
4. Crashes represent activity that puts the system under stress.
Stack traces tell us what happened.
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
Risk-Based Attack Surface Approximation (RASA)
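A minimal sketch, not from the paper, of pulling the binary and function names out of frames in the module!function+offset form shown above (the regex is an assumption about that format):

    import re

    # WinDbg-style frame: module!function+0xOFFSET
    FRAME_RE = re.compile(r"^(?P<module>[^!]+)!(?P<function>[^+]+)\+0x[0-9A-Fa-f]+$")

    def parse_frame(frame):
        """Return (module, function) for one frame, or None if it doesn't match."""
        m = FRAME_RE.match(frame.strip())
        return (m.group("module"), m.group("function")) if m else None

    print(parse_frame("foo!foobarDeviceQueueRequest+0x68"))
    # -> ('foo', 'foobarDeviceQueueRequest')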
6. Previously…
• Previous RASA study used tens of millions of crashes.
• Previous study was per binary.
[SEIP '15] Crashes: flagged 48.4% of binaries, covering 94.6% of vulnerabilities.
Great! All done, right?
[SEIP '15] C. Theisen, K. Herzig, P. Morrison, B. Murphy, and L. Williams, "Approximating Attack Surfaces with Stack Traces," in Companion Proceedings of the 37th International Conference on Software Engineering (2015).
11. Practitioner Problems
• Previous RASA study used tens of millions of crashes.
• Previous study was per binary.
• Practitioners had some issues with it…
– "Binary prioritization isn't actionable."
– "We don't have that much data!"
– "We don't store every crash we receive; we don't see the value in that."
– "We don't have historical vulnerabilities to use as a goodness measure."
12. Research Questions
• RQ1: Can the RASA approach be implemented at the source code file level with actionable results?
• RQ2: How does random sampling of crash dump stack traces affect RASA?
13. Data Sources
• Mozilla Firefox
– ~1M crashes
– Vulnerability data from the Mozilla Security Blog and bug tracker
• Windows 8.1
– ~9M crashes
– Vulnerability data from internal data sources
22. Why Does Sampling Work?
• Crashes tend not to happen in isolation.
– If something crashes once, it will likely crash again.
• For Firefox, only 6 files in the data set with a vulnerability had only one crash occurrence.
– That is out of ~300 vulnerable files among ~50,000 total files.
• If foo.cpp crashes many times, random sampling is unlikely to remove every foo.cpp crash from the dataset (see the sketch below).
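To make that concrete: under uniform random sampling at rate p, a file seen in n distinct crashes vanishes from the sample only if all n of its crashes are dropped, with probability (1 - p)^n. A quick illustrative computation (the crash counts are made up):

    # Probability a file vanishes from a random sample: all n of its
    # crashes must be excluded, each independently with probability 1 - p.
    def p_file_dropped(n_crashes, sample_rate):
        return (1 - sample_rate) ** n_crashes

    # Even a modest crash count makes total loss unlikely.
    for n in (1, 5, 20):
        print(n, round(p_file_dropped(n, sample_rate=0.25), 6))
    # 1 0.75
    # 5 0.237305
    # 20 0.003171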
23. Future Work
• We have a list of vulnerable files; now what?
– Further prioritization to assist developers.
• We're looking at:
– How the attack surface changes over time.
– How the complexity of the attack surface predicts vulnerabilities.
– How proximity to the boundary of a software system predicts vulnerabilities.
25. Conclusions
• "Binary prioritization isn't actionable."
– RASA can prioritize security effort effectively at the source code file level.
• "We don't have that much data!"
– Orders of magnitude less data required compared to previous studies.
27. Conclusions
• "We don't store every crash we receive; we don't see the value in that."
– A naïve approach like random sampling still works.
• "We don't have historical vulnerabilities to use as a goodness measure."
– We satisfied the previous complaints with less data and naïve sampling; that is evidence RASA will work on new systems.
In a 2010 paper, Tom Zimmermann compared finding vulnerabilities in code to finding "a needle in a haystack."
Based on my experience as a security engineer, it can be more like finding a needle in a field of them!
It's the most difficult part of my job.
One prioritization technique is to identify the attack surface.
Commonly used to prioritize effort, but typically based on the common knowledge of the team, which is imperfect.
We developed an approach called Risk-Based Attack Surface Approximation, or RASA…
Hypothesis was that code that crashes shares properties with vulnerable code.
Attempting to crash systems is one of the primary tools in forensics/red-team activity; we’re reverse-engineering what attackers do.
So we evangelized this approach at industry days at NCSU, tech talks, etc.
Got great feedback on making RASA actionable.
I want to highlight a few of the words from the previous slide.
When we follow up with, “well, just run it and see!” the response we got was…
So we need to run studies at a lower level of granularity, with a lot less data, while still having historical vulnerabilities to compare against.
Mine individual code artifacts (files, in this case) from the stack traces in crash dumps.
Map the code mined from each crash to code in source control; if file names aren't present in the crash, use a binary/function-to-file mapping instead.
The resulting pairing between code seen in crashes and code in source control is our approximation of the attack surface.
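A hedged sketch of that pairing step; the data shapes and the binary_to_files mapping are illustrative assumptions, not the paper's implementation:

    # Approximate the attack surface: any source file reachable from a
    # frame of at least one crash stack trace is on the surface.
    def attack_surface(crashes, repo_files, binary_to_files):
        surface = set()
        for trace in crashes:                       # one crash = list of frames
            for binary, function in trace:
                # Fall back to the binary/function -> file mapping when
                # the crash doesn't carry file names directly.
                for path in binary_to_files.get((binary, function), ()):
                    if path in repo_files:          # must exist in source control
                        surface.add(path)
        return surface

    # Toy inputs, names hypothetical:
    crashes = [[("foo", "fooDeviceSetup"), ("bar", "barAllDone")]]
    repo_files = {"drivers/foo.cpp", "drivers/bar.cpp", "ui/main.cpp"}
    binary_to_files = {("foo", "fooDeviceSetup"): ["drivers/foo.cpp"],
                       ("bar", "barAllDone"): ["drivers/bar.cpp"]}
    print(sorted(attack_surface(crashes, repo_files, binary_to_files)))
    # ['drivers/bar.cpp', 'drivers/foo.cpp']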
Simple by design: not language-limited, works in multiple domains, and all you need is crashes with stack traces. Highly flexible!
To limit data, we randomly sample the crashes placed into the process at the first step: 10% of crashes, 20% of crashes, et cetera.
We draw multiple samples at each "level" (10%, 20%, etc.) and look at how the attack surface changes from sample to sample at each sampling level.
The question we're trying to answer: does our approximation change greatly with different samplings of crash dump stack traces?
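A sketch of that experiment, reusing the attack_surface helper and toy data above; pairwise Jaccard similarity between same-rate samples is my illustrative choice of stability metric, not necessarily the paper's exact comparison:

    import random

    def jaccard(a, b):
        """Set similarity in [0, 1]; 1.0 means identical surfaces."""
        return len(a & b) / len(a | b) if (a | b) else 1.0

    def sampling_stability(crashes, repo_files, binary_to_files,
                           rate, trials=10, seed=0):
        rng = random.Random(seed)
        surfaces = [attack_surface([c for c in crashes if rng.random() < rate],
                                   repo_files, binary_to_files)
                    for _ in range(trials)]
        # Average similarity over every pair of samples drawn at this rate.
        pairs = [jaccard(surfaces[i], surfaces[j])
                 for i in range(trials) for j in range(i + 1, trials)]
        return sum(pairs) / len(pairs)

    for rate in (0.1, 0.25, 0.5):
        print(rate, sampling_stability(crashes, repo_files, binary_to_files, rate))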
Chart of the percentage of files on the attack surface vs. the percentage of vulnerabilities covered by the attack surface.
Vulnerabilities are 5 times as likely to be in code that crashes as in code that does not!
A great place to start before running other tools, like static analysis or vulnerability prediction models; it limits the work you need to do.
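Both the chart's axes and the density claim can be computed directly from the surface and a historical vulnerability list; a sketch with hypothetical inputs (the real figures come from the Firefox and Windows data sets):

    # Percentage of files flagged, percentage of known-vulnerable files
    # covered, and relative vulnerability density inside vs. outside the
    # surface (assumes some vulnerabilities fall outside the surface).
    def surface_metrics(surface, all_files, vulnerable_files):
        pct_files = 100.0 * len(surface) / len(all_files)
        pct_vulns = 100.0 * len(surface & vulnerable_files) / len(vulnerable_files)
        outside = all_files - surface
        density_in = len(surface & vulnerable_files) / len(surface)
        density_out = len(outside & vulnerable_files) / len(outside)
        return pct_files, pct_vulns, density_in / density_out

    surface = {"drivers/foo.cpp", "drivers/bar.cpp"}
    all_files = surface | {"ui/main.cpp", "ui/menu.cpp", "net/sock.cpp"}
    vulns = {"drivers/foo.cpp", "ui/main.cpp"}
    print(surface_metrics(surface, all_files, vulns))
    # -> (40.0, 50.0, 1.5)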
Tiny variations: we can use a quarter of the available data and our metrics change by less than a percentage point, with no change between samples.
Comparing against the previous study, we see a similar 2:1 ratio of vulnerabilities to files, so vulnerabilities are twice as dense in crashing code across all samples.