Security testing and reviewing efforts are a necessity for software projects, but are time-consuming and expensive to apply. Identifying vulnerable code supports decision-making during all phases of software development. An approach for identifying vulnerable code is to identify its attack surface, the sum of all paths for untrusted data into and out of a system. Identifying the code that lies on the attack surface requires expertise and significant manual effort. This paper proposes an automated technique to empirically approximate attack surfaces through the analysis of stack traces. We hypothesize that stack traces from user-initiated crashes have several desirable attributes for measuring attack surfaces. The goal of this research is to aid software engineers in prioritizing security efforts by approximating the attack surface of a system via stack trace analysis. In a trial on Windows 8, the attack surface approximation selected 48.4% of the binaries and contained 94.6% of known vulnerabilities. Compared with vulnerability prediction models (VPMs) run on the entire codebase, VPMs run on the attack surface approximation improved recall from .07 to .1 for binaries and from .02 to .05 for source files. Precision remained at .5 for binaries, while improving from .5 to .69 for source files.
Approximating Attack Surfaces with Stack Traces [ICSE 15]
1. Approximating Attack Surfaces with Stack Traces
Christopher Theisen†, Kim Herzig‡, Patrick Morrison†, Brendan Murphy‡, Laurie Williams†
†North Carolina State University
‡Microsoft Research, Cambridge, UK
4. Before we start…
What is the “Attack Surface” of a system?
…easy to say, hard to define (practically).
The (OWASP) Attack Surface of an application is [1]:
1. …paths into and out of the application
2. the code that protects these paths
3. all valuable data used in the application
4. the code that protects that data
Ex. an early approximation of the attack surface – Manadhata [2]: only covers API entry points
[1] https://www.owasp.org/index.php?title=Attack_Surface_Analysis_Cheat_Sheet&oldid=156006
[2] Manadhata, P., Wing, J., Flynn, M., & McQueen, M. (2006, October). Measuring the attack surfaces of two FTP daemons. In Proceedings of the 2nd ACM Workshop on Quality of Protection (pp. 3-10). ACM.
5. Our goal is to aid software engineers in prioritizing security efforts by approximating the attack surface of a system via stack trace analysis.
6. Proposed Solution
Stack traces represent user activity that puts the system under stress
There’s a defect of some sort; does it have security implications?
Stack traces may localize security flaws
Crashes caused by user activity
Bad input that was handled improperly, et cetera
Crashes are a DoS attack by definition; you brought the service or
system down!
Hardware crashes are excluded
7. Research Questions
RQ1: How effectively can stack traces be used to approximate the attack surface of a system?
RQ2: Can the performance of vulnerability prediction be improved by limiting the prediction space to the approximated attack surface?
8. Overview
Catalog all code that appears on stack traces
11. Data Sources
[4] "Description of the Dr. Watson for Windows," Microsoft Corporation, [Online]. Available: http://support.microsoft.com/kb/308538/en-us.
12. Attack Surface Construction (RQ1)
Each crash record provides: data source, crash ID, binary (4,000+ distinct), filename (100,000+ distinct), and function (10,000,000+ distinct)
Crashes provide the binary and function for each stack frame, for example:
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
A small parsing sketch follows below.
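As a concrete illustration, here is a minimal Python sketch (not the authors' tooling) that parses frames in the binary!function+0xOFFSET format shown above and catalogs the binaries and functions seen on any trace; the frame strings come from this slide, and everything else is hypothetical.

import re
from collections import defaultdict

# Frames look like "binary!function+0xOFFSET", as in the examples above.
FRAME_RE = re.compile(r"^(?P<binary>[^!]+)!(?P<function>[^+]+)\+0x[0-9A-Fa-f]+$")

def catalog_frames(stack_traces):
    """Return {binary: set of functions} for every frame seen on any trace."""
    surface = defaultdict(set)
    for trace in stack_traces:
        for frame in trace:
            match = FRAME_RE.match(frame.strip())
            if match:  # skip frames that fail to parse (e.g. missing symbols)
                surface[match.group("binary")].add(match.group("function"))
    return surface

# Example with two of the frames shown above:
traces = [["foo!foobarDeviceQueueRequest+0x68", "bar!barDeviceSetup+0x08"]]
print(dict(catalog_frames(traces)))  # {'foo': {...}, 'bar': {...}}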
13. Results (RQ1)
                  Fuzzing    User-induced crashes
%binaries           0.9%        48.4%
%vulnerabilities   14.9%        94.6%
Microsoft targets fuzzing towards high-risk modules
Targeting different crashes gets different results
We are covering the majority of vulnerabilities seen!
A small sketch of the coverage computation follows below.
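The two percentages can be viewed as simple set coverage. The sketch below is illustrative only; the set names are hypothetical, and for simplicity it counts vulnerable binaries rather than individual vulnerabilities as the study does.

def coverage(surface_binaries, all_binaries, vulnerable_binaries):
    # Fraction of shipped binaries selected by the approximation, and
    # fraction of known-vulnerable binaries that the approximation covers.
    pct_binaries = len(surface_binaries & all_binaries) / len(all_binaries)
    pct_vulnerable = len(surface_binaries & vulnerable_binaries) / len(vulnerable_binaries)
    return pct_binaries, pct_vulnerable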
14. Prediction Models (RQ2)
Zimmermann et al. study [3]:
“We believe that the key for [improving prediction] is by: (1) developing new prediction techniques that deal with the ‘needle in the haystack’ problem, and (2) finding new metrics that deal with the unique characteristics of vulnerabilities and attacks.”
Stack traces point to where flawed code lives!
[3] T. Zimmermann, N. Nagappan and L. Williams, "Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista," in Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, 2010.
16. Prediction Model Construction (RQ2)
Replicated the VPM from the Windows Vista study [3]
Run the VPM with all files considered as possibly vulnerable
Repeat, but remove code not found on stack traces
The Vulnerability Prediction Model (VPM) uses 29 metrics in 6 categories, drawn from CODEMINE data [5]:
Churn
Dependency
Legacy
Size
Defects
Pre-release vulnerabilities
A minimal training sketch follows below.
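A minimal training sketch, assuming scikit-learn and a per-file metrics table; this is illustrative only, not the replicated Vista model, and the DataFrame and column names are hypothetical.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

def run_vpm(files: pd.DataFrame, metric_cols, label_col="is_vulnerable"):
    # Split the files, fit a classifier on the metric columns, and report
    # precision and recall on the held-out files.
    X_train, X_test, y_train, y_test = train_test_split(
        files[metric_cols], files[label_col], test_size=0.33, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    predictions = model.predict(X_test)
    return precision_score(y_test, predictions), recall_score(y_test, predictions)

# Run once over all files, then only over files seen on stack traces:
# run_vpm(all_files, metric_cols)
# run_vpm(all_files[all_files["on_attack_surface"]], metric_cols)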
[5] J. Czerwonka, N. Nagappan, W. Schulte and B. Murphy, "CODEMINE: Building a Software Development Data Analytics Platform at Microsoft,"
Software, IEEE, vol. 30, no. 4, pp. 64--71, 2013.
17. Results (RQ2)
Comparing the VPM run on all files vs. just attack surface files…
Precision improved from 0.50 to 0.69
Recall improved from 0.02 to 0.05
A statistically significant improvement, but not yet a practically significant one
A small hypothetical example of this precision/recall pattern follows below.
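The arithmetic below uses invented confusion-matrix counts (not the study's data) purely to illustrate how heavy class imbalance yields moderate precision alongside very low recall, the pattern reported above.

# Hypothetical counts chosen only to illustrate class imbalance.
tp, fp, fn = 5, 5, 95
precision = tp / (tp + fp)   # 0.50
recall = tp / (tp + fn)      # 0.05
print(precision, recall)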
18. Problems with Precision [6]
Are low precision predictors unsatisfactory?
No. Low precision is fine in several situations:
When the cost of missing the target is prohibitively expensive.
When only a small fraction [of] the data is returned.
When there is little or no cost in checking false alarms.
Recall and precision like to compete, especially on highly imbalanced datasets.
This seems appropriate for security flaws!
[6] Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Trans. Softw. Eng. 33, 9 (September 2007).
20. Lessons Learned - Visualizations
21. Limitations
Stack traces are a good metric for Windows 8…
Results don’t necessarily generalize:
Different levels of granularity? (File/Function)
Smaller projects? Open source?
Not operating systems?
What else can we do with VPMs?
Other learners?
Oversampling and undersampling?
22. Future Work
What else can we do with stack traces?
Frequency of appearance
Dependencies, not the entities themselves
How many stack traces are required?
Sliding window; how does the approximation change over time?
Additional metrics
Tool development: a visualization plugin for IDEs; does it actually help?