This project explores data quality for software vulnerability datasets and motivates automated data-cleaning frameworks to improve data quality and downstream tasks.
1. Data Quality for Software Vulnerability Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE ‘23)
May 17, 2023
Roland Croft (roland.croft@adelaide.edu.au)
M. Ali Babar (ali.babar@adelaide.edu.au)
Mehdi Kholoosi (mehdi.kholoosi@adelaide.edu.au)
2. Growth of AI
The University of Adelaide Slide 2
AI is beginning to shape software development and software quality assurance.
3. Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
[Diagram: previously known vulnerabilities → machine learning → prediction]
4. Software Vulnerability Prediction
Data is the core component of any data-driven pipeline: "Garbage In, Garbage Out".
5. Software Vulnerability Datasets
Weak supervision sources:
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
6. Research Objective
Aim: To gain a deep understanding of the nature of data quality for software vulnerability datasets.
Outcomes:
1. Inform the state of software vulnerability data quality and the reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
8. Research Design
Data Quality Attributes:
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
9. Research Design
Labelling Heuristic | Selected Dataset
Security            | Big-Vul
Developer           | Devign
Tool                | D2A
Synthetic           | Juliet Test Suite
10. Research Design
Inspect the change in model performance caused by attempting to reduce data quality issues.
11. Findings - Accuracy
"The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use."
Manually inspected label correctness: Big-Vul 54.3%, Devign 80.0%, D2A 28.6%, Juliet 100%.
[Figure: model performance drops when evaluated on manually verified (true) labels: -50%, -29%, and -80%.]
12. Findings - Uniqueness
"The degree to which there is no duplication in records."
Unique records: Big-Vul 83.0%, Devign 89.9%, D2A 2.1%, Juliet 16.3%.
[Figure: model performance with and without duplicates, per labelling heuristic (Security, Developer, Tool, Synthetic); removing duplicates reduced measured performance by -13.9%, -81.7%, and -10.4%.]
13. Key Takeaways
State of the art software vulnerability datasets are imperfect.
Data quality significantly affects the performance of downstream software security models.
We need better cleaning methods or more robust models to ensure reliable and effective data-driven software security.
Dataset data quality values:

Dataset | Accuracy | Uniqueness | Consistency | Completeness | Currentness
Big-Vul | 0.543    | 0.830      | 0.999       | 0.824        | 0.761
Devign  | 0.800    | 0.899      | 0.991       | 0.944        | 0.811
D2A     | 0.286    | 0.021      | 0.531       | 0.981        | 0.844
Juliet  | 1        | 0.163      | 0.750       | 1            | NA
Editor's Notes
Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.
Now in the software security domain, there are actually a lot of really hard, time-consuming tasks we'd love to automate. We'll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years and years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in future.
Now as you may have guessed from the title, this talk isn’t actually going to be about this little amazing machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important. A fundamental concept in computer science states that the quality of outputs of a system is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.
So how do we get a nice cleanly labeled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get a subject matter expert to hand label the data. But we can't really do this for vulnerability data as it's extremely scarce and complex. We instead use weak supervision to obtain higher-level indicators that produce our labels. I'll go through each of the four main ways we can do this.
Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable.
The second approach is very similar to the last one, but rather than going through a third party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes.
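The commit-log heuristic described above is often implemented as simple keyword matching over commit messages. A minimal sketch in Python (the keyword list and function name here are illustrative assumptions, not the exact heuristic used by any of the studied datasets):

```python
import re

# Illustrative security-related keywords -- an assumption, not the
# exact pattern used by any of the studied datasets.
VULN_FIX_PATTERN = re.compile(
    r"\b(vulnerab\w*|security|exploit\w*|overflow\w*|cve-\d{4}-\d+|xss|injection)\b",
    re.IGNORECASE,
)

def is_vulnerability_fix(commit_message: str) -> bool:
    """Weakly label a commit as a vulnerability fix when its message
    matches security-related keywords."""
    return bool(VULN_FIX_PATTERN.search(commit_message))

print(is_vulnerability_fix("Fix buffer overflow in parser (CVE-2021-1234)"))  # True
print(is_vulnerability_fix("Refactor logging module"))                        # False
```

Code changed by matched commits would then be labelled vulnerable, which is exactly why such labels are weak: a keyword match is only an indicator, not a verification.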
However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn't have to wait for a developer to spot a vulnerability in order to know where it is? We can use automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tools are.
Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns.
Now, none of these data collection approaches are perfect, unfortunately. As each of these data sources uses relatively weak label indicators, they exhibit weaknesses and produce lower quality datasets than traditional supervision. But despite the importance of the data, and the difficulties we have in repairing it, we've found data quality to be a rather ill-considered concept in software security, until now.
Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
To achieve our aims, we conduct an empirical study using a simple 3 step process.
Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.
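As one illustration of how such attributes can be operationalised, uniqueness can be measured as the fraction of records that remain distinct after duplicate detection. A minimal sketch (the whitespace normalisation and hashing scheme are assumptions for illustration, not the paper's exact method):

```python
import hashlib

def uniqueness(samples: list[str]) -> float:
    """Uniqueness = fraction of records that are distinct after
    whitespace normalisation (one simple notion of a duplicate)."""
    def normalise(code: str) -> str:
        return " ".join(code.split())
    hashes = {hashlib.sha256(normalise(s).encode()).hexdigest() for s in samples}
    return len(hashes) / len(samples) if samples else 1.0

data = ["int x = 1;", "int  x = 1;", "int y = 2;"]  # first two are duplicates
print(uniqueness(data))  # 2 unique / 3 records ≈ 0.667
```

Real code-duplicate detection would typically normalise more aggressively (comments, identifiers), but the metric shape is the same: distinct records divided by total records.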
Secondly, we measure each of these attributes on the existing state of the art datasets. We applied quality selection criteria to collect one dataset for each of the four labeling heuristics that we previously outlined. The four datasets are Big-Vul, Devign, D2A, and the Juliet Test Suite.
Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state of the art prediction models and trained them on each of our datasets. Then we observed how the performance changed when we attempted to mitigate or remove the data quality issues. Now, due to the time constraints of this presentation, I'm only going to go over our findings for the first two data attributes, but our full findings are in the paper.
It's an expectation when we're working with a dataset that the data labels are actually correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. To measure this, through some quite painstaking efforts, we manually examined the labeling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don't actually do a very good job of containing vulnerabilities. The worst case is the tool-based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train models with this data. When we evaluated our models using the manually verified data points, the performance dropped significantly, by up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct as the vulnerabilities are specifically crafted for these purposes, rather than collected post-hoc.
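Measuring the accuracy attribute this way boils down to estimating a proportion from a manually verified sample. A hedged sketch with a normal-approximation confidence interval (the counts below are invented for illustration, not the study's actual sample sizes):

```python
import math

def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% confidence interval
    for label accuracy estimated from a manually verified sample."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 54 of 100 inspected labels verified correct (invented numbers)
p, lo, hi = accuracy_with_ci(54, 100)
print(f"accuracy = {p:.2f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The wider the interval, the more samples need to be inspected, which is part of why manual verification of vulnerability labels is so painstaking.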
Uniqueness is defined as the degree to which there is no duplication in records. Duplication for code datasets can actually be quite common. The same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme however. Only 2.1% of the dataset contained unique values in the worst case.
Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, it's like we're letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we're now getting a truer indication of our model performance.
Looking at our findings as a whole, all the examined datasets exhibited issues in various data quality aspects. Other than the synthetic dataset, none of the labeling heuristics are able to produce very accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don't rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state of the art datasets are imperfect. What's more, these issues can't be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, and to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.