The University of Adelaide
Data Quality for Software Vulnerability
Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE ’23)
May 17, 2023
Roland Croft
roland.croft@adelaide.edu.au
M. Ali Babar
ali.babar@adelaide.edu.au
Mehdi Kholoosi
mehdi.kholoosi@adelaide.edu.au
Growth of AI
The University of Adelaide Slide 2
AI is beginning to shape
software development and
software quality assurance.
Software Vulnerability Prediction
The University of Adelaide Slide 3
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Previous known Vulnerabilities
Machine Learning
Prediction
Software Vulnerability Prediction
The University of Adelaide Slide 4
Data is the core
component of any
data-driven pipeline:
“Garbage In, Garbage Out”
Software Vulnerability Datasets
The University of Adelaide Slide 5
Weak Supervision
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
Research Objective
The University of Adelaide Slide 6
Aim
To gain a deep understanding of the nature of data quality for
software vulnerability datasets.
Outcomes
1. Inform the state of software vulnerability data quality and the
reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality
and downstream tasks.
Research Design
The University of Adelaide Slide 7
Research Design
The University of Adelaide Slide 8
Data Quality Attributes
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
Research Design
The University of Adelaide Slide 9
Labelling Heuristic    Selected Dataset
Security               Big-Vul
Developer              Devign
Tool                   D2A
Synthetic              Juliet Test Suite
Research Design
The University of Adelaide Slide 10
Inspect change in model
performance caused by
attempting to reduce data
quality issues.
Findings - Accuracy
The University of Adelaide Slide 11
“The degree to which the data has attributes that correctly represent the
true value of the intended attribute of a concept or event in a specific
context of use.”
Manually inspect label correctness:
Dataset    Accuracy
Big-Vul    54.3%
Devign     80.0%
D2A        28.6%
Juliet     100%
Lower performance on true labels: -50%, -29%, -80%
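As a rough illustration of the accuracy attribute, label accuracy can be read as the fraction of heuristic labels that agree with manual verification. The sketch below assumes that reading. The function names and sample values are hypothetical and are not taken from the paper. The slide does not state whether its performance drops are relative or absolute, so `relative_change` shows only one possible interpretation.

```python
from collections.abc import Sequence


def label_accuracy(heuristic: Sequence[int], verified: Sequence[int]) -> float:
    """Fraction of heuristic labels that match the manually verified labels."""
    if len(heuristic) != len(verified):
        raise ValueError("label sequences must have the same length")
    if not heuristic:
        raise ValueError("at least one label is required")
    agreeing = sum(h == v for h, v in zip(heuristic, verified))
    return agreeing / len(heuristic)


def relative_change(before: float, after: float) -> float:
    """Relative change in a metric (assumed reading of the slide's percentages)."""
    if before == 0:
        raise ValueError("baseline metric must be non-zero")
    return (after - before) / before


if __name__ == "__main__":
    # Hypothetical labels: 1 = vulnerable, 0 = not vulnerable.
    heuristic_labels = [1, 1, 0, 1, 0]
    verified_labels = [1, 0, 0, 1, 1]
    print(label_accuracy(heuristic_labels, verified_labels))
    # Hypothetical scores on the original labels versus the verified labels.
    print(relative_change(0.8, 0.4))
```

Scoring a model on the verified labels rather than the heuristic ones is what exposes the drop, since the heuristic labels share the same errors the model was trained on.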
Findings - Uniqueness
The University of Adelaide Slide 12
“The degree to which there is no duplication in records.”
[Bar chart: Model Performance with and without duplicates. Series: Original, No duplicates. Categories: Security, Developer, Tool, Synthetic. Vertical axis from 0 to 1.]
Annotated performance changes: -13.9%, -81.7%, -10.4%
Dataset    Uniqueness
Big-Vul    83.0%
Devign     89.9%
D2A        2.1%
Juliet     16.3%
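One simple way to approximate the uniqueness attribute is to count distinct samples after light normalisation, and to drop repeats before any train/test split so that the same sample cannot appear on both sides. The sketch below assumes exact matching after whitespace collapsing; this is an illustrative choice, not necessarily the duplicate definition used in the paper, and all function names are hypothetical.

```python
import hashlib
import re


def normalise(code: str) -> str:
    """Collapse whitespace so formatting-only differences do not hide duplicates."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Stable hash of the normalised code sample."""
    return hashlib.sha256(normalise(code).encode("utf-8")).hexdigest()


def uniqueness(samples: list[str]) -> float:
    """Fraction of samples that are distinct (1.0 means no duplication)."""
    if not samples:
        return 1.0
    return len({fingerprint(s) for s in samples}) / len(samples)


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct sample, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for sample in samples:
        key = fingerprint(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


if __name__ == "__main__":
    # Hypothetical code snippets; the second differs from the first only in spacing.
    snippets = ["int a = 1;", "int  a = 1;", "int b = 2;", "int a = 1;"]
    print(uniqueness(snippets))
    print(deduplicate(snippets))
```

Deduplicating before splitting, rather than after, is the design choice that matters here: it removes the leakage path in which a test sample has already been seen during training.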
Key Takeaways
The University of Adelaide Slide 13
State of the art software vulnerability datasets are imperfect.
Data quality significantly affects the performance of downstream software security
models.
We need better cleaning methods or more robust models to ensure reliability and
effective data driven software security.
Dataset data quality values
Dataset    Accuracy    Uniqueness    Consistency    Completeness    Currentness
Big-Vul    0.543       0.830         0.999          0.824           0.761
Devign     0.800       0.899         0.991          0.944           0.811
D2A        0.286       0.021         0.531          0.981           0.844
Juliet     1           0.163         0.750          1               NA
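For readers who want to work with these values directly, the table can be held in a small lookup structure. The sketch below copies the numbers from the slide, represents the unreported Juliet currentness as `None`, and uses a hypothetical helper, `weakest_attribute`, to pick out each dataset's lowest measured score.

```python
from typing import Optional

# Values copied from the slide; None marks the "NA" entry.
QUALITY: dict[str, dict[str, Optional[float]]] = {
    "Big-Vul": {"accuracy": 0.543, "uniqueness": 0.830, "consistency": 0.999,
                "completeness": 0.824, "currentness": 0.761},
    "Devign": {"accuracy": 0.800, "uniqueness": 0.899, "consistency": 0.991,
               "completeness": 0.944, "currentness": 0.811},
    "D2A": {"accuracy": 0.286, "uniqueness": 0.021, "consistency": 0.531,
            "completeness": 0.981, "currentness": 0.844},
    "Juliet": {"accuracy": 1.0, "uniqueness": 0.163, "consistency": 0.750,
               "completeness": 1.0, "currentness": None},
}


def weakest_attribute(scores: dict[str, Optional[float]]) -> str:
    """Name of the attribute with the lowest measured score, ignoring NA values."""
    measured = {name: value for name, value in scores.items() if value is not None}
    if not measured:
        raise ValueError("no measured attributes")
    return min(measured, key=lambda name: measured[name])


if __name__ == "__main__":
    for dataset, scores in QUALITY.items():
        print(dataset, weakest_attribute(scores))
```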


Editor's Notes

  1. Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
  2. Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.  
  3. Now in the software security domain, there are a lot of difficult, time-consuming tasks we’d love to automate. We’ll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in the future.
  4. Now as you may have guessed from the title, this talk isn’t actually going to be about this little amazing machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important.  A fundamental concept in computer science states that the quality of outputs of a system is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.  
  5. So how do we get a nice, cleanly labeled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get some subject matter expert to hand label the data. But we can’t really do this for vulnerability data as it’s extremely scarce and complex. We instead use weak supervision to obtain some higher-level indicators to produce our labels. I’ll go through each of the four main ways we can do this. Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable. The second approach is very similar to the last one, but rather than going through a third party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes. However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn’t have to wait for a developer to spot a vulnerability in order to know where it is? Well, we can use some automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tool is. Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns. Now, none of these data collection approaches are perfect, unfortunately. As each of these data sources uses relatively weak label indicators, they exhibit weaknesses and produce lower quality datasets than traditional supervision.
But despite the importance of the data, and the difficulties we have in repairing it, we’ve found data quality to be a rather ill-considered concept in software security, until now.
  6. Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
  7. To achieve our aims, we conduct an empirical study using a simple 3 step process.  
  8. Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.  
  9. Secondly, we measure each of these attributes on the existing state of the art datasets. We applied quality selection criteria to collect one dataset for each of the 4 labeling heuristics that we previously outlined. The four datasets are called Big-Vul, Devign, D2A, and the Juliet Test Suite.
  10. Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state of the art prediction models and trained them on each of our datasets. Then we observed how the performance changed when we attempted to mitigate or remove the data quality issues observed. Now due to the time constraints of this presentation, I’m only going to go over our findings for the first two data attributes, but our full findings are in the paper. Let’s get into it.
  11. When we’re working with a dataset, we expect the data labels to actually be correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. Now to measure this, through some quite painstaking efforts, we manually examined the labeling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don’t actually do a very good job of containing vulnerabilities. The worst case is the tool-based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train the models with this data. When we evaluated our models using our manually verified data points, the performance dropped significantly, by up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct as the vulnerabilities are specifically crafted for these purposes, rather than collected post-hoc.
  12. Uniqueness is defined as the degree to which there is no duplication in records. Duplication in code datasets can actually be quite common. The same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme, however. Only 2.1% of the dataset contained unique values in the worst case. Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, it’s like we’re letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we’re now getting a truer indication of our model performance.
  13. Looking at our findings as a whole, all the examined datasets exhibited issues in various data quality aspects. Other than the synthetic heuristic, none of the labeling heuristics is able to produce very accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don’t rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state of the art datasets are imperfect. What’s more, these issues can’t be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.