The University of Adelaide
Data Quality for Software Vulnerability Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE '23)
May 17, 2023
Roland Croft
roland.croft@adelaide.edu.au
M. Ali Babar
ali.babar@adelaide.edu.au
Mehdi Kholoosi
mehdi.kholoosi@adelaide.edu.au
Growth of AI
AI is beginning to shape software development and software quality assurance.
Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
Pipeline: previously known vulnerabilities -> machine learning -> prediction
Data is the core component of any data-driven pipeline: “Garbage In, Garbage Out”
Software Vulnerability Datasets
Weak Supervision:
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
Research Objective
Aim: To gain a deep understanding of the nature of data quality for software vulnerability datasets.
Outcomes:
1. Inform the state of software vulnerability data quality and the reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
Research Design
Data Quality Attributes:
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
Labelling Heuristic   Selected Dataset
Security              Big-Vul
Developer             Devign
Tool                  D2A
Synthetic             Juliet Test Suite
Inspect change in model performance caused by attempting to reduce data quality issues.
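The comparison described here can be expressed as a relative change in a performance metric before and after cleaning. The following is a minimal sketch, assuming a single scalar metric (for example F1) and assuming the change is reported relative to the original score; the function name and the example scores are hypothetical and are not results from the study.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    if before == 0:
        raise ValueError("baseline performance must be non-zero")
    return (after - before) / before * 100.0


# Hypothetical scores for illustration only (not results from the study).
original_score = 0.80  # model trained and evaluated on the original data
cleaned_score = 0.60   # same model after a data quality issue is reduced

print(f"{relative_change(original_score, cleaned_score):.1f}%")
```

A negative value means the model scored lower once the data quality issue was reduced, which suggests the original score was inflated by that issue.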
Findings - Accuracy
“The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.”
Manually inspect label correctness:
Big-Vul 54.3%
Devign 80.0%
D2A 28.6%
Juliet 100%
Lower performance on true labels: -50%, -29%, -80%
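Label accuracy from a manual inspection can be computed as the share of inspected samples whose dataset label agrees with the reviewer's label. This is a minimal sketch under that assumption; the function name and the five-sample audit below are hypothetical and do not reproduce the study's sampling procedure.

```python
from typing import Sequence


def label_accuracy(dataset_labels: Sequence[int],
                   verified_labels: Sequence[int]) -> float:
    """Fraction of inspected samples whose dataset label matches the
    manually verified label (1 = vulnerable, 0 = not vulnerable)."""
    if len(dataset_labels) != len(verified_labels):
        raise ValueError("both label lists must have the same length")
    if not dataset_labels:
        raise ValueError("at least one inspected sample is required")
    matches = sum(d == v for d, v in zip(dataset_labels, verified_labels))
    return matches / len(dataset_labels)


# Hypothetical audit of five samples labelled vulnerable or not by a dataset.
dataset_labels = [1, 1, 1, 0, 1]
verified_labels = [1, 0, 1, 0, 0]  # labels assigned by a human reviewer

print(label_accuracy(dataset_labels, verified_labels))
```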
Findings - Uniqueness
“The degree to which there is no duplication in records.”
Uniqueness:
Big-Vul 83.0%
Devign 89.9%
D2A 2.1%
Juliet 16.3%
[Figure: Model Performance with and without duplicates. Bars: Original, No duplicates. Groups: Security, Developer, Tool, Synthetic. Vertical axis from 0 to 1. Annotated changes: -13.9%, -81.7%, -10.4%]
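A uniqueness score of this kind can be sketched as the number of distinct records divided by the total number of records. The slides do not say how duplicates were identified, so the sketch below assumes exact matches after whitespace normalisation; the helper names and the sample functions are hypothetical.

```python
import re
from typing import Iterable, List


def normalise(code: str) -> str:
    """Collapse whitespace so formatting-only differences do not hide duplicates."""
    return re.sub(r"\s+", " ", code).strip()


def uniqueness(samples: Iterable[str]) -> float:
    """Share of records that are distinct after normalisation."""
    items = [normalise(s) for s in samples]
    if not items:
        raise ValueError("at least one sample is required")
    return len(set(items)) / len(items)


def drop_duplicates(samples: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalised record, preserving order."""
    seen = set()
    kept = []
    for sample in samples:
        key = normalise(sample)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Hypothetical samples: the first two differ only in whitespace.
samples = [
    "int f() { return 0; }",
    "int f()  {\n  return 0;\n}",
    "int g() { return 1; }",
]
print(uniqueness(samples))
print(len(drop_duplicates(samples)))
```

Removing duplicates before splitting the data into training and test sets keeps identical records from appearing on both sides of the split.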
Key Takeaways
• State-of-the-art software vulnerability datasets are imperfect.
• Data quality significantly affects the performance of downstream software security models.
• We need better cleaning methods or more robust models to ensure reliable and effective data-driven software security.
Dataset data quality values:

Dataset   Accuracy  Uniqueness  Consistency  Completeness  Currentness
Big-Vul   0.543     0.830       0.999        0.824         0.761
Devign    0.800     0.899       0.991        0.944         0.811
D2A       0.286     0.021       0.531        0.981         0.844
Juliet    1         0.163       0.750        1             NA

  • 1. The University of Adelaide Data Quality for Software Vulnerability Datasets Centre of Research on Engineering Software Technologies (CREST - @crest_uofa) School of Computer Science, The University of Adelaide, Australia Cyber Security Cooperative Research Centre, Australia The 45th International Conference on Software Engineering (ICSE ‘23) May 17, 2023 Roland Croft roland.croft@adelaide.edu.au M. Ali Babar ali.babar@adelaide.edu.au Mehdi Kholoosi mehdi.kholoosi@adelaide.edu.au
  • 2. Growth of AI: AI is beginning to shape software development and software quality assurance.
  • 3. Software Vulnerability Prediction: Utilise AI to improve automation and effectiveness of vulnerability detection. Use knowledge from previous examples to automatically learn vulnerable patterns. Diagram: Previous known Vulnerabilities, Machine Learning, Prediction.
  • 4. Software Vulnerability Prediction: Utilise AI to improve automation and effectiveness of vulnerability detection. Use knowledge from previous examples to automatically learn vulnerable patterns. Diagram: Previous known Vulnerabilities, Machine Learning, Prediction. Data is the core component of any data-driven pipeline: “Garbage In, Garbage Out”
  • 5. Software Vulnerability Datasets: Weak Supervision. 1. Vulnerability Reports 2. Development Commit Logs 3. Static Analysis Tools 4. Synthetic Data
  • 6. Research Objective. Aim: To gain deep understanding into the nature of data quality for software vulnerability datasets. Outcomes: 1. Inform the state of software vulnerability data quality and the reliability of downstream tasks. 2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
  • 7. Research Design
  • 8. Research Design. Data Quality Attributes: 1. Accuracy 2. Uniqueness 3. Consistency 4. Completeness 5. Currentness
  • 9. Research Design. Labelling heuristic and selected dataset: Security: Big-Vul. Developer: Devign. Tool: D2A. Synthetic: Juliet Test Suite.
  • 10. Research Design. Inspect change in model performance caused by attempting to reduce data quality issues.
  • 11. Findings - Accuracy “The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.” Manually inspect label correctness: Big-Vul 54.3%, Devign 80.0%, D2A 28.6%, Juliet 100%. Lower performance on true labels: -50%, -29%, -80%.
  • 12. Findings - Uniqueness “The degree to which there is no duplication in records.” Unique records: Big-Vul 83.0%, Devign 89.9%, D2A 2.1%, Juliet 16.3%. [Chart: Model Performance with and without duplicates (Original vs. No duplicates) for the Security, Developer, Tool, and Synthetic datasets; labelled changes: -13.9%, -81.7%, -10.4%.]
  • 13. Key Takeaways. State-of-the-art software vulnerability datasets are imperfect. Data quality significantly affects the performance of downstream software security models. We need better cleaning methods or more robust models to ensure reliability and effective data-driven software security. Dataset data quality values:
      Dataset | Accuracy | Uniqueness | Consistency | Completeness | Currentness
      Big-Vul | 0.543 | 0.830 | 0.999 | 0.824 | 0.761
      Devign | 0.800 | 0.899 | 0.991 | 0.944 | 0.811
      D2A | 0.286 | 0.021 | 0.531 | 0.981 | 0.844
      Juliet | 1 | 0.163 | 0.750 | 1 | NA

Editor's Notes

  1. Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
  2. Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.  
  3. Now in the software security domain, there’s actually a lot of really hard, time-consuming tasks we’d love to automate. We’ll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years and years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in future.
  4. Now as you may have guessed from the title, this talk isn’t actually going to be about this little amazing machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important.  A fundamental concept in computer science states that the quality of outputs of a system is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.  
  5. So how do we get a nice cleanly labeled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get some subject matter expert to hand label the data. But we can’t really do this for vulnerability data as it’s extremely scarce and complex. We instead use weak supervision to obtain some higher-level indicators to produce our labels. I’ll go through each of the four main ways we can do this. Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable. The second approach is very similar to the last one, but rather than going through a third party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes. However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn’t have to wait for a developer to spot a vulnerability in order to know where it is? Well, we can use some automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tool is. Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns. Now none of these data collection approaches are perfect, unfortunately. As each of these data sources is using relatively weak label indicators, they exhibit weaknesses and produce lower quality datasets than traditional supervision.
But despite the importance of the data, and the difficulties we have in repairing it, we’ve found the data quality to actually be a rather ill-considered concept in software security, until now.
  6. Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
  7. To achieve our aims, we conduct an empirical study using a simple three-step process.
  8. Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.  
  9. Secondly, we measure each of these attributes on the existing state of the art datasets. We applied quality selection criteria to collect one dataset for each of the 4 labeling heuristics that we previously outlined. The four datasets are called Big-Vul, Devign, D2A, and the Juliet Test Suite.
  10. Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state of the art prediction models and trained them on each of our datasets. Then we saw how the performance changed when we attempted to mitigate or remove the data quality issues observed. Now due to the time constraints of this presentation, I’m only going to go over our findings for the first two data attributes, but our full findings are in the paper. Let’s get into it.
  11. It’s an expectation that when we’re working with a dataset, the data labels are actually correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. Now to measure this, through some quite painstaking efforts, we manually examined the labeling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don’t actually do a very good job of containing vulnerabilities. The worst case is for the tool based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train the models with this data. When we evaluated our models using our manually verified data points, the performance dropped significantly, up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct as the vulnerabilities are specifically crafted for these purposes, rather than collected post-hoc.
  12. Uniqueness is defined as the degree to which there is no duplication in records. Duplication for code datasets can actually be quite common. The same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme however. Only 2.1% of the dataset contained unique values in the worst case. Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, well, it’s like we’re letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we’re now getting a truer indication of our model performance.
  13. Looking at our findings as a whole, all the examined datasets exhibited issues in various data quality aspects. Other than the synthetic dataset, none of the labeling heuristics are able to produce very accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don’t rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state of the art datasets are imperfect. What’s more, these issues can’t be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.
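
To make the accuracy and uniqueness measurements from notes 11 and 12 a bit more concrete, here is a minimal Python sketch. It is not the pipeline used in the paper. It assumes that accuracy is the fraction of manually inspected labels judged correct, that uniqueness is the number of distinct samples divided by the total, and that two samples count as duplicates when they match after collapsing whitespace. The helper names (`normalize`, `fingerprint`, `accuracy_score`, `uniqueness_score`, `deduplicate`, `split`) and the toy samples are hypothetical and exist only for illustration.

```python
import hashlib
import random
import re


def normalize(code: str) -> str:
    """Collapse whitespace runs (an assumed, simplistic duplicate rule)."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash of the normalized code, used as the duplicate key."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def accuracy_score(verified):
    """Fraction of manually inspected labels that were judged correct."""
    return sum(1 for ok in verified if ok) / len(verified)


def uniqueness_score(samples):
    """Distinct samples divided by total samples (assumed definition)."""
    return len({fingerprint(code) for code, _ in samples}) / len(samples)


def deduplicate(samples):
    """Keep only the first occurrence of each distinct sample."""
    seen, kept = set(), []
    for code, label in samples:
        key = fingerprint(code)
        if key not in seen:
            seen.add(key)
            kept.append((code, label))
    return kept


def split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy (code, label) pairs; label 1 means "vulnerable". Purely illustrative.
samples = [
    ("strcpy(buf, src);", 1),
    ("strcpy(buf,   src);", 1),    # same code, different spacing
    ("strncpy(buf, src, n);", 0),
    ("memcpy(dst, src, len);", 1),
    ("strncpy(buf, src, n);", 0),  # exact duplicate
]

print(uniqueness_score(samples))                   # 3 distinct of 5 -> 0.6
print(accuracy_score([True, False, True, True]))   # 3 correct of 4 -> 0.75

# Deduplicate BEFORE splitting so no test sample was already seen in training.
train, test = split(deduplicate(samples))
train_keys = {fingerprint(code) for code, _ in train}
test_keys = {fingerprint(code) for code, _ in test}
assert not (train_keys & test_keys)
```

The ordering is the point of the sketch: removing duplicates before the train/test split is what prevents the leakage described in note 12, whereas splitting first can leave copies of the same sample on both sides.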