This project explores data quality for software vulnerability datasets and motivates automated data-cleaning frameworks to improve data quality and downstream tasks.
1. Data Quality for Software Vulnerability Datasets
Centre of Research on Engineering Software Technologies (CREST - @crest_uofa)
School of Computer Science, The University of Adelaide, Australia
Cyber Security Cooperative Research Centre, Australia
The 45th International Conference on Software Engineering (ICSE ‘23)
May 17, 2023
Roland Croft (roland.croft@adelaide.edu.au)
M. Ali Babar (ali.babar@adelaide.edu.au)
Mehdi Kholoosi (mehdi.kholoosi@adelaide.edu.au)
2. Growth of AI
The University of Adelaide Slide 2
AI is beginning to shape software development and software quality assurance.
3. Software Vulnerability Prediction
• Utilise AI to improve automation and effectiveness of vulnerability detection.
• Use knowledge from previous examples to automatically learn vulnerable patterns.
[Diagram: previously known vulnerabilities → machine learning → prediction]
4. Software Vulnerability Prediction
Data is the core component of any data-driven pipeline: "Garbage In, Garbage Out".
5. Software Vulnerability Datasets
Weak supervision sources:
1. Vulnerability Reports
2. Development Commit Logs
3. Static Analysis Tools
4. Synthetic Data
6. Research Objective
Aim: To gain a deep understanding of the nature of data quality for software vulnerability datasets.
Outcomes:
1. Inform the state of software vulnerability data quality and the reliability of downstream tasks.
2. Enable automated data cleaning frameworks to improve data quality and downstream tasks.
8. Research Design
Data Quality Attributes:
1. Accuracy
2. Uniqueness
3. Consistency
4. Completeness
5. Currentness
9. Research Design
Labelling Heuristic | Selected Dataset
Security            | Big-Vul
Developer           | Devign
Tool                | D2A
Synthetic           | Juliet Test Suite
10. Research Design
Inspect the change in model performance caused by attempting to reduce data quality issues.
11. Findings - Accuracy
"The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use."
Manually inspected label correctness: Big-Vul 54.3%, Devign 80.0%, D2A 28.6%, Juliet 100%.
[Figure: model performance drops when evaluated on manually verified (true) labels: -50%, -29%, and -80%.]
12. Findings - Uniqueness
"The degree to which there is no duplication in records."
Unique records: Big-Vul 83.0%, Devign 89.9%, D2A 2.1%, Juliet 16.3%.
[Figure: model performance with and without duplicates, per labelling heuristic (Security, Developer, Tool, Synthetic); removing duplicates reduced measured performance by -13.9%, -81.7%, and -10.4%.]
13. Key Takeaways
State of the art software vulnerability datasets are imperfect.
Data quality significantly affects the performance of downstream software security models.
We need better cleaning methods or more robust models to ensure reliable and effective data-driven software security.
Dataset data quality values:

Dataset | Accuracy | Uniqueness | Consistency | Completeness | Currentness
Big-Vul | 0.543    | 0.830      | 0.999       | 0.824        | 0.761
Devign  | 0.800    | 0.899      | 0.991       | 0.944        | 0.811
D2A     | 0.286    | 0.021      | 0.531       | 0.981        | 0.844
Juliet  | 1        | 0.163      | 0.750       | 1            | NA
Editor's Notes
Self-Introduction. I will be presenting our paper “Data Quality for Software Vulnerability Datasets.”
Many of us have been witnessing the huge growth in AI over the last few years, and the software engineering community is no exception. Many organizations are beginning to harness the power of AI to provide intelligent tools that assist with software development and quality assurance. For instance, ChatGPT has blown away the world with its remarkable capabilities for programming and code comprehension. A properly trained model is powerful, and it allows us to effectively automate tasks that we’d otherwise find challenging or time-consuming.
Now in the software security domain, there are actually a lot of really hard, time-consuming tasks we'd love to automate. We'll focus on software vulnerability detection. Vulnerabilities are security weaknesses in the code that can cause catastrophic consequences when exploited by attackers. The issue, however, is that they are hard to spot, and it can take developers years and years to review and test every single piece of code. This is where AI comes in. AI has shown much promise towards improving the automation and effectiveness of software vulnerability detection. The basic idea of these solutions is that we use historical records of vulnerability examples to train learning-based models that can automatically detect vulnerable patterns. This example here depicts a simple but dangerous buffer overflow, which we can show to our model, and after it works its magic it can theoretically spot the vulnerability in future.
Now as you may have guessed from the title, this talk isn’t actually going to be about this little amazing machine learning model here. No, it’s going to be about the data. Why? Because the data is actually rather important. A fundamental concept in computer science states that the quality of outputs of a system is dictated by the quality of its inputs. This concept is beautifully summarized by the saying “garbage in, garbage out.” The data is important.
So how do we get a nice cleanly labeled vulnerability dataset? Well, this is actually extremely difficult. For traditional supervised learning problems, we might get a subject matter expert to hand label the data. But we can't really do this for vulnerability data as it's extremely scarce and complex. We instead use weak supervision to obtain higher-level indicators that produce our labels. I'll go through each of the four main ways we can do this.
Firstly, over the lifetime of a project, we naturally detect and report vulnerabilities through testing and use. For open source software, these reports are often documented in security advisories. We can attempt to trace the information contained in these reports back to the original code, and this gives us an idea of which code snippets were vulnerable.
The second approach is very similar to the last one, but rather than going through a third party vulnerability database, we can just look at the development history directly for commits describing vulnerability fixes.
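The commit-log heuristic described above is often implemented as simple keyword matching over commit messages. A minimal sketch in Python (the keyword list and function name here are illustrative assumptions, not the exact heuristic used by any of the studied datasets):

```python
import re

# Illustrative security-related keywords -- an assumption, not the
# exact pattern used by any of the studied datasets.
VULN_FIX_PATTERN = re.compile(
    r"\b(vulnerab\w*|security|exploit\w*|overflow\w*|cve-\d{4}-\d+|xss|injection)\b",
    re.IGNORECASE,
)

def is_vulnerability_fix(commit_message: str) -> bool:
    """Weakly label a commit as a vulnerability fix when its message
    matches security-related keywords."""
    return bool(VULN_FIX_PATTERN.search(commit_message))

print(is_vulnerability_fix("Fix buffer overflow in parser (CVE-2021-1234)"))  # True
print(is_vulnerability_fix("Refactor logging module"))                        # False
```

Code changed by matched commits would then be labelled vulnerable, which is exactly why such labels are weak: a keyword match is only an indicator, not a verification.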
However, these two sources only provide label indicators for known vulnerabilities. This means we get very small datasets in practice. This is where our third approach comes in. What if we didn't have to wait for a developer to spot a vulnerability in order to know where it is? We can use automatic tools to scan the code and tell us where the vulnerabilities are. Of course, this heavily relies on how reliable our tools are.
Finally, to overcome these uncertainties, we can kind of cheat and simply make the data up. This is called synthetic data, where we automatically create examples of code that we know to be vulnerable or not vulnerable, using known patterns.
Now, none of these data collection approaches are perfect, unfortunately. As each of these data sources uses relatively weak label indicators, they exhibit weaknesses and produce lower quality datasets than traditional supervision. But despite the importance of the data, and the difficulties we have in repairing it, we've found data quality to be a rather ill-considered concept in software security, until now.
Hence, our goal is to gather a deep understanding of the data quality of existing software vulnerability datasets. We aim to do this for two major reasons. Firstly, our findings will help inform and raise awareness of the importance of data quality for data-driven software security research, and the impacts that data quality issues can have. Secondly, by gathering deep knowledge of the nature of data quality issues, we can learn how to prevent and overcome them. Ensuring data quality is key to enabling reliable and effective solutions for AI-based software security.
To achieve our aims, we conduct an empirical study using a simple 3 step process.
Firstly, we identify the data characteristics that we will examine. We use the ISO/IEC 25012 data quality standard to obtain 5 inherent data quality attributes: accuracy, uniqueness, consistency, completeness, and currentness. I’ll go over the definitions of these during the findings.
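As one illustration of how such attributes can be operationalised, uniqueness can be measured as the fraction of records that remain distinct after duplicate detection. A minimal sketch (the whitespace normalisation and hashing scheme are assumptions for illustration, not the paper's exact method):

```python
import hashlib

def uniqueness(samples: list[str]) -> float:
    """Uniqueness = fraction of records that are distinct after
    whitespace normalisation (one simple notion of a duplicate)."""
    def normalise(code: str) -> str:
        return " ".join(code.split())
    hashes = {hashlib.sha256(normalise(s).encode()).hexdigest() for s in samples}
    return len(hashes) / len(samples) if samples else 1.0

data = ["int x = 1;", "int  x = 1;", "int y = 2;"]  # first two are duplicates
print(uniqueness(data))  # 2 unique / 3 records ≈ 0.667
```

Real code-duplicate detection would typically normalise more aggressively (comments, identifiers), but the metric shape is the same: distinct records divided by total records.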
Secondly, we measure each of these attributes on the existing state of the art datasets. We applied quality selection criteria to collect one dataset for each of the four labeling heuristics that we previously outlined. The four datasets are Big-Vul, Devign, D2A, and the Juliet Test Suite.
Thirdly, we validated the actual importance and relevance of each attribute for our use case of software vulnerability prediction. We took state of the art prediction models and trained them on each of our datasets. Then we observed how the performance changed when we attempted to mitigate or remove the data quality issues. Now, due to the time constraints of this presentation, I'm only going to go over our findings for the first two data attributes, but our full findings are in the paper.
It's an expectation when we're working with a dataset that the data labels are actually correct, and this is what the accuracy attribute measures. For vulnerability data we are essentially checking whether our collected vulnerabilities are actually vulnerabilities. To measure this, through some quite painstaking efforts, we manually examined the labeling mechanisms that assigned the data points and verified each data point as correct or not. We found that some vulnerability datasets don't actually do a very good job of containing vulnerabilities. The worst case is the tool-based dataset, in which only 28.6% of the data was accurate, as static analysis tools have very high false positive rates. More importantly though, these label inaccuracies have catastrophic consequences when we train models with this data. When we evaluated our models using the manually verified data points, the performance dropped significantly, by up to 80%. This is because the models are learning the wrong patterns in the training data. On the other hand, synthetic data is largely correct as the vulnerabilities are specifically crafted for these purposes, rather than collected post-hoc.
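Measuring the accuracy attribute this way boils down to estimating a proportion from a manually verified sample. A hedged sketch with a normal-approximation confidence interval (the counts below are invented for illustration, not the study's actual sample sizes):

```python
import math

def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% confidence interval
    for label accuracy estimated from a manually verified sample."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 54 of 100 inspected labels verified correct (invented numbers)
p, lo, hi = accuracy_with_ci(54, 100)
print(f"accuracy = {p:.2f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The wider the interval, the more samples need to be inspected, which is part of why manual verification of vulnerability labels is so painstaking.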
Uniqueness is defined as the degree to which there is no duplication in records. Duplication for code datasets can actually be quite common. The same piece of code can get flagged multiple times or at different stages of development. The tool-based and synthetic datasets take this to the extreme however. Only 2.1% of the dataset contained unique values in the worst case.
Duplication can be a significant problem in machine learning due to data leakage. If the validation or test set that is used to guide the learning process contains samples that the model has already seen, it's like we're letting our model cheat on the test, and this wildly inflates the performance. We can see this in our experiments, where the model performance decreases after we remove duplicates. This is important, as we're now getting a truer indication of our model performance.
Looking at our findings as a whole, all the examined datasets exhibited issues in various data quality aspects. Other than the synthetic dataset, none of the labeling heuristics are able to produce very accurate labels, which means our models are just learning the wrong things. Furthermore, the larger datasets, the ones that don't rely on reported vulnerabilities, have huge problems with duplication and consistency. Current state of the art datasets are imperfect. What's more, these issues can't be ignored, as they have significant impacts on the tasks that rely on this data. To move towards the future, and to enable data-driven intelligent methods for software security, we need to make these datasets better and overcome these challenges.