SlideShare a Scribd company logo
1 of 33
Download to read offline
Mining Software Repositories for Security: Data
Quality Issues Lessons from Trenches
Ali Babar
CREST – Centre for Research on Engineering Software Technologies
MSR4PS Workshop, 18th November, 2022
Brief Bio
M. Ali Babar
• Professor, School of Computer Science, University of
Adelaide, Australia – Nov. 2013 -
• Founding Lead – The Centre for Research on
Software Technologies (CREST) – Nov 2013 –
• Theme Lead – Platforms and Architectures for
Security as Service, Cyber Security Cooperative
Research Centre (CSCRC)
• For current research areas: please visit CREST
website: crest-centre.net
Previous Work History
• Senior research and academic positions in UK, Denmark
& Ireland
Trustworthy Digital Services
Data
Science
Integration &
Interoperability
Autonomy
Cyber
Security
Artificial
intelligence
Software System Engineering
DevOps
Peopl e, Pr oc es ses , and Tool s
MDE Design Space
Socio-
T
echnical
Technol ogies
IoT / CPS Cloudlet Cloud Blockchain
Applications
Domains
Health
Systems
AgFood
Systems
Defence
Systems
• Software Security and MSR
• Trends & Examples of Leveraging MSR for Security
• Some of the Challenges of MSR Research
• Data Centric Research of Our Group
• Data Quality Problems Experienced/Observed
• Some Recommendations to Deal with Challenges
Talk’s Roadmap
• Approximately 90% of cyber incidents are caused by the
vulnerabilities rooted in software – proprietary or sourced
• Software Bill Of Material (SBOM) is becoming ineffective in
answering critical questions
• Q1: Do we really know what’s in software coming into the organisation?
• Q2: How do we establish trust and preserve the security of software coming
into the organisation?
Software Security and MSR
Trends of Leveraging MSR4S
Research Domain Data Sources Goals Potential Benefits
Malware & Malicious Codes • Source code
• Software Packages
• Commits
• Posts & Q&A &Discussion
• Raising Awareness
• Increasing Data
availability & Quality
• Detection vulnerabilities
• Fix & Prevention
• Automation of
summarization
• High accuracy
• Intelligent tools
• Timeliness
• Low cost (Based on
public available data
source)
Vulnerability & Bug • Source code
• Software Packages
• Commits
• Issues & Bug Reports
• Social media Posts & Discussion
Software Supply Chain
Security
• Software Packages
• Pre-trained models toolkit
• Package Manager
License Compliance • Source code
• License Agreements
Privacy • Source code
• Tool Chains
Secure Coding (Coding
Standards and Best Practices)
Detection
• Source code
• Commits
Trends of MSR for Security & Privacy Research
• Automated vulnerability assessment for third-party
app/library patching (exploitability, impact, and severity
based on CVSS (Common Vulnerability Scoring System))
using ML models
• National Vulnerabilities Database (NVD)
• Automated software vulnerability assessment with concept drift,
MSR ‘19
• Unveiling developers' key challenges & solutions when
addressing vulnerabilities in practice
• Q&A websites (Stack Overflow & Security StackExchange
• A large-scale study of security vulnerability support on developer
Q&A websites, EASE '21
8
MSR4S - Single Source of Data
• Predicting vulnerabilities in code functions and
statements
• LineVD: statement-level vulnerability detection using
graph neural networks, MSR ‘22
• Comparison learning-based and rule-based techniques
for vulnerability prediction
• An empirical study of rule-based and learning-based
approaches for static application security testing, ESEM
'21
• NVD (Reported fixing commits) + GitHub (Source code of commits)
9
MSR4S – Multiple Sources of Data
• Automated prior-fixing vulnerability assessment
(vulnerability exploitability, impact, and severity based
on CVSS) in code functions using ML models
• DeepCVA: automated commit-level vulnerability assessment with deep
multi-task learning, ASE '21
• Automated just-in-time vulnerability assessment
(exploitability, impact, and severity based on CVSS) in code
commits using ML models
• On the use of fine-grained vulnerable code statements for software
vulnerability assessment models, MSR ’22
• NVD (Reported fixing commits & CVSS) + GitHub
(Source code of commits)
10
MSR4S – Multiple Sources of Data
• Data Extraction Phase
• Limited support for automated extraction
• Keeping the extracted data up to data
• Keeping the data scraper up to date
• Data Pre-processing
• Different version control systems (e.g., git, CVS, SVN)
• Lack of standard in tools for bug reporting and data collection
• Data quality
• Data Integration Phase
• Correlating data points across technical domains
• Linking data from different resources (e.g GitHub data with stack overflow posts)
Challenges in MSR Research
Data Centric R&D for Software Security
Data-Driven Software Security at CREST
Software Vulnerability
Prediction
Software Vulnerability
Assessment &
Prioritisation
Software Vulnerability Knowledge Support
1. A survey on data-driven software vulnerability assessment and prioritization (CSUR '22)
2. On the use of fine-grained vulnerable code statements for software vulnerability
assessment models (MSR '22)
3. An investigation into inconsistency of software vulnerability severity across data sources
(SANER '22)
4. DeepCVA: Automated commit-level vulnerability assessment with deep multi-task
learning (ASE '21)
5. Automated software vulnerability assessment with concept drift (MSR '19)
1. LineVD: statement-level vulnerability detection using graph
neural networks (MSR '22)
2. Noisy label learning for security defects (MSR '22)
3. KGSecConfig: A Knowledge Graph Based Approach for
Secured Container Orchestrator Configuration (SANER '22)
4. An empirical study of rule-based and learning-based
approaches for static application security testing (ESEM '21)
1. An empirical study of developers’ discussions about security challenges of different programming languages (EMSE '22)
2. Well begun is half done: an empirical study of exploitability & impact of base-image vulnerabilities (SANER '22)
3. PUMiner: Mining security posts from developer question and answer websites with PU learning (MSR '20)
4. A large-scale study of security vulnerability support on developer Q&A websites (EASE '21)
Software Vulnerability
Prediction
Software Vulnerability
Assessment &
Prioritisation
DATA
Software Vulnerability Knowledge Support
Data-Driven Software Security at CREST
Data Quality Challenges in MSR4S
Data Quality for Data-Driven Software Security
DATA
Challenges (What, Why, So What)
Recommendations (Dos & Donts)
Security Data Challenges
Scarcity (In)Accuracy (Ir)Relevance Redundancy (Mis)Coverage
Non-
Representative-
ness
Drift (In)Accessibility (Re)Use Maliciousness
Info
(What)
Cause
(Why)
Impact
(So What)
Scarcity (In)Accuracy (Ir)Relevance Redundancy (Mis)Coverage
Non-
Representative-
ness
Drift (In)Accessibility (Re)Use Maliciousness
Security Data Challenges
Data Scarcity
• Hundreds/Thousands security issues vs. Million images
• Security issues < 10% of all reported issues (even worse
for new projects)
Info
Causes
• Lack of explicit labeling/understanding of security issues
• Imperfect data collection (precision vs. recall vs. effort)
• Rare occurence of certain types of security issues
Impacts
• Leading to imbalanced data
• Lacking data to train high-performing ML models
• More data beats a cleverer algorithm
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, 350-361.
Data (In)Accuracy
• (Non-)Security issues not labelled as such
• FN: Critical vulnerabilities unfixed for long time (> 3 yrs)
• FP: Wasting inspection effort
Info
Causes
• We don't know what we don't know (unknown unknowns)
• Lack of reporting or silent patch of security issues
• Tangled changes (fixing non-sec. & sec. issues together)
Impacts
• Criticality: Systematic labeling errors > random errors
• Making ML models learn the wrong patterns
• Introducing backdoors of ML models
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Without latent SVs With latent SVs
Data (Ir)Relevance
• Not all input is useful for predicting security issues
• Ex1: Code comments for predicting vulnerabilities?!
• Ex2: A file containing fixed version update is vulnerable?!
Info
Causes
• Lack of data exploratory analysis
• Lack of domain expertise (e.g., NLPers working in SSE)
• Trying to beat that state-of-the-art
Impacts
• Negatively affecting the construct validity
• Leading to model overfitting
Security fixes (87c89f0) in theApache jspwiki project
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Data Redundancy
• Same security issues found across different software versions,
branches, & even projects
Info
Causes
• Cloned projects from mature projects (e.g., Linux kernel)
• Merged code from feature branches into the master branch
• Renamed files/functions with the same code content
• Cosmetic-only changes (different white spaces or new lines)
Impacts
• Limiting learning capabilities of ML models
• Leading to bias and overfitting for ML models
• Inflating model performance (same training/testing samples)
Thousands of cloned projects
sharing same vulns as the Linux kernel
Vulnerability prediction performance before & after
removing the redundancies in common datasets
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
R. Croft, et al., Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, 1-
12.
Data (Mis)Coverage
• Security issues spanning multiple lines, functions, files,
modules, versions, & even projects, but...
• Current approaches mostly focus on single function/file
Info
Causes
• Partial security fixes in multiple versions
• For ML: coverage vs. size (
↑
as granularity ↓
)
• For Devs: coverage vs. convenience (inspection effort)
• Fixed-size (truncated) input required for some (DL) models
Impacts
• Lacking context for training ML models to detect complex
(e.g., intra-function/file/module) security issues
• Incomplete data for ML as security-related info. truncated
User's malicious input?
How to know using only
the current function?
Deleted line
Added line
Func_a1 Func_b1
File b
Module A
File a
Func_a2 Func_b2
Func_b3
Assume that ModuleA
is vulnerable then:
• Vuln. module: 1
• Vuln. files: 2
• Vuln. functions: 5
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Cylinder vs. Circle vs. Rectangle?
Data (Non-)Representativeness
• Real-world security issues (e.g., NVD) vastly different from
synthetic ones (e.g., SARD)
• Varying characteristics of security issues across projects
Info
Causes
• Synthetic data: local & created by pre-defined rules
• Real-world data: inter-dependent & complex
• Different features & nature between apps
Impacts
• Perf. (F1) gap: Synthetic (0.85) vs. Real-world (0.15)
• Lack of generalisability & transferability of ML models
• Within-project prediction > Cross-project prediction
R. Croft, et al., IEEE Transactions on Software Engineering, 2022.
S. Chakraborty, et al., IEEE Transactions on Software Engineering, 2021.
Real-world
data
Synthetic
data
Data Drift
• Unending battle between attackers & defenders
• Evolving threat landscapes  Changing characteristics
of security issues over time
Info
Impacts
• Out-of-Vocabulary words  Degrading performance
• Data leakage  Unrealistic performance (up to ~5 times
overfitting) using non-temporal evaluation technique
R. Croft, et al., Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, 338-348.
T. H. M. Le, et al., Proceedings of the 16th International Conference on Mining Software Repositories, 2019, 371-382.
• New terms for emerging attacks, defenses, & issues
• Changing software features & implementation over time
Causes
New threats/affected products in NVD over time
Test
Non-temporal evaluation
 (Future) data leakage
Data (In)Accessibility
• Security data not always shared
• Shared yet incomplete data  Re-collected data may be
different from original data
Info
Causes
• Privacy concerns (e.g., commercial projects) or not?!
• Too large size for storage (artifact ID vs. artifact content)
• Data values can change over time (e.g., devs' experience)
Impacts
• Limited reproducibility of the existing results
• Limited generalisability of the ML models (e.g., open-
source vs. closed-source data)
"We have anonymously uploaded the database
to https://www.dropbox.com/s/anonymised_link
so the reviewers can access the raw data
during the review process. We will release the
data to the community together with the paper."
Click the link
R. Croft, et al., Empirical Software Engineering, 2022, 27: 1-52.
T. H. M. Le, et al., Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2021, 109-118.
Data (Re-)Use
• Finding a needle (security issue) in a haystack (raw data)
• And ... haystack can be huge
• Reuse existing data > Update/collect new data
Info
• Completely trusting/relying on existing datasets
• Unsuitable infrastructure to collect/store raw data
Causes
Impacts
• Making ML models become obsolete & less generalised
• Being unaware of the issues in the current data
 Error propagation in following studies
1,229 VCCs
(> 1TB)
200 real-world
Java projects
Clone
projects
~70k security posts
~20 mil posts (> 50 GB)
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories, 2020, 350-361.
Data Maliciousness
• Threat/security data is itself a threat (e.g., new vulns)
• Using/sharing threat data without precautions
Info
Causes
• Simply an oversight / afterthought
• Private experiment vs. Public disclosure (Big difference!)
• Open science vs. Responsible science
Impacts
• Violating ethics of the community
• Shared security data maybe exploited by attackers
• Affecting maintainers & users of target systems
The identified vulns later get exploited by attackers
They move on to submit & publish the paper
They don't report the vulnerabilities to project
maintainers to fix
They identify new vulnerabilities using the model
Researchers develop a SOTAML-based
vulnerability prediction model
An example of malicious/irresponsible data sharing
Some Recommendations
• Consideration of Label Noise in Negative Class
• Semi-Supervised, e.g., self-training, positive or Unlabeled training on unlabeled set
• Consideration of Timeliness
• Currently labeled data & more positive samples; Preserve data sequence for training
• Use of Data Visualization
• Try to achieve better data understandability for non data scientists
• Creation and Use of Diverse Language Datasets
• Bug seeding into semantically similar languages
• Use of Data Quality Assessment Criteria
• Determine and use specific data quality assessment approaches
• Better Data Sharing and Governance
• Provide exact details and processes of data preparation
Recommendations for Dealing with Data Quality Issues
Acknowledgements
• This talk is based on the research studies carried out by the CREST researchers,
particularly by Roland Croft, Triet Le, Yongzheng Xie, Mehdi Kholoosi.
• Triet Le led the efforts for developing the content for this presentation
• Discussions in the SSI cluster of the CREST provided insights included in this presentation
• Our research partially funded by the Cybersecurity CRC
• We are grateful to the students, RAs and Open Source data communities
CRICOS 00123M
Contact: Ali Babar
ali.babar@adelaide.edu.au

More Related Content

Similar to Mining Software Repositories for Security: Data Quality Issues Lessons from Trenches

Cyber security for system design
Cyber security for system designCyber security for system design
Cyber security for system designTom Kaczmarek
 
How to Get Started with DevSecOps
How to Get Started with DevSecOpsHow to Get Started with DevSecOps
How to Get Started with DevSecOpsCYBRIC
 
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...Asma Swapna
 
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017Amazon Web Services
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationRaffael Marty
 
Filling your AppSec Toolbox - Which Tools, When to Use Them, and Why
Filling your AppSec Toolbox - Which Tools, When to Use Them, and WhyFilling your AppSec Toolbox - Which Tools, When to Use Them, and Why
Filling your AppSec Toolbox - Which Tools, When to Use Them, and WhyBlack Duck by Synopsys
 
Applying Auto-Data Classification Techniques for Large Data Sets
Applying Auto-Data Classification Techniques for Large Data SetsApplying Auto-Data Classification Techniques for Large Data Sets
Applying Auto-Data Classification Techniques for Large Data SetsPriyanka Aash
 
Secure and Privacy-Preserving Big-Data Processing
Secure and Privacy-Preserving Big-Data ProcessingSecure and Privacy-Preserving Big-Data Processing
Secure and Privacy-Preserving Big-Data ProcessingShantanu Sharma
 
SCS DevSecOps Seminar - State of DevSecOps
SCS DevSecOps Seminar - State of DevSecOpsSCS DevSecOps Seminar - State of DevSecOps
SCS DevSecOps Seminar - State of DevSecOpsStefan Streichsbier
 
TechTalk 2021: Peran IT Security dalam Penerapan DevOps
TechTalk 2021: Peran IT Security dalam Penerapan DevOpsTechTalk 2021: Peran IT Security dalam Penerapan DevOps
TechTalk 2021: Peran IT Security dalam Penerapan DevOpsDicodingEvent
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecurityTao Xie
 
Cloud Security vs Security in the Cloud
Cloud Security vs Security in the CloudCloud Security vs Security in the Cloud
Cloud Security vs Security in the CloudTjylen Veselyj
 
[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principlesOWASP
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataIGEEKS TECHNOLOGIES
 
Software Security in the Real World
Software Security in the Real WorldSoftware Security in the Real World
Software Security in the Real WorldMark Curphey
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...CREST @ University of Adelaide
 

Similar to Mining Software Repositories for Security: Data Quality Issues Lessons from Trenches (20)

Cyber security for system design
Cyber security for system designCyber security for system design
Cyber security for system design
 
How to Get Started with DevSecOps
How to Get Started with DevSecOpsHow to Get Started with DevSecOps
How to Get Started with DevSecOps
 
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...
ICCES_2016_Security Analysis of Software Defined Wireless Network Monitoring ...
 
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017
Security and DevOps: Agility and Teamwork - SID315 - re:Invent 2017
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and Visualization
 
Filling your AppSec Toolbox - Which Tools, When to Use Them, and Why
Filling your AppSec Toolbox - Which Tools, When to Use Them, and WhyFilling your AppSec Toolbox - Which Tools, When to Use Them, and Why
Filling your AppSec Toolbox - Which Tools, When to Use Them, and Why
 
Applying Auto-Data Classification Techniques for Large Data Sets
Applying Auto-Data Classification Techniques for Large Data SetsApplying Auto-Data Classification Techniques for Large Data Sets
Applying Auto-Data Classification Techniques for Large Data Sets
 
Secure and Privacy-Preserving Big-Data Processing
Secure and Privacy-Preserving Big-Data ProcessingSecure and Privacy-Preserving Big-Data Processing
Secure and Privacy-Preserving Big-Data Processing
 
Security for developers
Security for developersSecurity for developers
Security for developers
 
SCS DevSecOps Seminar - State of DevSecOps
SCS DevSecOps Seminar - State of DevSecOpsSCS DevSecOps Seminar - State of DevSecOps
SCS DevSecOps Seminar - State of DevSecOps
 
DITEC JAN 31 2015 (PDF)
DITEC JAN 31 2015 (PDF)DITEC JAN 31 2015 (PDF)
DITEC JAN 31 2015 (PDF)
 
TechTalk 2021: Peran IT Security dalam Penerapan DevOps
TechTalk 2021: Peran IT Security dalam Penerapan DevOpsTechTalk 2021: Peran IT Security dalam Penerapan DevOps
TechTalk 2021: Peran IT Security dalam Penerapan DevOps
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and Security
 
Cloud Security vs Security in the Cloud
Cloud Security vs Security in the CloudCloud Security vs Security in the Cloud
Cloud Security vs Security in the Cloud
 
[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles
 
TRUSTSeminar.ppt
TRUSTSeminar.pptTRUSTSeminar.ppt
TRUSTSeminar.ppt
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud data
 
Software Security in the Real World
Software Security in the Real WorldSoftware Security in the Real World
Software Security in the Real World
 
Bluehat 2019 mlsec talk
Bluehat 2019 mlsec talkBluehat 2019 mlsec talk
Bluehat 2019 mlsec talk
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
 

More from CREST @ University of Adelaide

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...CREST @ University of Adelaide
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsCREST @ University of Adelaide
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingCREST @ University of Adelaide
 
A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...CREST @ University of Adelaide
 
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...CREST @ University of Adelaide
 
Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...CREST @ University of Adelaide
 
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...CREST @ University of Adelaide
 
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...CREST @ University of Adelaide
 
Detecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewDetecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewCREST @ University of Adelaide
 
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...CREST @ University of Adelaide
 
Energy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingEnergy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingCREST @ University of Adelaide
 

More from CREST @ University of Adelaide (19)

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
 
Making Software and Software Engineering visible
Making Software and Software Engineering visibleMaking Software and Software Engineering visible
Making Software and Software Engineering visible
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
 
A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...
 
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
 
Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...
 
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
 
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
 
Detecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewDetecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic Review
 
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
 
Data Quality for Software Vulnerability Dataset
Data Quality for Software Vulnerability DatasetData Quality for Software Vulnerability Dataset
Data Quality for Software Vulnerability Dataset
 
Mod2Dash Presentation
Mod2Dash PresentationMod2Dash Presentation
Mod2Dash Presentation
 
Run-time Patching and updating Impact Estimation
Run-time Patching and updating Impact EstimationRun-time Patching and updating Impact Estimation
Run-time Patching and updating Impact Estimation
 
ECSA 2023 Ubuntu Case Study
ECSA 2023 Ubuntu Case StudyECSA 2023 Ubuntu Case Study
ECSA 2023 Ubuntu Case Study
 
Energy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingEnergy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data Processing
 
Designing Quality-Driven Blockchain Networks
Designing Quality-Driven Blockchain NetworksDesigning Quality-Driven Blockchain Networks
Designing Quality-Driven Blockchain Networks
 
Privacy Engineering in the Wild
Privacy Engineering in the WildPrivacy Engineering in the Wild
Privacy Engineering in the Wild
 
CREST Overview
CREST OverviewCREST Overview
CREST Overview
 

Recently uploaded

Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 

Recently uploaded (20)

Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 

Mining Software Repositories for Security: Data Quality Issues Lessons from Trenches

  • 1. Mining Software Repositories for Security: Data Quality Issues Lessons from Trenches Ali Babar CREST – Centre for Research on Engineering Software Technologies MSR4PS Workshop, 18th November, 2022
  • 2. Brief Bio M. Ali Babar • Professor, School of Computer Science, University of Adelaide, Australia – Nov. 2013 - • Founding Lead – The Centre for Research on Software Technologies (CREST) – Nov 2013 – • Theme Lead – Platforms and Architectures for Security as Service, Cyber Security Cooperative Research Centre (CSCRC) • For current research areas: please visit CREST website: crest-centre.net Previous Work History • Senior research and academic positions in UK, Denmark & Ireland
  • 3. Trustworthy Digital Services Data Science Integration & Interoperability Autonomy Cyber Security Artificial intelligence Software System Engineering DevOps Peopl e, Pr oc es ses , and Tool s MDE Design Space Socio- T echnical Technol ogies IoT / CPS Cloudlet Cloud Blockchain Applications Domains Health Systems AgFood Systems Defence Systems
  • 4. • Software Security and MSR • Trends & Examples of Leveraging MSR for Security • Some of the Challenges of MSR Research • Data Centric Research of Our Group • Data Quality Problems Experienced/Observed • Some Recommendations to Deal with Challenges Talk’s Roadmap
  • 5. • Approximately 90% of cyber incidents are caused by the vulnerabilities rooted in software – proprietary or sourced • Software Bill Of Material (SBOM) is becoming ineffective in answering critical questions • Q1: Do we really know what’s in software coming into the organisation? • Q2: How do we establish trust and preserve the security of software coming into the organisation? Software Security and MSR
  • 7. Research Domain Data Sources Goals Potential Benefits Malware & Malicious Codes • Source code • Software Packages • Commits • Posts & Q&A &Discussion • Raising Awareness • Increasing Data availability & Quality • Detection vulnerabilities • Fix & Prevention • Automation of summarization • High accuracy • Intelligent tools • Timeliness • Low cost (Based on public available data source) Vulnerability & Bug • Source code • Software Packages • Commits • Issues & Bug Reports • Social media Posts & Discussion Software Supply Chain Security • Software Packages • Pre-trained models toolkit • Package Manager License Compliance • Source code • License Agreements Privacy • Source code • Tool Chains Secure Coding (Coding Standards and Best Practices) Detection • Source code • Commits Trends of MSR for Security & Privacy Research
  • 8. • Automated vulnerability assessment for third-party app/library patching (exploitability, impact, and severity based on CVSS (Common Vulnerability Scoring System)) using ML models • National Vulnerabilities Database (NVD) • Automated software vulnerability assessment with concept drift, MSR ‘19 • Unveiling developers' key challenges & solutions when addressing vulnerabilities in practice • Q&A websites (Stack Overflow & Security StackExchange • A large-scale study of security vulnerability support on developer Q&A websites, EASE '21 8 MSR4S - Single Source of Data
  • 9. • Predicting vulnerabilities in code functions and statements • LineVD: statement-level vulnerability detection using graph neural networks, MSR ‘22 • Comparison learning-based and rule-based techniques for vulnerability prediction • An empirical study of rule-based and learning-based approaches for static application security testing, ESEM '21 • NVD (Reported fixing commits) + GitHub (Source code of commits) 9 MSR4S – Multiple Sources of Data
  • 10. • Automated prior-fixing vulnerability assessment (vulnerability exploitability, impact, and severity based on CVSS) in code functions using ML models • DeepCVA: automated commit-level vulnerability assessment with deep multi-task learning, ASE '21 • Automated just-in-time vulnerability assessment (exploitability, impact, and severity based on CVSS) in code commits using ML models • On the use of fine-grained vulnerable code statements for software vulnerability assessment models, MSR ’22 • NVD (Reported fixing commits & CVSS) + GitHub (Source code of commits) 10 MSR4S – Multiple Sources of Data
  • 11. • Data Extraction Phase • Limited support for automated extraction • Keeping the extracted data up to data • Keeping the data scraper up to date • Data Pre-processing • Different version control systems (e.g., git, CVS, SVN) • Lack of standard in tools for bug reporting and data collection • Data quality • Data Integration Phase • Correlating data points across technical domains • Linking data from different resources (e.g GitHub data with stack overflow posts) Challenges in MSR Research
  • 12. Data Centric R&D for Software Security
  • 13. Data-Driven Software Security at CREST Software Vulnerability Prediction Software Vulnerability Assessment & Prioritisation Software Vulnerability Knowledge Support 1. A survey on data-driven software vulnerability assessment and prioritization (CSUR '22) 2. On the use of fine-grained vulnerable code statements for software vulnerability assessment models (MSR '22) 3. An investigation into inconsistency of software vulnerability severity across data sources (SANER '22) 4. DeepCVA: Automated commit-level vulnerability assessment with deep multi-task learning (ASE '21) 5. Automated software vulnerability assessment with concept drift (MSR '19) 1. LineVD: statement-level vulnerability detection using graph neural networks (MSR '22) 2. Noisy label learning for security defects (MSR '22) 3. KGSecConfig: A Knowledge Graph Based Approach for Secured Container Orchestrator Configuration (SANER '22) 4. An empirical study of rule-based and learning-based approaches for static application security testing (ESEM '21) 1. An empirical study of developers’ discussions about security challenges of different programming languages (EMSE '22) 2. Well begun is half done: an empirical study of exploitability & impact of base-image vulnerabilities (SANER '22) 3. PUMiner: Mining security posts from developer question and answer websites with PU learning (MSR '20) 4. A large-scale study of security vulnerability support on developer Q&A websites (EASE '21)
  • 14. Software Vulnerability Prediction Software Vulnerability Assessment & Prioritisation DATA Software Vulnerability Knowledge Support Data-Driven Software Security at CREST
  • 16. Data Quality for Data-Driven Software Security DATA Challenges (What, Why, So What) Recommendations (Dos & Donts)
  • 17. Security Data Challenges Scarcity (In)Accuracy (Ir)Relevance Redundancy (Mis)Coverage Non- Representative- ness Drift (In)Accessibility (Re)Use Maliciousness
  • 18. Info (What) Cause (Why) Impact (So What) Scarcity (In)Accuracy (Ir)Relevance Redundancy (Mis)Coverage Non- Representative- ness Drift (In)Accessibility (Re)Use Maliciousness Security Data Challenges
  • 19. Data Scarcity • Hundreds/Thousands security issues vs. Million images • Security issues < 10% of all reported issues (even worse for new projects) Info Causes • Lack of explicit labeling/understanding of security issues • Imperfect data collection (precision vs. recall vs. effort) • Rare occurence of certain types of security issues Impacts • Leading to imbalanced data • Lacking data to train high-performing ML models • More data beats a cleverer algorithm R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447. T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, 350-361.
  • 20. Data (In)Accuracy • (Non-)Security issues not labelled as such • FN: Critical vulnerabilities unfixed for long time (> 3 yrs) • FP: Wasting inspection effort Info Causes • We don't know what we don't know (unknown unknowns) • Lack of reporting or silent patch of security issues • Tangled changes (fixing non-sec. & sec. issues together) Impacts • Criticality: Systematic labeling errors > random errors • Making ML models learn the wrong patterns • Introducing backdoors of ML models R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. Without latent SVs With latent SVs
  • 21. Data (Ir)Relevance • Not all input is useful for predicting security issues • Ex1: Code comments for predicting vulnerabilities?! • Ex2: A file containing fixed version update is vulnerable?! Info Causes • Lack of data exploratory analysis • Lack of domain expertise (e.g., NLPers working in SSE) • Trying to beat that state-of-the-art Impacts • Negatively affecting the construct validity • Leading to model overfitting Security fixes (87c89f0) in theApache jspwiki project T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
  • 22. Data Redundancy • Same security issues found across different software versions, branches, & even projects Info Causes • Cloned projects from mature projects (e.g., Linux kernel) • Merged code from feature branches into the master branch • Renamed files/functions with the same code content • Cosmetic-only changes (different white spaces or new lines) Impacts • Limiting learning capabilities of ML models • Leading to bias and overfitting for ML models • Inflating model performance (same training/testing samples) Thousands of cloned projects sharing same vulns as the Linux kernel Vulnerability prediction performance before & after removing the redundancies in common datasets T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. R. Croft, et al., Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, 1- 12.
  • 23. Data (Mis)Coverage • Security issues spanning multiple lines, functions, files, modules, versions, & even projects, but... • Current approaches mostly focus on single function/file Info Causes • Partial security fixes in multiple versions • For ML: coverage vs. size ( ↑ as granularity ↓ ) • For Devs: coverage vs. convenience (inspection effort) • Fixed-size (truncated) input required for some (DL) models Impacts • Lacking context for training ML models to detect complex (e.g., intra-function/file/module) security issues • Incomplete data for ML as security-related info. truncated User's malicious input? How to know using only the current function? Deleted line Added line Func_a1 Func_b1 File b Module A File a Func_a2 Func_b2 Func_b3 Assume that ModuleA is vulnerable then: • Vuln. module: 1 • Vuln. files: 2 • Vuln. functions: 5 T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. Cylinder vs. Circle vs. Rectangle?
  • 24. Data (Non-)Representativeness • Real-world security issues (e.g., NVD) vastly different from synthetic ones (e.g., SARD) • Varying characteristics of security issues across projects Info Causes • Synthetic data: local & created by pre-defined rules • Real-world data: inter-dependent & complex • Different features & nature between apps Impacts • Perf. (F1) gap: Synthetic (0.85) vs. Real-world (0.15) • Lack of generalisability & transferability of ML models • Within-project prediction > Cross-project prediction R. Croft, et al., IEEE Transactions on Software Engineering, 2022. S. Chakraborty, et al., IEEE Transactions on Software Engineering, 2021. Real-world data Synthetic data
  • 25. Data Drift • Unending battle between attackers & defenders • Evolving threat landscapes  Changing characteristics of security issues over time Info Impacts • Out-of-Vocabulary words  Degrading performance • Data leakage  Unrealistic performance (up to ~5 times overfitting) using non-temporal evaluation technique R. Croft, et al., Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, 338-348. T. H. M. Le, et al., Proceedings of the 16th International Conference on Mining Software Repositories, 2019, 371-382. • New terms for emerging attacks, defenses, & issues • Changing software features & implementation over time Causes New threats/affected products in NVD over time Test Non-temporal evaluation  (Future) data leakage
  • 26. Data (In)Accessibility • Security data not always shared • Shared yet incomplete data  Re-collected data may be different from original data Info Causes • Privacy concerns (e.g., commercial projects) or not?! • Too large size for storage (artifact ID vs. artifact content) • Data values can change over time (e.g., devs' experience) Impacts • Limited reproducibility of the existing results • Limited generalisability of the ML models (e.g., open- source vs. closed-source data) "We have anonymously uploaded the database to https://www.dropbox.com/s/anonymised_link so the reviewers can access the raw data during the review process. We will release the data to the community together with the paper." Click the link R. Croft, et al., Empirical Software Engineering, 2022, 27: 1-52. T. H. M. Le, et al., Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2021, 109-118.
  • 27. Data (Re-)Use • Finding a needle (security issue) in a haystack (raw data) • And ... haystack can be huge • Reuse existing data > Update/collect new data Info • Completely trusting/relying on existing datasets • Unsuitable infrastructure to collect/store raw data Causes Impacts • Making ML models become obsolete & less generalised • Being unaware of the issues in the current data  Error propagation in following studies 1,229 VCCs (> 1TB) 200 real-world Java projects Clone projects ~70k security posts ~20 mil posts (> 50 GB) T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories, 2020, 350-361.
  • 28. Data Maliciousness • Threat/security data is itself a threat (e.g., new vulns) • Using/sharing threat data without precautions Info Causes • Simply an oversight / afterthought • Private experiment vs. Public disclosure (Big difference!) • Open science vs. Responsible science Impacts • Violating ethics of the community • Shared security data maybe exploited by attackers • Affecting maintainers & users of target systems The identified vulns later get exploited by attackers They move on to submit & publish the paper They don't report the vulnerabilities to project maintainers to fix They identify new vulnerabilities using the model Researchers develop a SOTAML-based vulnerability prediction model An example of malicious/irresponsible data sharing
  • 30.
  • 31. • Consideration of Label Noise in Negative Class • Semi-Supervised, e.g., self-training, positive or Unlabeled training on unlabeled set • Consideration of Timeliness • Currently labeled data & more positive samples; Preserve data sequence for training • Use of Data Visualization • Try to achieve better data understandability for non data scientists • Creation and Use of Diverse Language Datasets • Bug seeding into semantically similar languages • Use of Data Quality Assessment Criteria • Determine and use specific data quality assessment approaches • Better Data Sharing and Governance • Provide exact details and processes of data preparation Recommendations for Dealing with Data Quality Issues
  • 32. Acknowledgements • This talk is based on the research studies carried out by the CREST researchers, particularly by Roland Croft, Triet Le, Yongzheng Xie, Mehdi Kholoosi. • Triet Le led the efforts for developing the content for this presentation • Discussions in the SSI cluster of the CREST provided insights included in this presentation • Our research partially funded by the Cybersecurity CRC • We are grateful to the students, RAs and Open Source data communities
  • 33. CRICOS 00123M Contact: Ali Babar ali.babar@adelaide.edu.au