Building ML Based Software Security Solutions:
“Data Wrangling” Lessons and Recommendations
Ali Babar
CREST – Centre for Research on Engineering Software Technologies
USC, 9th October, 2023
• CREST’s Data Centric Software Security Research
• Data-Centric Software Security Quality Assurance
• Data Quality Problems Experienced/Observed
• Some Recommendations to Deal with Challenges
Outline
• Approximately 90% of cyber incidents are caused by vulnerabilities
rooted in software – proprietary or sourced
• A Software Bill of Materials (SBOM) is becoming ineffective in
answering critical questions
• Q1: Do we really know what’s in software coming into the organisation?
• Q2: How do we establish trust and preserve the security of software coming
into the organisation?
Software Vulnerabilities and Cybersecurity Incidents
Data-Driven Software Security at CREST
Software Vulnerability
Prediction
Software Vulnerability
Assessment &
Prioritisation
Software Vulnerability Knowledge Support
1. A survey on data-driven software vulnerability assessment and prioritization (CSUR '22)
2. On the use of fine-grained vulnerable code statements for software vulnerability
assessment models (MSR '22)
3. An investigation into inconsistency of software vulnerability severity across data sources
(SANER '22)
4. DeepCVA: Automated commit-level vulnerability assessment with deep multi-task
learning (ASE '21)
5. Automated software vulnerability assessment with concept drift (MSR '19)
1. LineVD: statement-level vulnerability detection using graph
neural networks (MSR '22)
2. Noisy label learning for security defects (MSR '22)
3. KGSecConfig: A Knowledge Graph Based Approach for
Secured Container Orchestrator Configuration (SANER '22)
4. An empirical study of rule-based and learning-based
approaches for static application security testing (ESEM '21)
1. An empirical study of developers’ discussions about security challenges of different programming languages (EMSE '22)
2. Well begun is half done: an empirical study of exploitability & impact of base-image vulnerabilities (SANER '22)
3. PUMiner: Mining security posts from developer question and answer websites with PU learning (MSR '20)
4. A large-scale study of security vulnerability support on developer Q&A websites (EASE '21)
No perfectly clean dataset of vulnerabilities:
• Label noise
• False positives of data collection
• Constantly discovered new vulnerabilities
• Data noise (e.g., duplicates)
Assumption vs. Reality
GOALS
Robust & noise-tolerant
learning techniques
Data Quality
Assessment & Analysis
Data Quality for Data-Driven Software Security
1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*)
2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*)
3. Noisy label learning for security defects, MSR '22 (CORE A)
4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
Data-Centric Software Security Assurance
• Data requirements determine the types and sources of data for building a model
• “Data Wrangling” covers the collection, labelling and cleaning steps of the ML workflow
• “Data Wrangling” (or preparation) can take up to 25% of an industry project's time
Data Preparation for ML Based Security Solutions
• Software Vulnerability Prediction (SVP) approaches
purport to learn from history and predict SVs
• Prediction approaches are becoming popular as early-
lifecycle software security assurance techniques
• SVP models may or may not analyse program syntax
and semantics; the latter leverages DL
• Being ML dependent, SVP needs data preparation as
per the ML workflow shown on the previous slide
• SVP data preparation involves several important
considerations, including source and labelling
Data-Centric Software Security Assurance
• Data requirements vary depending upon the context
and the capabilities needed of an ML model
• Data may be collected from real-world code, synthetic
code, or a mix of both for training/testing models
• Trade-off between scarcity and realism
• Gathered data need labelling – provided by
developers (e.g., NVD), by tools, or based on patterns
• Labelling the non-vulnerable class is problematic
• Data cleaning is required to achieve a certain format and
reduce noise in the collected/labelled data
Data Preparation Considerations for SVP
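Pattern-based labelling, one of the labelling sources mentioned above, can be sketched as a keyword match over commit messages. The patterns and example commits below are illustrative assumptions, not the rule set used in the cited studies; real labelling pipelines combine richer rules with manual validation.

```python
import re

# Illustrative keyword patterns for flagging likely security-fix commits
SECURITY_PATTERNS = re.compile(
    r"vulnerab|security|exploit|cve-\d{4}-\d+|overflow|xss|injection",
    re.IGNORECASE,
)

def label_commit(message: str) -> bool:
    """Return True if a commit message looks like a security fix."""
    return bool(SECURITY_PATTERNS.search(message))

commits = [
    "Fix buffer overflow in parser (CVE-2021-1234)",
    "Refactor logging module",
    "Sanitise user input to prevent XSS",
]
security_fixes = [m for m in commits if label_commit(m)]
```

Such heuristics trade precision for recall: broad patterns catch more fixes but also mislabel unrelated commits, which is exactly the label-noise problem discussed in the following slides.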
Data Quality Challenges in MLC
Software Vulnerability
Prediction
Software Vulnerability
Assessment &
Prioritisation
Software Vulnerability Knowledge Support
DATA
Data-Driven Software Security at CREST
Data Quality for Data-Driven Software Security
DATA
Challenges (What, Why, So What)
Recommendations (Dos & Don'ts)
Security Data Challenges
Info
(What)
Cause
(Why)
Impact
(So What)
Security Data Challenges
Data Scarcity
• Hundreds/thousands of security issues vs. millions of images
• Security issues < 10% of all reported issues (even worse
for new projects)
Info
Causes
• Lack of explicit labeling/understanding of security issues
• Imperfect data collection (precision vs. recall vs. effort)
• Rare occurrence of certain types of security issues
Impacts
• Leading to imbalanced data
• Lacking data to train high-performing ML models
• More data beats a cleverer algorithm
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, 350-361.
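A standard mitigation for the imbalance described above is to re-weight classes by inverse frequency during training. A minimal sketch, where the 5%-positive toy data is illustrative; the weighting formula matches the balancing heuristic used by common ML libraries:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 5% positives mimics security issues being < 10% of reported issues
labels = ["vuln"] * 5 + ["clean"] * 95
weights = inverse_frequency_weights(labels)
# The rare "vuln" class receives a much larger weight than "clean"
```

Feeding such weights into the loss function makes the model pay proportionally more attention to the scarce vulnerable class instead of defaulting to the majority prediction.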
Data (In)Accuracy
• (Non-)Security issues not labelled as such
• FN: Critical vulnerabilities unfixed for a long time (> 3 yrs)
• FP: Wasting inspection effort
Info
Causes
• We don't know what we don't know (unknown unknowns)
• Lack of reporting or silent patch of security issues
• Tangled changes (fixing non-sec. & sec. issues together)
Impacts
• Criticality: Systematic labeling errors > random errors
• Making ML models learn the wrong patterns
• Introducing backdoors into ML models
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Data (Ir)Relevance
• Not all input is useful for predicting security issues
• Ex1: Code comments for predicting vulnerabilities?!
• Ex2: A file containing fixed version update is vulnerable?!
Info
Causes
• Lack of data exploratory analysis
• Lack of domain expertise (e.g., NLPers working in SSE)
• Trying to beat the state of the art
Impacts
• Negatively affecting the construct validity
• Reducing model performance (e.g., code comments
reduced SVP performance by 7x in Python)
Security fixes (87c89f0) in the Apache jspwiki project
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
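Filtering out irrelevant input such as code comments (Ex1 above) can happen before feature extraction. A simplified sketch for C-style code; it is an assumption-laden toy that, among other limitations, does not handle comment markers inside string literals:

```python
import re

def strip_c_comments(code: str) -> str:
    """Remove //-line and /* */-block comments from C-like code."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return code

snippet = """
int main() {
    /* allocate buffer */
    char buf[8];  // fixed size
    return 0;
}
"""
clean = strip_c_comments(snippet)
```

Whether comments help or hurt depends on the task; the 7x performance drop for Python reported above is a reminder to verify such choices empirically rather than assume all input tokens are useful.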
Data Redundancy
• Same security issues found across different software versions,
branches, & even projects
Info
Causes
• Cloned projects from mature projects (e.g., Linux kernel)
• Merged code from feature branches into the master branch
• Renamed files/functions with the same code content
• Cosmetic-only changes (different white spaces or new lines)
Impacts
• Limiting learning capabilities of ML models
• Leading to bias and overfitting for ML models
• Inflating model performance (same training/testing samples)
Thousands of cloned projects
sharing same vulns as the Linux kernel
Vulnerability prediction performance before & after
removing the redundancies in common datasets
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
R. Croft, et al., Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, 1-12.
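Cosmetic-only duplicates, one of the causes listed above, can be caught by hashing a whitespace-normalised form of each sample. A minimal sketch; real deduplication pipelines also normalise identifiers and handle renamed files:

```python
import hashlib
import re

def code_fingerprint(code: str) -> str:
    """Hash code after collapsing whitespace, so cosmetic-only
    differences (spaces, new lines) map to the same fingerprint."""
    normalised = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalised.encode()).hexdigest()

def deduplicate(samples):
    seen, unique = set(), []
    for code in samples:
        fp = code_fingerprint(code)
        if fp not in seen:
            seen.add(fp)
            unique.append(code)
    return unique

samples = ["int x = 1;", "int  x = 1;\n", "int y = 2;"]
unique = deduplicate(samples)  # the whitespace-only variant is dropped
```

Running such a pass before splitting into training/testing sets prevents the inflated performance caused by the same sample appearing on both sides of the split.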
Data (Mis)Coverage
• Security issues spanning multiple lines, functions, files,
modules, versions, & even projects, but...
• Current approaches mostly focus on single function/file
Info
Causes
• Partial security fixes in multiple versions
• For ML: coverage vs. size (↑ as granularity ↓)
• For Devs: coverage vs. convenience (inspection effort)
• Fixed-size (truncated) input required for some (DL) models
Impacts
• Lacking context for training ML models to detect complex
(e.g., intra-function/file/module) security issues
• Incomplete data for ML as security-related info. truncated
[Figure: a code change (deleted/added lines) where the current function alone
cannot tell whether the input is a user's malicious input]
[Figure: Module A contains File a (Func_a1, Func_a2) and File b (Func_b1,
Func_b2, Func_b3); if Module A is vulnerable, then: 1 vuln. module,
2 vuln. files, 5 vuln. functions]
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Cylinder vs. Circle vs. Rectangle?
Data (Non-)Representativeness
• Real-world security issues (e.g., NVD) vastly different from
synthetic ones (e.g., SARD)
• Varying characteristics of security issues across projects
Info
Causes
• Synthetic data: local & created by pre-defined rules
• Real-world data: inter-dependent & complex
• Different features & nature between apps
Impacts
• Perf. (F1) gap: Synthetic (0.85) vs. Real-world (0.15)
• Lack of generalisability & transferability of ML models
• Within-project prediction > Cross-project prediction
R. Croft, et al., IEEE Transactions on Software Engineering, 2022.
S. Chakraborty, et al., IEEE Transactions on Software Engineering, 2021.
Data Drift
• Unending battle between attackers & defenders
• Evolving threat landscapes → changing characteristics
of security issues over time
Info
Causes
• New terms for emerging attacks, defenses, & issues
• Changing software features & implementation over time
Impacts
• Out-of-Vocabulary words → degrading performance
• Data leakage → unrealistic performance (up to ~5 times
overfitting) using non-temporal evaluation techniques
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
R. Croft, et al., Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, 338-348.
T. H. M. Le, et al., Proceedings of the 16th International Conference on Mining Software Repositories, 2019, 371-382.
New threats/affected products in NVD over time
Non-temporal evaluation → (future) data leakage
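Avoiding the future-data leakage of non-temporal evaluation means splitting chronologically rather than randomly: train on the past, test on the future. A minimal sketch over (date, record) pairs; the CVE identifiers and dates are illustrative:

```python
def temporal_split(samples, train_ratio=0.8):
    """Chronological train/test split: no future data leaks into training.
    Each sample is a (date_string, record) pair; ISO dates sort as strings."""
    ordered = sorted(samples, key=lambda s: s[0])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

reports = [
    ("2021-03-01", "CVE-A"), ("2019-07-15", "CVE-B"),
    ("2020-01-20", "CVE-C"), ("2022-11-05", "CVE-D"),
    ("2018-05-02", "CVE-E"),
]
train, test = temporal_split(reports)
# Every training report predates every test report
```

Performance measured this way is usually lower than with a random split, but it is the number that reflects how the model would actually behave on future, unseen vulnerabilities.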
Data (In)Accessibility
• Security data not always shared
• Shared yet incomplete data → re-collected data may be
different from original data
Info
Causes
• Privacy concerns (e.g., commercial projects) or not?!
• Too large size for storage (artifact ID vs. artifact content)
• Data values can change over time (e.g., devs' experience)
Impacts
• Limited reproducibility of the existing results
• Limited generalisability of the ML models (e.g., open-
source vs. closed-source data)
"We have anonymously uploaded the database
to https://www.dropbox.com/s/anonymised_link
so the reviewers can access the raw data
during the review process. We will release the
data to the community together with the paper."
Click the link
R. Croft, et al., Empirical Software Engineering, 2022, 27: 1-52.
T. H. M. Le, et al., Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2021, 109-118.
Data (Re-)Use
• Finding a needle (security issue) in a haystack (raw data)
• And ... haystack can be huge
• Reuse existing data > Update/collect new data
Info
Causes
• Completely trusting/relying on existing datasets
• Unsuitable infrastructure to collect/store raw data
Impacts
• Making ML models obsolete & less generalisable
• Being unaware of the issues in the current data
→ error propagation in following studies
Examples: 1,229 VCCs (> 1 TB) from 200 real-world Java projects and their
cloned projects; ~70k security posts mined from ~20 million posts (> 50 GB)
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories, 2020, 350-361.
Data Maliciousness
• Threat/security data is itself a threat (e.g., new vulns)
• Using/sharing threat data without precautions
Info
Causes
• Simply an oversight / afterthought
• Private experiment vs. Public disclosure (Big difference!)
• Open science vs. Responsible science
Impacts
• Violating ethics of the community
• Shared security data may be exploited by attackers
• Affecting maintainers & users of target systems
An example of malicious/irresponsible data sharing:
1. Researchers develop a SOTA ML-based vulnerability prediction model
2. They identify new vulnerabilities using the model
3. They don't report the vulnerabilities to project maintainers to fix
4. They move on to submit & publish the paper
5. The identified vulns later get exploited by attackers
Dataset Accuracy Uniqueness Consistency Completeness Currentness
Big-Vul 0.543 0.830 0.999 0.824 0.761
Devign 0.800 0.899 0.991 0.944 0.811
D2A 0.286 0.021 0.531 0.981 0.844
Significantly reduce SVP performance, e.g., data accuracy (30 – 80% ↓ in MCC)
Automatic data cleaning: ↑ performance by ~20%, but still far from perfect
Prevalent noise in current real-world vulnerability datasets
Data Quality for Data-Driven Software Security
Some Recommendations
• Identification of Missing Vulnerability Data
• Automatic labeling of silent fixes & latent vulnerabilities (beware of false positives)
• Consideration of Label Noise
• Noisy Label Learning and/or Semi-Supervised Learning (small clean data & large unlabelled data)
• Consideration of Timeliness
• Currently labeled data & more positive samples; Preserve data sequence for training
• Use of Data Visualization
• Try to achieve better data understandability for non-data-scientists
• Creation and Use of Diverse Language Datasets
• Bug seeding into semantically similar languages
• Use of Data Quality Assessment Criteria
• Determine and use specific data quality assessment approaches
• Better Data Sharing and Governance
• Provide exact details and processes of data preparation
Recommendations for Dealing with Data Quality Issues
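The recommendation to use data quality assessment criteria can start from simple automated checks. The metric definitions below are illustrative simplifications of criteria like the uniqueness and completeness scores reported for Big-Vul, Devign, and D2A earlier, not the exact measures from the cited papers:

```python
def uniqueness(samples):
    """Fraction of samples that are not duplicates."""
    return len(set(samples)) / len(samples)

def completeness(records, required_fields):
    """Fraction of records with every required field filled.
    Treats falsy values as missing, which is fine for this sketch."""
    ok = sum(1 for r in records if all(r.get(f) for f in required_fields))
    return ok / len(records)

codes = ["int a;", "int a;", "int b;", "int c;"]
rows = [{"code": "int a;", "label": 1}, {"code": "int b;", "label": None}]
u = uniqueness(codes)                      # 3 unique of 4 -> 0.75
c = completeness(rows, ["code", "label"])  # 1 of 2 complete -> 0.5
```

Reporting such scores alongside a released dataset gives downstream users a concrete basis for deciding whether cleaning is needed before training.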
Acknowledgements
• This talk is based on the research studies carried out by the CREST researchers,
particularly by Roland Croft, Triet Le, Yongzheng Xie, Mehdi Kholoosi.
• Triet Le led the efforts for developing the content for this presentation
• Discussions in the SSI cluster of the CREST provided insights included in this presentation
• Our research is partially funded by the Cybersecurity CRC
• We are grateful to the students, RAs and Open Source data communities
CRICOS 00123M
Contact: Ali Babar
ali.babar@adelaide.edu.au

More Related Content

Similar to Security Data Quality Challenges

Software Security Initiatives
Software Security InitiativesSoftware Security Initiatives
Software Security InitiativesMarco Morana
 
Trends in Embedded Software Engineering
Trends in Embedded Software EngineeringTrends in Embedded Software Engineering
Trends in Embedded Software EngineeringAditya Kamble
 
chap-1 : Vulnerabilities in Information Systems
chap-1 : Vulnerabilities in Information Systemschap-1 : Vulnerabilities in Information Systems
chap-1 : Vulnerabilities in Information SystemsKashfUlHuda1
 
WIRELESS COMPUTING AND IT ECOSYSTEMS
WIRELESS COMPUTING AND IT ECOSYSTEMSWIRELESS COMPUTING AND IT ECOSYSTEMS
WIRELESS COMPUTING AND IT ECOSYSTEMScscpconf
 
[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principlesOWASP
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataIGEEKS TECHNOLOGIES
 
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...Agile Testing Alliance
 
Running Head 2Week #8 MidTerm Assignment .docx
Running Head    2Week #8 MidTerm Assignment               .docxRunning Head    2Week #8 MidTerm Assignment               .docx
Running Head 2Week #8 MidTerm Assignment .docxhealdkathaleen
 
Solnet dev secops meetup
Solnet dev secops meetupSolnet dev secops meetup
Solnet dev secops meetuppbink
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Standardizing Source Code Security Audits
Standardizing Source Code Security AuditsStandardizing Source Code Security Audits
Standardizing Source Code Security Auditsijseajournal
 
Security Patterns: Research Direction, Metamodel, Application and Verification
Security Patterns: Research Direction, Metamodel, Application and VerificationSecurity Patterns: Research Direction, Metamodel, Application and Verification
Security Patterns: Research Direction, Metamodel, Application and VerificationHironori Washizaki
 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docpraveena06
 
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining TechniquesImprovement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining Techniquesijdmtaiir
 
AMI Security 101 - Smart Grid Security East 2011
AMI Security 101 - Smart Grid Security East 2011AMI Security 101 - Smart Grid Security East 2011
AMI Security 101 - Smart Grid Security East 2011dma1965
 
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...Deborah Schalm
 
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...DevOps.com
 

Similar to Security Data Quality Challenges (20)

Cm24585587
Cm24585587Cm24585587
Cm24585587
 
Software Security Initiatives
Software Security InitiativesSoftware Security Initiatives
Software Security Initiatives
 
Trends in Embedded Software Engineering
Trends in Embedded Software EngineeringTrends in Embedded Software Engineering
Trends in Embedded Software Engineering
 
chap-1 : Vulnerabilities in Information Systems
chap-1 : Vulnerabilities in Information Systemschap-1 : Vulnerabilities in Information Systems
chap-1 : Vulnerabilities in Information Systems
 
WIRELESS COMPUTING AND IT ECOSYSTEMS
WIRELESS COMPUTING AND IT ECOSYSTEMSWIRELESS COMPUTING AND IT ECOSYSTEMS
WIRELESS COMPUTING AND IT ECOSYSTEMS
 
[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles[Warsaw 26.06.2018] SDL Threat Modeling principles
[Warsaw 26.06.2018] SDL Threat Modeling principles
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud data
 
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...
#Interactive Session by Vivek Patle and Jahnavi Umarji, "Empowering Functiona...
 
Running Head 2Week #8 MidTerm Assignment .docx
Running Head    2Week #8 MidTerm Assignment               .docxRunning Head    2Week #8 MidTerm Assignment               .docx
Running Head 2Week #8 MidTerm Assignment .docx
 
Solnet dev secops meetup
Solnet dev secops meetupSolnet dev secops meetup
Solnet dev secops meetup
 
Se research update
Se research updateSe research update
Se research update
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Standardizing Source Code Security Audits
Standardizing Source Code Security AuditsStandardizing Source Code Security Audits
Standardizing Source Code Security Audits
 
Security Patterns: Research Direction, Metamodel, Application and Verification
Security Patterns: Research Direction, Metamodel, Application and VerificationSecurity Patterns: Research Direction, Metamodel, Application and Verification
Security Patterns: Research Direction, Metamodel, Application and Verification
 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.doc
 
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining TechniquesImprovement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining Techniques
 
Only Abstract
Only AbstractOnly Abstract
Only Abstract
 
AMI Security 101 - Smart Grid Security East 2011
AMI Security 101 - Smart Grid Security East 2011AMI Security 101 - Smart Grid Security East 2011
AMI Security 101 - Smart Grid Security East 2011
 
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
 
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
Leverage DevOps & Agile Development to Transform Your Application Testing Pro...
 

More from CREST @ University of Adelaide

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...CREST @ University of Adelaide
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsCREST @ University of Adelaide
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...CREST @ University of Adelaide
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingCREST @ University of Adelaide
 
A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...A Decentralised Platform for Provenance Management of Machine Learning Softwa...
A Decentralised Platform for Provenance Management of Machine Learning Softwa...CREST @ University of Adelaide
 
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...
Privacy Engineering: Enabling Mobility of Mental Health Services with Data Pr...CREST @ University of Adelaide
 
Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...Falling for Phishing: An Empirical Investigation into People's Email Response...
Falling for Phishing: An Empirical Investigation into People's Email Response...CREST @ University of Adelaide
 
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...
An Experience Report on the Design and Implementation of an Ad-hoc Blockchain...CREST @ University of Adelaide
 
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...
Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Compar...CREST @ University of Adelaide
 
Detecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewDetecting Misuses of Security APIs: A Systematic Review
Detecting Misuses of Security APIs: A Systematic ReviewCREST @ University of Adelaide
 
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...
Chen_Reading Strategies for Graph Visualizations that Wrap Around in Torus To...CREST @ University of Adelaide
 
Energy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingEnergy Efficiency Evaluation of Local and Offloaded Data Processing
Energy Efficiency Evaluation of Local and Offloaded Data ProcessingCREST @ University of Adelaide
 

More from CREST @ University of Adelaide (20)

Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
Mobile Devices: Systemisation of Knowledge about Privacy Invasion Tactics and...
 
Making Software and Software Engineering visible
Making Software and Software Engineering visibleMaking Software and Software Engineering visible
Making Software and Software Engineering visible
 
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based SystemsUnderstanding and Addressing Architectural Challenges of Cloud- Based Systems
Understanding and Addressing Architectural Challenges of Cloud- Based Systems
 
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
DevSecOps: Continuous Engineering with Security by Design: Challenges and Sol...
 
A Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
A Deep Dive into the Socio-Technical Aspects of Delays in Security PatchingA Deep Dive into the Socio-Technical Aspects of Delays in Security Patching
Security Data Quality Challenges

  • 1. Building ML Based Software Security Solutions: “Data Wrangling” Lessons and Recommendations Ali Babar CREST – Centre for Research on Engineering Software Technologies USC, 9th October, 2023
  • 2. • CREST’s Data Centric Software Security Research • Data-Centric Software Security Quality Assurance • Data Quality Problems Experienced/Observed • Some Recommendations to Deal with Challenges Outline
  • 3.
  • 4.
  • 5. • Approximately 90% of cyber incidents are caused by the vulnerabilities rooted in software – proprietary or sourced • Software Bill of Materials (SBOM) is becoming ineffective in answering critical questions • Q1: Do we really know what’s in software coming into the organisation? • Q2: How do we establish trust and preserve the security of software coming into the organisation? Software Vulnerabilities and Cybersecurity Incidents
  • 6.
  • 7. Data-Driven Software Security at CREST Software Vulnerability Prediction Software Vulnerability Assessment & Prioritisation Software Vulnerability Knowledge Support 1. A survey on data-driven software vulnerability assessment and prioritization (CSUR '22) 2. On the use of fine-grained vulnerable code statements for software vulnerability assessment models (MSR '22) 3. An investigation into inconsistency of software vulnerability severity across data sources (SANER '22) 4. DeepCVA: Automated commit-level vulnerability assessment with deep multi-task learning (ASE '21) 5. Automated software vulnerability assessment with concept drift (MSR '19) 1. LineVD: statement-level vulnerability detection using graph neural networks (MSR '22) 2. Noisy label learning for security defects (MSR '22) 3. KGSecConfig: A Knowledge Graph Based Approach for Secured Container Orchestrator Configuration (SANER '22) 4. An empirical study of rule-based and learning-based approaches for static application security testing (ESEM '21) 1. An empirical study of developers’ discussions about security challenges of different programming languages (EMSE '22) 2. Well begun is half done: an empirical study of exploitability & impact of base-image vulnerabilities (SANER '22) 3. PUMiner: Mining security posts from developer question and answer websites with PU learning (MSR '20) 4. A large-scale study of security vulnerability support on developer Q&A websites (EASE '21)
  • 8. No perfectly clean dataset of vulnerabilities: • Label noise • False positives of data collection • Constantly discovered new vulnerabilities • Data noise (e.g., duplicates) Assumption Reality GOALS Robust & noise-tolerant learning techniques Data Quality Assessment & Analysis Data Quality for Data-Driven Software Security 1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*) 2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*) 3. Noisy label learning for security defects, MSR '22 (CORE A) 4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
  • 10. • Data requirements determine the types and sources of data for building a model • “Data Wrangling” (collection, labelling and cleaning) steps of the ML workflow • “Data Wrangling” (or preparation) can take up to 25% of an industry project's time Data Preparation for ML Based Security Solutions 1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*) 2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*) 3. Noisy label learning for security defects, MSR '22 (CORE A) 4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
  • 11. • Software Vulnerability Prediction (SVP) approaches purport to learn from history and predict SVs • Prediction approaches are becoming popular as early-lifecycle software security assurance techniques • SVP models may or may not analyse program syntax and semantics; the latter leverages DL • Being ML-dependent, SVP needs data preparation as per the ML workflow shown on the previous slide • SVP data preparation needs several important considerations, including data source and labelling Data-Centric Software Security Assurance 1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*) 2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*) 3. Noisy label learning for security defects, MSR '22 (CORE A) 4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
  • 12. • Data requirements vary depending upon the context and the capabilities needed of an ML model • Data may be collected from real-world code, synthetic code or mixed code for training/testing the model • Trade-off between scarcity and realism • Gathered data need labelling – provided by developers (NVD), tool-based, or based on patterns • Labelling the non-vulnerable class is problematic • Data cleaning is required for a certain format and for reducing noise from collected/labelled data Data Preparation Considerations for SVP
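The pattern-based labelling option mentioned on this slide can be illustrated with a minimal Python sketch. The keyword list and function names are illustrative assumptions; real pipelines cross-reference advisories (e.g., NVD links) rather than relying on commit-message keywords alone, precisely because such heuristics introduce label noise.

```python
import re

# Assumed keyword heuristic for flagging candidate vulnerability-fixing
# commits; the \w* suffix lets "vulnerab" match "vulnerability",
# "vulnerable", etc.
SECURITY_PATTERNS = re.compile(
    r"\b(cve-\d{4}-\d+|vulnerab\w*|overflow|xss|sql injection)\b",
    re.IGNORECASE,
)

def label_commit(message: str) -> str:
    """Label a commit message as a candidate vulnerability fix or 'other'."""
    return "vuln_fix" if SECURITY_PATTERNS.search(message) else "other"

commits = [
    "Fix CVE-2021-44228: JNDI lookup in log message",
    "Refactor logging configuration",
    "Prevent buffer overflow in packet parser",
]
labels = [label_commit(m) for m in commits]
```

Pre-fix versions of the code changed by a `vuln_fix` commit would then be labelled vulnerable; the `other` class remains problematic, as the slide notes, because unflagged code is not necessarily non-vulnerable.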
  • 14. Software Vulnerability Prediction Software Vulnerability Assessment & Prioritisation Software Vulnerability Knowledge Support DATA Data-Driven Software Security at CREST
  • 15. Data Quality for Data-Driven Software Security DATA Challenges (What, Why, So What) Recommendations (Dos & Don'ts)
  • 18. Data Scarcity • Hundreds/thousands of security issues vs. millions of images • Security issues < 10% of all reported issues (even worse for new projects) Info Causes • Lack of explicit labeling/understanding of security issues • Imperfect data collection (precision vs. recall vs. effort) • Rare occurrence of certain types of security issues Impacts • Leading to imbalanced data • Lacking data to train high-performing ML models • More data beats a cleverer algorithm R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447. T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, 350-361.
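A common mitigation for the class imbalance this slide describes is to re-weight the training loss by inverse class frequency. The pure-Python sketch below mirrors the "balanced" weighting heuristic found in common ML libraries; the function and variable names are illustrative.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so rare
    security-positive samples contribute more to the loss."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# 9:1 imbalance, roughly what the slide reports (security issues < 10%).
labels = ["clean"] * 9 + ["vulnerable"] * 1
weights = inverse_frequency_weights(labels)
```

With this scheme the lone vulnerable sample is weighted 5.0 versus about 0.56 for each clean sample, so a model cannot minimise its loss simply by always predicting "clean".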
  • 19. Data (In)Accuracy • (Non-)Security issues not labelled as such • FN: Critical vulnerabilities unfixed for a long time (> 3 yrs) • FP: Wasting inspection effort Info Causes • We don't know what we don't know (unknown unknowns) • Lack of reporting or silent patching of security issues • Tangled changes (fixing non-sec. & sec. issues together) Impacts • Criticality: Systematic labeling errors > random errors • Making ML models learn the wrong patterns • Introducing backdoors into ML models R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133. R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. Without latent SVs With latent SVs
  • 20. Data (Ir)Relevance • Not all input is useful for predicting security issues • Ex1: Code comments for predicting vulnerabilities?! • Ex2: A file containing a fixed version update is vulnerable?! Info Causes • Lack of exploratory data analysis • Lack of domain expertise (e.g., NLPers working in SSE) • Trying to beat the state-of-the-art Impacts • Negatively affecting the construct validity • Reducing model performance (e.g., code comments reduced SVP performance by 7x in Python) Security fixes (87c89f0) in the Apache jspwiki project T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
  • 21. Data Redundancy • Same security issues found across different software versions, branches, & even projects Info Causes • Cloned projects from mature projects (e.g., Linux kernel) • Merged code from feature branches into the master branch • Renamed files/functions with the same code content • Cosmetic-only changes (different white spaces or new lines) Impacts • Limiting learning capabilities of ML models • Leading to bias and overfitting for ML models • Inflating model performance (same training/testing samples) Thousands of cloned projects sharing same vulns as the Linux kernel Vulnerability prediction performance before & after removing the redundancies in common datasets R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133. T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. R. Croft, et al., Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, 1-12.
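One way to catch the cosmetic-only duplicates this slide mentions (whitespace and newline differences) is to fingerprint each code sample after stripping whitespace and keep one sample per fingerprint. This is a deliberately crude sketch: real de-duplication would tokenise the code rather than deleting all whitespace, which can conflate distinct tokens.

```python
import hashlib
import re

def normalised_fingerprint(code: str) -> str:
    """Hash code after removing all whitespace so cosmetic-only variants
    (reformatting, extra blank lines) map to the same fingerprint."""
    canonical = re.sub(r"\s+", "", code)
    return hashlib.sha256(canonical.encode()).hexdigest()

def deduplicate(samples):
    """Keep only the first occurrence of each fingerprint."""
    seen, unique = set(), []
    for code in samples:
        fp = normalised_fingerprint(code)
        if fp not in seen:
            seen.add(fp)
            unique.append(code)
    return unique

samples = [
    "int f(int x){return x+1;}",
    "int f(int x){\n    return x+1;\n}",  # cosmetic-only duplicate
    "int g(int x){return x-1;}",
]
unique = deduplicate(samples)
```

Running de-duplication before the train/test split matters most: it is the shared samples across splits that inflate reported performance.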
  • 22. Data (Mis)Coverage • Security issues spanning multiple lines, functions, files, modules, versions, & even projects, but... • Current approaches mostly focus on single function/file Info Causes • Partial security fixes in multiple versions • For ML: coverage vs. size (↑ as granularity ↓) • For Devs: coverage vs. convenience (inspection effort) • Fixed-size (truncated) input required for some (DL) models Impacts • Lacking context for training ML models to detect complex (e.g., intra-function/file/module) security issues • Incomplete data for ML as security-related info. truncated User's malicious input? How to know using only the current function? Deleted line Added line Func_a1 Func_b1 File b Module A File a Func_a2 Func_b2 Func_b3 Assume that Module A is vulnerable then: • Vuln. module: 1 • Vuln. files: 2 • Vuln. functions: 5 T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633. T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. Cylinder vs. Circle vs. Rectangle?
  • 23. Data (Non-)Representativeness • Real-world security issues (e.g., NVD) vastly different from synthetic ones (e.g., SARD) • Varying characteristics of security issues across projects Info Causes • Synthetic data: local & created by pre-defined rules • Real-world data: inter-dependent & complex • Different features & nature between apps Impacts • Perf. (F1) gap: Synthetic (0.85) vs. Real-world (0.15) • Lack of generalisability & transferability of ML models • Within-project prediction > Cross-project prediction R. Croft, et al., IEEE Transactions on Software Engineering, 2022. S. Chakraborty, et al., IEEE Transactions on Software Engineering, 2021. Synthetic data Real-world data
  • 24. Data Drift • Unending battle between attackers & defenders • Evolving threat landscapes → Changing characteristics of security issues over time Info Causes • New terms for emerging attacks, defenses, & issues • Changing software features & implementation over time Impacts • Out-of-Vocabulary words → Degrading performance • Data leakage → Unrealistic performance (up to ~5 times overfitting) using non-temporal evaluation techniques R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133. R. Croft, et al., Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, 338-348. T. H. M. Le, et al., Proceedings of the 16th International Conference on Mining Software Repositories, 2019, 371-382. New threats/affected products in NVD over time Test Non-temporal evaluation → (Future) data leakage
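The future-data leakage described here is avoided by splitting on report time instead of at random: train on older vulnerability reports, test on newer ones. A minimal sketch follows; the `reported` field and CVE ids are made up for illustration.

```python
from datetime import date

def temporal_split(samples, train_frac=0.8):
    """Time-ordered split so no future information leaks into training."""
    ordered = sorted(samples, key=lambda s: s["reported"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

samples = [
    {"id": "CVE-A", "reported": date(2019, 5, 1)},
    {"id": "CVE-B", "reported": date(2021, 2, 3)},
    {"id": "CVE-C", "reported": date(2018, 7, 9)},
    {"id": "CVE-D", "reported": date(2022, 11, 30)},
    {"id": "CVE-E", "reported": date(2020, 1, 15)},
]
train, test = temporal_split(samples)
```

Every training sample predates every test sample, which is what a deployed SVP model actually faces; a random split would let 2022 vocabulary leak into a model evaluated on 2019 reports.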
  • 25. Data (In)Accessibility • Security data not always shared • Shared yet incomplete data → Re-collected data may be different from original data Info Causes • Privacy concerns (e.g., commercial projects) or not?! • Too large size for storage (artifact ID vs. artifact content) • Data values can change over time (e.g., devs' experience) Impacts • Limited reproducibility of the existing results • Limited generalisability of the ML models (e.g., open-source vs. closed-source data) "We have anonymously uploaded the database to https://www.dropbox.com/s/anonymised_link so the reviewers can access the raw data during the review process. We will release the data to the community together with the paper." Click the link R. Croft, et al., Empirical Software Engineering, 2022, 27: 1-52. T. H. M. Le, et al., Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2021, 109-118.
  • 26. Data (Re-)Use • Finding a needle (security issue) in a haystack (raw data) • And ... haystack can be huge • Reuse existing data > Update/collect new data Info Causes • Completely trusting/relying on existing datasets • Unsuitable infrastructure to collect/store raw data Impacts • Making ML models become obsolete & less generalised • Being unaware of the issues in the current data → Error propagation in following studies 1,229 VCCs (> 1TB) 200 real-world Java projects Clone projects ~70k security posts ~20 mil posts (> 50 GB) T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729. T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories, 2020, 350-361.
  • 27. Data Maliciousness • Threat/security data is itself a threat (e.g., new vulns) • Using/sharing threat data without precautions Info Causes • Simply an oversight / afterthought • Private experiment vs. Public disclosure (Big difference!) • Open science vs. Responsible science Impacts • Violating ethics of the community • Shared security data may be exploited by attackers • Affecting maintainers & users of target systems An example of malicious/irresponsible data sharing: Researchers develop a SOTA ML-based vulnerability prediction model → They identify new vulnerabilities using the model → They don't report the vulnerabilities to project maintainers to fix → They move on to submit & publish the paper → The identified vulns later get exploited by attackers
  • 28. Prevalent noise in current real-world vulnerability datasets:
Dataset | Accuracy | Uniqueness | Consistency | Completeness | Currentness
Big-Vul | 0.543 | 0.830 | 0.999 | 0.824 | 0.761
Devign | 0.800 | 0.899 | 0.991 | 0.944 | 0.811
D2A | 0.286 | 0.021 | 0.531 | 0.981 | 0.844
Noise significantly reduces SVP performance, e.g., low data accuracy (30–80% ↓ in MCC). Automatic data cleaning: ↑ performance by ~20%, but still far from perfect. 1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*) 2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*) 3. Noisy label learning for security defects, MSR '22 (CORE A) 4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A) Data Quality for Data-Driven Software Security
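As an illustration of how such quality attributes can be computed, uniqueness is simply the share of non-duplicate samples; D2A's score of 0.021 on that attribute means nearly all of its samples are duplicates. This is a toy sketch only; the underlying ICSE '23 measurements use more elaborate duplicate detection than exact matching.

```python
def uniqueness(samples):
    """Share of samples that are not exact duplicates of another sample."""
    return len(set(samples)) / len(samples)

# Toy corpus: "func_a" appears twice, so only 3 of the 4 samples are unique.
corpus = ["func_a", "func_b", "func_a", "func_c"]
score = uniqueness(corpus)
```

A score near 1.0 (like Big-Vul's consistency of 0.999) indicates little duplication, while a score near 0 flags a dataset that will badly inflate evaluation results.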
  • 30. • Identification of Missing Vulnerability Data • Automatic labeling of silent fixes & latent vulnerabilities (beware of false positives) • Consideration of Label Noise • Noisy Label Learning and/or Semi-Supervised Learning (small clean data & large unlabelled data) • Consideration of Timeliness • Currently labeled data & more positive samples; Preserve data sequence for training • Use of Data Visualization • Try to achieve better data understandability for non data scientists • Creation and Use of Diverse Language Datasets • Bug seeding into semantically similar languages • Use of Data Quality Assessment Criteria • Determine and use specific data quality assessment approaches • Better Data Sharing and Governance • Provide exact details and processes of data preparation Recommendations for Dealing with Data Quality Issues
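The "Consideration of Label Noise" recommendation above can be sketched as a simplified confident-learning-style filter: flag samples where (cross-validated) model predictions strongly contradict the assigned label, then send those samples for manual review. The field names (`prob_vuln`, `label`) and the 0.9 threshold are illustrative assumptions.

```python
def flag_noisy_labels(samples, threshold=0.9):
    """Return ids of samples whose label conflicts with a confident
    model prediction; 'prob_vuln' would come from out-of-fold
    cross-validated predictions in practice."""
    flagged = []
    for s in samples:
        confident_vuln = s["prob_vuln"] >= threshold
        confident_clean = s["prob_vuln"] <= 1 - threshold
        if (confident_vuln and s["label"] == 0) or (
            confident_clean and s["label"] == 1
        ):
            flagged.append(s["id"])
    return flagged

samples = [
    {"id": 1, "prob_vuln": 0.97, "label": 0},  # likely missed vulnerability
    {"id": 2, "prob_vuln": 0.05, "label": 1},  # likely false positive label
    {"id": 3, "prob_vuln": 0.60, "label": 1},  # not confident enough to flag
]
flagged = flag_noisy_labels(samples)
```

Flagging rather than auto-relabelling keeps a human in the loop, which matters given the false-positive risk the recommendation itself warns about.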
  • 31. Acknowledgements • This talk is based on the research studies carried out by the CREST researchers, particularly by Roland Croft, Triet Le, Yongzheng Xie, Mehdi Kholoosi. • Triet Le led the efforts for developing the content for this presentation • Discussions in the SSI cluster of the CREST provided insights included in this presentation • Our research partially funded by the Cybersecurity CRC • We are grateful to the students, RAs and Open Source data communities
  • 32. CRICOS 00123M Contact: Ali Babar ali.babar@adelaide.edu.au

Editor's Notes

  1. Our R&D is focused on addressing the most pressing challenges of developing and evaluating new knowledge and tools for engineering software intensive systems with/for IoT, Cloud computing, Cyber physical systems. Our group carries out top-quality research on human-centric software security - excellence and industry impact. We have developed significant capabilities in digital twins, defence systems, and AgFood systems as core domains – The presentation is based on our studies in one of the key areas of our R&D – Data-Driven Software Security Solutions.
  2. To start with, I'll show you these icons that many of you would recognize. They are real-world mission systems of many large tech companies in the world. Software underpinning such mission systems has to be trustworthy and secure – security and privacy are key concerns nowadays. Software underpinning such apps can also contain potential security risks (vulnerabilities); increasingly ML based approaches are being used for detecting software security risks – such solutions are inherently reliant on data.
  3. For example, last year, you may have heard about the Log4J vulnerability, which took the entire Internet by storm. Log4J vulnerability’s exploitation could have affected millions of systems around the world. And this’s just one example among the thousands of critical vulnerabilities discovered every single year. So you can see how much damage software vulnerabilities can cause if we don’t detect and address them on time.
  4. Actually, 90% of cyber incidents are caused by software vulnerabilities. White House EO 14028, “Enhancing software supply chain security”. Can SBOM alone solve this problem?
  5. We are developing tools and techniques and distilling practices to provide early information about software vulnerabilities for both expert and non-expert users. We leverage various data sources to develop AI-powered techniques to automatically extract insights into the whole vulnerability lifecycle, ranging from early detection of these vulnerabilities, to assessing them in terms of their probability of exploitation and impacts, and then giving recommendations to developers to plan and prioritize the mitigation and fixing. And currently, our research targets both traditional and contemporary AI-based systems as well as the supporting software infrastructure such as Docker containers for these systems.
  6. So as you can see from the previous slide, data-driven software security is one of the key research areas at CREST.
  7. So as you can see from the previous slide, data-driven software security is one of the key research areas at CREST. Here’re the A/A* publications we had in the area grouped by relevant tasks.
  8. To develop SVP models, the quality of input data is an integral requirement. In reality, it's very challenging to have a perfectly clean dataset of all vulnerabilities as current data collection heuristics are still far from perfect and new vulnerabilities are constantly being discovered, including in existing code. This means that we're uncertain about the actual labels of many samples. With such noisy and imperfect data, it's very difficult to develop reliable models. [click-1] Our research activities are focused on understanding and addressing data quality issues in order to make the downstream data-driven models more reliable and trustworthy. We have also designed/developed robust learning techniques to make the models more tolerant to data noise and imperfections. However, that is not the focus of today’s talk.
  9. We reported the results of this SLR as a TSE journal paper.
  10. Column “Research Domain”: shows the current hot topics in MSR research Column “Data Sources”: shows the data sources that normally needs to be mined in each research domain Column “Goals”: shows what goals MSR researchers pursue in their research Column “Characteristics of Solutions”: shows common characteristics of MSR solutions
  11. C/C++, PHP and Java are the most commonly studied from an ML4S perspective; low abstraction/web apps Vulnerability types may play a role in making a model better or worse, e.g., SQL injection or XSS 6 levels of granularity – component, file/class, function/method, program slice, statement and commit.
  12. 1.a) Other than using relevant APIs for data extraction, MSR researchers have to implement their pipelines from scratch to extract the knowledge.
1.b) During MSR research, extracted data needs to be kept up to date to capture the true essence of any change, especially when the source repository is continuously being developed.
1.c) Sometimes the data scraper (code) itself needs to be updated because the owner of the repository might change the data structure (applying such changes to the code is very time-consuming).
2.a) Different version control systems follow their own unique control parameters; for this reason, pre-processing data from multiple version control systems is very tricky.
2.b) Companies and developers use various tools in their environments. These tools may have very similar functionalities, but each has its own unique set of attributes.
2.c) The collected data must be checked against quality metrics. For example, the source repository might maintain duplicated data, or the granularity of the data may be inconsistent: sometimes the data has been collected in great detail and sometimes only in minor detail.
  13. In vulnerability analytics research, access to vulnerability data is essential.
1.a) For instance, given the number of data fields available in NVD, the searching capability of the official API needs to be more advanced.
1.b) A researcher might miss vital information to include in their research because new vulnerabilities are being added to the DB every hour.
1.c) No more updated snapshots of the DB are available from this date, so everyone who needs the data is forced to use the API, and all the previously developed data scrapers become useless.
2.a) While the data on NVD has been produced by security specialists, many inconsistencies have been observed. For example, the provided list of affected product versions is unreliable. The researcher has to check other resources, such as the “release notes” of the affected product, to find the accurate vulnerable versions.
3.a) Each vulnerable product (if open-source) on NVD might have a corresponding repository on GitHub, but the external links to the particular commit (e.g., the fix commit) are usually not provided. The researcher has to search GitHub manually to find the related repository for performing deeper analysis at the source code level.
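To make the API-based access concrete, here's a minimal sketch of building a paginated keyword query against the NVD CVE API 2.0. The endpoint and parameter names (`keywordSearch`, `startIndex`, `resultsPerPage`) follow NVD's public documentation but should be verified against the current docs; the `build_query` helper itself is hypothetical.

```python
from urllib.parse import urlencode

# NVD CVE API 2.0 base endpoint; parameter names per NVD's public docs
# (verify against the current documentation before relying on them).
NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def build_query(keyword: str, start_index: int = 0, per_page: int = 200) -> str:
    """Build a paginated keyword query URL; pagination is needed because
    the DB grows hourly and a full snapshot is no longer distributed."""
    params = {
        "keywordSearch": keyword,
        "startIndex": start_index,
        "resultsPerPage": per_page,
    }
    return f"{NVD_API}?{urlencode(params)}"

url = build_query("buffer overflow", start_index=200)
```

A scraper would then walk `startIndex` in steps of `resultsPerPage` until the reported total is exhausted.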
  14. [click-1] As you recall, the key tasks we’re investigating at CREST in software security all involve data. In the literature, many datasets have been curated to support the development and evaluation of AI-based approaches for vulnerability management. But we also know that "garbage in, garbage out": if we don't ensure the quality of the input data, we can't guarantee the performance or even the validity of the resulting models. Have you ever wondered about the quality of the data we're using for vulnerability management? This question has also motivated us to investigate the quality of data being used for data-driven vulnerability management.
  15. The short answer is that, unfortunately, the current data still has many problems, many of which are not obvious. These problems can negatively affect data-driven approaches for vulnerability management and security in general. [click-1] In this talk, I'd like to share these challenges with you. [click-2] At the end of the presentation, I'd also like to share some recommendations to avoid or tackle these challenges in the future.
  16. These challenges and recommendations have also been published in the top-tier IEEE Transactions on Software Engineering (TSE) journal.
  17. Through our extensive research, we've encountered 10 key challenges affecting the quality of security/vulnerability data. I'll cover each challenge in the later slides.
  18. For each challenge, I'll focus on these three aspects: what, why and so what. I'll first introduce the symptoms of the challenge, then explain why such a challenge commonly exists in the field, and finally highlight the potential negative impacts the challenge can have on the models being used for vulnerability management or security in general.
  19. Data scarcity: vulnerability/security datasets are much smaller compared to other domains. The number of security issues ranges from only hundreds to thousands, so identifying security issues is similar to finding a needle in a haystack. [click-1] One reason can be that security is not always the key focus of developers. Developers tend to focus on other types of bugs; developers may lack training in, or awareness of, the importance of addressing these issues. Much of the current data is curated automatically using tools that are far from perfect. An example is searching for missing/hidden vulnerability-fixing commits using predefined keyword matching in the commit message. Because a word can have multiple meanings, it's hard to balance precision (the relevance of the retrieval) and recall (the completeness of the retrieval). It's also worth noting that, by nature, some types of security issues simply occur less frequently in real life. [click-2] Imbalanced and insufficient data prevent models from inferring enough patterns to perform well. For example, from the lower image here, data on the confidentiality, integrity, and availability of a system are rare. Again, it's been shown that having more (high-quality) data is usually better than having the most sophisticated model, if we can afford it.
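The precision/recall trade-off of keyword matching can be illustrated with a small sketch. The keyword list and the toy commit messages below are hypothetical; real curation pipelines use much larger vocabularies plus manual validation.

```python
import re

# Hypothetical keyword list; real pipelines use far larger vocabularies.
SEC_KEYWORDS = re.compile(
    r"\b(vulnerab|security|cve-\d{4}-\d+|xss|overflow|injection)\w*", re.I
)

def is_security_fix(message: str) -> bool:
    """Flag a commit as a candidate vulnerability fix via keyword matching."""
    return bool(SEC_KEYWORDS.search(message))

# Toy ground truth: (commit message, truly a security fix?)
commits = [
    ("Fix buffer overflow in parser (CVE-2021-1234)", True),
    ("Sanitize user input to prevent XSS", True),
    ("Bump version to 2.0.1", False),
    ("Harden TLS configuration", True),       # missed: no keyword -> hurts recall
    ("Add security notes to README", False),  # matched: doc-only -> hurts precision
]

preds = [is_security_fix(m) for m, _ in commits]
tp = sum(p and y for p, (_, y) in zip(preds, commits))
fp = sum(p and not y for p, (_, y) in zip(preds, commits))
fn = sum((not p) and y for p, (_, y) in zip(preds, commits))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Broadening the keyword list raises recall but pulls in more irrelevant commits, which is exactly the tension described above.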
  20. Data validity is required to avoid inaccurate or mislabelled samples. Security issues can be mislabelled as non-security ones, as developers may not be aware of security issues. Non-security issues can also be mislabelled as security ones, but from our observations, this is less likely to happen. Vulnerabilities may remain hidden for years; for example, a vulnerability introduced in 2011 was only fixed in 2015. [click-1] If a vulnerability is not detected, no one knows it exists. There is no perfect testing oracle or tool to detect them all. Lack of reporting may result in vulnerable code being used without updates; standardised reporting is needed. When security and non-security issues are fixed together (tangled commits), data mislabelling happens. Without manual validation, it's very hard to separate them and derive the correct labels. [click-2] Data inaccuracy would make models learn the wrong patterns. For example, a developer is not aware of a certain type of vulnerability, and thus consistently mislabels issues of this category. We've shown that these mislabelled cases can change the decision boundaries of vulnerability prediction models. It's also worth noting that these systematic errors can be exploited by attackers. For instance, attackers can craft code containing the usually mislabelled vulnerabilities to bypass the AI-based detection tool and penetrate the system.
  21. Data irrelevance: data co-existing with vulnerable code but not directly related to the vulnerability. [click-1] For example, when labelling security issues from vulnerability-fixing commits, some of the files are actually not for fixing any bugs but only for updating versions, as shown in the figure here. Another case is that vulnerabilities mostly stem from issues in source code, so code comments are unlikely to contribute to vulnerabilities. If comments aren't removed, they can confuse models into learning that code comments can also cause vulnerabilities. [click-2] AI modellers may not have sufficient domain knowledge to tell what is relevant and what is not. Little effort is spent on exploratory data analysis to first understand the data and the problem. [click-3] Construct validity issues arise, as the data is supposed to represent vulnerabilities, but irrelevant data would break this assumption. From the model perspective, as mentioned before, model overfitting to wrong patterns would likely happen when learning from irrelevant data.
  22. A way to increase the number of security issues is to consider more branches in a project, or more projects, which leads to data redundancy. [click-1] When developers create a new branch for developing/testing features, unresolved vulnerabilities in the main/master branch will be carried over to the new branch. When collecting vulnerabilities in all the branches of a project, multiple similar copies of vulnerabilities would be included in the dataset. When vulnerabilities are fixed in the original project, the same vulnerability is usually not fixed in the cloned projects. Data redundancy can also make its way into source code through artifact renaming or cosmetic/non-functional changes, such as commenting or adding spaces/newlines to source code. [click-2] Data redundancy may change the true data distribution. For example, redundancy increases the frequency of certain vulnerability types, making models overfit/biased towards those cases, as in the data irrelevance challenge. More seriously, if the same samples/security issues exist in both the training and testing sets, models will likely predict these cases correctly, artificially inflating the performance; model performance can drop by up to 13% after removing these redundant samples.
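One common mitigation is to normalise away cosmetic differences before de-duplicating. Below is a minimal sketch (the function names are ours, not from any specific tool) that strips C-style comments, collapses whitespace, and then hashes each sample so cosmetic clones collapse to a single entry.

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Strip C-style comments and collapse whitespace so that cosmetic
    variants of the same function normalise to the same string."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", code, flags=re.S)
    return re.sub(r"\s+", " ", code).strip()

def dedup(samples):
    """Keep only the first sample per normalised-content hash."""
    seen, unique = set(), []
    for code in samples:
        h = hashlib.sha256(normalize(code).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(code)
    return unique

clones = [
    "int f(int x) { return x + 1; }",
    "int f(int x) {\n    return x + 1;  // off-by-one fix\n}",  # cosmetic clone
    "int g(int x) { return x - 1; }",
]
unique = dedup(clones)  # the cosmetic clone of f() is dropped
```

Token- or AST-level normalisation would additionally catch clones produced by identifier renaming, which this simple text-level pass does not.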
  23. Data mis-coverage: a vulnerability spanning multiple places. Different parts/contexts of vulnerabilities are scattered across multiple functions, files or modules, as these parts are inter-dependent (code dependency). Most current studies focus on detecting vulnerabilities in a single file or function, which fails to capture such code dependencies. For example, in the top figure here, this function is marked as vulnerable. The root cause of this vulnerability is the unsanitised/unchecked value variable that comes from the user's input. However, looking at this function alone, it'd be very hard to know the origin/context and the potential malicious nature of this variable. [click-1] One cause of this challenge is that fixing commits sometimes contain only partial fixes for a vulnerability, which doesn't provide the full picture/context of the vulnerability. From the ML perspective, a finer granularity of prediction allows researchers to have more data samples to train their models. Similarly, for developers, a finer granularity helps reduce inspection effort. However, researchers are sometimes not aware that, by doing so, the context of the vulnerability is discarded. [click-2] For example, in the lower figure here, a vulnerable module can lead to two and five times more samples if the prediction is performed at the file and function levels, respectively. Missing context doesn't only come from missing code in other functions/files/modules, but also from truncating code in the same function/file/module. This truncation practice is becoming more popular given the rise of recent Deep Learning models. These DL models usually require the input to be fixed-sized and of the same length. To reduce computational resources, any code longer than the predefined size is truncated, potentially omitting vulnerability context.
[click-3] Missing context makes it more challenging for models to detect complex types of vulnerabilities, as the models don't have complete information to identify them. This is similar to how a cylinder looks like a circle from the top but like a rectangle from the side.
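The fixed-size input constraint can be sketched as follows. The tiny `max_len` of 6 is for illustration only (real models often use 512 or 1,024 tokens), and the token sequence is assumed to come from an upstream tokeniser.

```python
def to_fixed_length(tokens, max_len=6, pad="<pad>"):
    """Truncate or pad a token sequence to a fixed length, as many DL
    pipelines do; tokens cut off may carry the vulnerability context."""
    if len(tokens) >= max_len:
        return tokens[:max_len]  # everything beyond max_len is lost
    return tokens + [pad] * (max_len - len(tokens))

# The sanitisation call sits at the end of the sequence...
tokens = ["buf", "=", "read", "(", ")", ";", "sanitize", "(", "buf", ")", ";"]
fixed = to_fixed_length(tokens)
# ...so truncation removes the very evidence that the input was sanitised,
# and the model sees what looks like unsanitised input handling.
```

Padding shorter sequences with a sentinel token, as shown, is the usual counterpart of truncation in these pipelines.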
  24. Synthetic datasets can increase the size of the data but may not be representative of real-world vulnerabilities. For example, we found that the ratios of vulnerabilities in synthetic data and real-world data are clearly different. [click-1] Synthetic data is not incorrect, but real-world vulnerabilities are much more complex, involving many dependencies, and thus harder to spot. Limited by the knowledge of known vulnerable patterns, there's no guarantee that synthetic data can represent the nature and complexity of real-world and even unknown vulnerabilities. [click-2] A recent study (Deep Learning based Vulnerability Detection: Are We There Yet?) by Chakraborty and co-workers shows that there's indeed a performance drop of up to 6 times from models trained on synthetic data when evaluated on real-world vulnerabilities. The results clearly demonstrate that high-performing models trained on synthetic data tend to lose their generalisability when transferred to real-world scenarios. Thus, the use of synthetic data to increase the size of vulnerability data still requires further research.
  25. The nature and characteristics of vulnerabilities/security issues also change over time, which we refer to as data drift. [click-1] The threat landscape evolves, for example, with the introduction of new attack vectors and/or new defence mechanisms. The nature of a project can also evolve over time. For instance, the implementation of a project may shift from one component to another depending on the focus/requirements of the application at the time. [click-2] Changing textual data leads to out-of-vocabulary or unseen words for models trained on obsolete data. If the changing nature is not considered when splitting the data, for example, when using k-fold cross-validation, a model may be trained on future data to predict past data, which is unrealistic in real-world settings. We showed that such future data leakage can inflate model performance by up to 5 times.
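A time-ordered split avoids this leakage; a minimal sketch (with synthetic sample dates) could look like:

```python
from datetime import date

def time_ordered_split(samples, train_frac=0.8):
    """Sort by date and split chronologically, so the model never trains
    on data that postdates its test set (unlike random k-fold splits)."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Synthetic samples with deliberately shuffled years, to show that
# sorting (not the incoming order) determines the split boundary.
data = [{"id": i, "date": date(2015 + (i * 3) % 8, 1, 1)} for i in range(10)]
train, test = time_ordered_split(data)
```

Every training sample now predates every test sample, which mirrors how a deployed model would actually be used.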
  26. Reproducibility requires access to the data. [click-1] Ironically, sometimes a paper provides a link to the dataset, but when we tried to access it, this is what we got (the top figure). [click-2] Research done in collaboration with a company is usually subject to the requirements of that company, including not disclosing the collected data. Sometimes the data is too large to be stored, so the authors only provide a reduced version of the original data, for example, sharing only the IDs of posts instead of the content of the whole posts. In that case, we need to do an extra step to collect the full content ourselves. However, it's worth noting that this is not always possible, since the values of some data change over time, for example, a developer's experience (which increases over time). [click-3] Data inaccessibility limits the reproducibility of existing results and is inconvenient when we want to compare a new approach with existing ones. We also observe that this accessibility issue is usually more common for closed-source data, making the literature seemingly overly focused on open-source data, and thus less generalisable.
  27. In case the data we need for our research is publicly available in its full form, should we be happy already? Probably yes, but that's not the end of the story yet. Reusing existing datasets can itself be a challenge. [click-1] The size of a dataset can be huge to download and store. For example, the Stack Overflow data dump with more than 20 million posts is more than 50 GB, and the history data we curated from hundreds of software projects for extracting vulnerability-fixing commits can be up to terabytes. Reusing existing data can also be a problem if the authors don't properly clean their data before sharing. [click-2] There is a lack of consideration of the data challenges in existing data. Specifically, these issues, if not detected and mitigated, can continue to propagate to current and even future studies, affecting the validity of the findings altogether. Reusing existing data as-is can also lead to obsolete and less generalisable results, as software data and the threat landscape constantly change over time.
  28. Ethical issues of data used in the vulnerability/security domain: data being misused by bad actors. A scenario that can occur is that researchers develop a fancy model to detect zero-day vulnerabilities and then release such vulnerabilities together with other artifacts (code & models). Open science should be responsible science. [click-1] Vulnerability disclosure is not always a requirement for authors, so open science may be counterproductive. Researchers may not fully comprehend the impact of detected vulnerabilities, as their focus is on detection rather than exploitation, but attackers always have bad intentions, so a small oversight can become a huge disaster very quickly. [click-2] This violates the ethics of the community. Attackers can exploit the disclosed vulnerabilities and cause damage to both developers and users of the affected systems. This may reduce the trust of the industry community in the research community, which may make it harder to obtain data later on (blocking access).
  29. From our experiments, we found that many real-world vulnerability datasets suffer from various data quality issues, and these issues can severely impact the performance of state-of-the-art vulnerability prediction models. For example, data inaccuracy can decrease performance by up to 80%. As a follow-up study, we also investigated existing automatic techniques to tackle such noise. We have found some success so far, with up to a 20% increase in performance, but we believe there are still plenty of opportunities for further advancing this new yet important direction.
  30. More focus on positive classes (vulnerable, e.g., bug reports) and almost no focus on the negative class (non-vulnerable); fully supervised learning needs both of them. Additional label sources are needed, e.g., static analysis to uncover mislabelled vulnerabilities, and differential analysis to increase the confidence of static analysis labels for vulnerability fixes. Relaxed labelling using semi-supervised methods, such as self-training or Positive and Unlabelled (PU) learning, can be applied to help uncover knowledge from the unlabelled set.
Timeliness has two types: currency of data means the age of the data in use compared with when it was collected; delayed data collection efforts make this worse. The temporal nature of data accumulation causes concept drift, so we need to preserve the data order for model training, as data and patterns change over time.
Irregular distribution of SVs: detection through visualisation gives better understanding and better SVP models.
Datasets are normally for single languages; semantic and syntactic differences create difficulties for applying models across languages. Bug seeding: import bugs from one language into a dataset of a semantically similar language.
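The self-training idea mentioned above can be sketched with a minimal loop. The `score` function below stands in for any model producing P(vulnerable), and the threshold and toy commit messages are illustrative only.

```python
def self_train(labelled, unlabelled, score, threshold=0.9):
    """One round of self-training: pseudo-label unlabelled samples whose
    predicted probability is confidently high or low, and fold them into
    the labelled pool; the rest stay unlabelled for later rounds."""
    pool = list(labelled)
    remaining = []
    for x in unlabelled:
        p = score(x)
        if p >= threshold or p <= 1 - threshold:
            pool.append((x, p >= 0.5))  # pseudo-label from the confident score
        else:
            remaining.append(x)
    return pool, remaining

# Toy scorer: a stand-in for a classifier trained on the labelled set.
def score(msg):
    if "overflow" in msg:
        return 0.95
    if "readme" in msg:
        return 0.05
    return 0.5

labelled = [("sanitize input to stop xss", True)]
unlabelled = ["fix overflow in decoder", "update readme", "refactor tests"]
pool, remaining = self_train(labelled, unlabelled, score)
```

In a full pipeline, the model would be retrained on the enlarged pool and the loop repeated until no new samples cross the confidence threshold.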