Security Data Quality Challenges
1. Building ML Based Software Security Solutions:
“Data Wrangling” Lessons and Recommendations
Ali Babar
CREST – Centre for Research on Engineering Software Technologies
USC, 9th October, 2023
2. • CREST’s Data Centric Software Security Research
• Data-Centric Software Security Quality Assurance
• Data Quality Problems Experienced/Observed
• Some Recommendations to Deal with Challenges
Outline
5. • Approximately 90% of cyber incidents are caused by vulnerabilities rooted in software – proprietary or sourced
• A Software Bill of Materials (SBOM) alone is proving ineffective in answering critical questions
• Q1: Do we really know what’s in software coming into the organisation?
• Q2: How do we establish trust and preserve the security of software coming
into the organisation?
Software Vulnerabilities and Cybersecurity Incidents
7. Data-Driven Software Security at CREST
Software Vulnerability Prediction
Software Vulnerability Assessment & Prioritisation
Software Vulnerability Knowledge Support
1. A survey on data-driven software vulnerability assessment and prioritization (CSUR '22)
2. On the use of fine-grained vulnerable code statements for software vulnerability assessment models (MSR '22)
3. An investigation into inconsistency of software vulnerability severity across data sources (SANER '22)
4. DeepCVA: Automated commit-level vulnerability assessment with deep multi-task learning (ASE '21)
5. Automated software vulnerability assessment with concept drift (MSR '19)
1. LineVD: statement-level vulnerability detection using graph neural networks (MSR '22)
2. Noisy label learning for security defects (MSR '22)
3. KGSecConfig: A Knowledge Graph Based Approach for Secured Container Orchestrator Configuration (SANER '22)
4. An empirical study of rule-based and learning-based approaches for static application security testing (ESEM '21)
1. An empirical study of developers’ discussions about security challenges of different programming languages (EMSE '22)
2. Well begun is half done: an empirical study of exploitability & impact of base-image vulnerabilities (SANER '22)
3. PUMiner: Mining security posts from developer question and answer websites with PU learning (MSR '20)
4. A large-scale study of security vulnerability support on developer Q&A websites (EASE '21)
8. No perfectly clean dataset of vulnerabilities:
• Label noise
• False positives of data collection
• Constantly discovered new vulnerabilities
• Data noise (e.g., duplicates)
Assumption vs. Reality
GOALS
• Robust & noise-tolerant learning techniques
• Data Quality Assessment & Analysis
Data Quality for Data-Driven Software Security
1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*)
2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*)
3. Noisy label learning for security defects, MSR '22 (CORE A)
4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
10. • Data Requirements determine the types and sources of data for building a model
• “Data Wrangling” (collection, labelling, and cleaning) steps of the ML workflow
• “Data Wrangling” (or preparation) can take up to 25% of an industry project's time
Data Preparation for ML Based Security Solutions
1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*)
2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*)
3. Noisy label learning for security defects, MSR '22 (CORE A)
4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
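The collection → labelling → cleaning steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: the record fields and the keyword heuristic are hypothetical, not the tooling used in the cited studies.

```python
# Minimal sketch of the "data wrangling" stages for a vulnerability dataset.
# All record fields and heuristics here are hypothetical illustrations.

def collect(raw_commits):
    """Collection: keep commits that touch source files at all."""
    return [c for c in raw_commits if c["files_changed"] > 0]

def label(commits, keywords=("vulnerability", "cve", "overflow")):
    """Labelling: a keyword heuristic over commit messages (noisy!)."""
    for c in commits:
        msg = c["message"].lower()
        c["vulnerable"] = any(k in msg for k in keywords)
    return commits

def clean(commits):
    """Cleaning: drop exact duplicates by (message, diff) content."""
    seen, out = set(), []
    for c in commits:
        key = (c["message"], c["diff"])
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out

raw = [
    {"message": "Fix CVE-2021-1234 buffer overflow", "diff": "a", "files_changed": 2},
    {"message": "Fix CVE-2021-1234 buffer overflow", "diff": "a", "files_changed": 2},
    {"message": "Update docs", "diff": "b", "files_changed": 1},
]
dataset = clean(label(collect(raw)))
print(len(dataset), dataset[0]["vulnerable"])  # 2 True
```

Each stage is a source of the quality problems discussed later: collection brings false positives, keyword labelling brings label noise, and incomplete cleaning leaves redundancy.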
11. • Software Vulnerability Prediction (SVP) approaches purport to learn from history and predict SVs
• Prediction approaches are becoming popular as early-lifecycle software security assurance techniques
• SVP models may or may not analyse program syntax and semantics; the latter leverage DL
• Being ML-dependent, SVP needs data preparation as per the ML workflow shown on the last slide
• SVP data preparation needs several important considerations, including source and labelling
Data-Centric Software Security Assurance
1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*)
2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*)
3. Noisy label learning for security defects, MSR '22 (CORE A)
4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
12. • Data requirements vary depending upon the context and the capabilities needed of an ML model
• Data may be collected from real-world code, synthetic code, or a mix – for training/testing models
• Trade-off between scarcity and realism
• Gathered data needs labelling – provided by developers (NVD), tool-based, or based on patterns
• Labelling the non-vulnerable class is problematic
• Data cleaning is required to obtain a consistent format and to reduce noise from collected/labelled data
Data Preparation Considerations for SVP
18. Data Scarcity
• Hundreds/thousands of security issues vs. millions of images
• Security issues < 10% of all reported issues (even worse
for new projects)
Info
Causes
• Lack of explicit labeling/understanding of security issues
• Imperfect data collection (precision vs. recall vs. effort)
• Rare occurrence of certain types of security issues
Impacts
• Leading to imbalanced data
• Lacking data to train high-performing ML models
• More data beats a cleverer algorithm
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, 350-361.
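A common first response to such imbalance is to re-weight the minority (security) class during training. A minimal sketch of the balanced-weight heuristic, w_c = n_samples / (n_classes × count_c), which is the same rule scikit-learn uses for `class_weight='balanced'` (the label proportions below are hypothetical):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical issue-tracker labels: ~5% security issues.
labels = ["security"] * 5 + ["non-security"] * 95
weights = balanced_class_weights(labels)
print(weights)  # security ≈ 10.0, non-security ≈ 0.53
```

Re-weighting helps a model attend to rare security samples, but it cannot substitute for collecting more high-quality data, as the last bullet notes.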
19. Data (In)Accuracy
• (Non-)Security issues not labelled as such
• FN: Critical vulnerabilities left unfixed for a long time (> 3 yrs)
• FP: Wasting inspection effort
Info
Causes
• We don't know what we don't know (unknown unknowns)
• Lack of reporting or silent patch of security issues
• Tangled changes (fixing non-sec. & sec. issues together)
Impacts
• Criticality: Systematic labeling errors > random errors
• Making ML models learn the wrong patterns
• Introducing backdoors into ML models
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
R. Croft, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 435-447.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Figure: model decision boundaries without vs. with latent SVs
20. Data (Ir)Relevance
• Not all input is useful for predicting security issues
• Ex1: Code comments for predicting vulnerabilities?!
• Ex2: A file containing fixed version update is vulnerable?!
Info
Causes
• Lack of data exploratory analysis
• Lack of domain expertise (e.g., NLPers working in SSE)
• Trying to beat the state-of-the-art
Impacts
• Negatively affecting the construct validity
• Reducing model performance (e.g., code comments
reduced SVP performance by 7x in Python)
Security fixes (87c89f0) in the Apache jspwiki project
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
21. Data Redundancy
• Same security issues found across different software versions,
branches, & even projects
Info
Causes
• Cloned projects from mature projects (e.g., Linux kernel)
• Merged code from feature branches into the master branch
• Renamed files/functions with the same code content
• Cosmetic-only changes (different white spaces or new lines)
Impacts
• Limiting learning capabilities of ML models
• Leading to bias and overfitting for ML models
• Inflating model performance (same training/testing samples)
Figure: thousands of cloned projects sharing the same vulns as the Linux kernel
Figure: vulnerability prediction performance before & after removing the redundancies in common datasets
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
R. Croft, et al., Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, 1-12.
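Cosmetic-only duplicates (extra whitespace or newlines) can be caught by hashing whitespace-normalised code. A minimal sketch with hypothetical samples:

```python
import hashlib
import re

def content_key(code: str) -> str:
    """Hash code after collapsing whitespace, so cosmetic-only
    variants (extra spaces/newlines) map to the same key."""
    normalised = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalised.encode()).hexdigest()

samples = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a,  int b) {\n    return a + b;\n}",   # cosmetic variant
    "int sub(int a, int b) { return a - b; }",
]
unique = {content_key(s): s for s in samples}
print(len(unique))  # 2
```

Crucially, such deduplication must be applied before splitting into training and testing sets, otherwise the same sample can land on both sides and inflate performance.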
22. Data (Mis)Coverage
• Security issues spanning multiple lines, functions, files,
modules, versions, & even projects, but...
• Current approaches mostly focus on single function/file
Info
Causes
• Partial security fixes in multiple versions
• For ML: coverage vs. size (↑ as granularity ↓)
• For Devs: coverage vs. convenience (inspection effort)
• Fixed-size (truncated) input required for some (DL) models
Impacts
• Lacking context for training ML models to detect complex
(e.g., intra-function/file/module) security issues
• Incomplete data for ML as security-related info. truncated
Figure: a diff (deleted/added lines) where a lone function cannot reveal whether a variable carries a user's malicious input
Figure: Module A contains Files a & b with functions Func_a1, Func_a2, Func_b1, Func_b2, Func_b3; assuming Module A is vulnerable:
• Vuln. module: 1
• Vuln. files: 2
• Vuln. functions: 5
T. H. M. Le, et al., Proceedings of the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, 621-633.
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
Cylinder vs. Circle vs. Rectangle?
23. Data (Non-)Representativeness
• Real-world security issues (e.g., NVD) vastly different from
synthetic ones (e.g., SARD)
• Varying characteristics of security issues across projects
Info
Causes
• Synthetic data: local & created by pre-defined rules
• Real-world data: inter-dependent & complex
• Different features & nature between apps
Impacts
• Perf. (F1) gap: Synthetic (0.85) vs. Real-world (0.15)
• Lack of generalisability & transferability of ML models
• Within-project prediction > Cross-project prediction
R. Croft, et al., IEEE Transactions on Software Engineering, 2022.
S. Chakraborty, et al., IEEE Transactions on Software Engineering, 2021.
Figure: synthetic data vs. real-world data
24. Data Drift
• Unending battle between attackers & defenders
• Evolving threat landscapes → changing characteristics of security issues over time
Info
Causes
• New terms for emerging attacks, defenses, & issues
• Changing software features & implementation over time
Impacts
• Out-of-vocabulary words → degrading performance
• Data leakage → unrealistic performance (up to ~5 times overfitting) using non-temporal evaluation techniques
R. Croft et al., Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, 121-133.
R. Croft, et al., Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, 338-348.
T. H. M. Le, et al., Proceedings of the 16th International Conference on Mining Software Repositories, 2019, 371-382.
Figure: new threats/affected products in NVD over time
Figure: non-temporal evaluation leads to (future) data leakage into the test set
25. Data (In)Accessibility
• Security data not always shared
• Shared yet incomplete data → re-collected data may be different from the original data
Info
Causes
• Privacy concerns (e.g., commercial projects) or not?!
• Too large size for storage (artifact ID vs. artifact content)
• Data values can change over time (e.g., devs' experience)
Impacts
• Limited reproducibility of the existing results
• Limited generalisability of the ML models (e.g., open-source vs. closed-source data)
"We have anonymously uploaded the database
to https://www.dropbox.com/s/anonymised_link
so the reviewers can access the raw data
during the review process. We will release the
data to the community together with the paper."
Figure: clicking the shared Dropbox link leads to an inaccessible page
R. Croft, et al., Empirical Software Engineering, 2022, 27: 1-52.
T. H. M. Le, et al., Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2021, 109-118.
26. Data (Re-)Use
• Finding a needle (security issue) in a haystack (raw data)
• And ... haystack can be huge
• Reuse existing data > Update/collect new data
Info
Causes
• Completely trusting/relying on existing datasets
• Unsuitable infrastructure to collect/store raw data
Impacts
• Making ML models become obsolete & less generalised
• Being unaware of the issues in the current data
Figure: error propagation into follow-up studies that reuse existing datasets
Examples: 1,229 VCCs (> 1 TB) from 200 real-world Java projects and their clone projects; ~70k security posts mined from ~20 million posts (> 50 GB)
T. H. M. Le, et al., Proceedings of the 36th International Conference on Automated Software Engineering (ASE), 2021, 717-729.
T. H. M. Le, et al., Proceedings of the 17th International Conference on Mining Software Repositories, 2020, 350-361.
27. Data Maliciousness
• Threat/security data is itself a threat (e.g., new vulns)
• Using/sharing threat data without precautions
Info
Causes
• Simply an oversight / afterthought
• Private experiment vs. Public disclosure (Big difference!)
• Open science vs. Responsible science
Impacts
• Violating ethics of the community
• Shared security data may be exploited by attackers
• Affecting maintainers & users of target systems
An example of malicious/irresponsible data sharing:
1. Researchers develop a SOTA ML-based vulnerability prediction model
2. They identify new vulnerabilities using the model
3. They don't report the vulnerabilities to project maintainers to fix
4. They move on to submit & publish the paper
5. The identified vulns later get exploited by attackers
28.
Dataset Accuracy Uniqueness Consistency Completeness Currentness
Big-Vul 0.543 0.830 0.999 0.824 0.761
Devign 0.800 0.899 0.991 0.944 0.811
D2A 0.286 0.021 0.531 0.981 0.844
Significantly reduce SVP performance, e.g., data accuracy (30 – 80% ↓ in MCC)
Automatic data cleaning: ↑ performance by ~20%, but still far from perfect
Prevalent noise in current real-world vulnerability datasets
1. Data Quality for Software Vulnerability Datasets, ICSE '23 (CORE A*)
2. Data preparation for software vulnerability prediction: A systematic literature review, TSE '22 (A*)
3. Noisy label learning for security defects, MSR '22 (CORE A)
4. An investigation into inconsistency of software vulnerability severity across data sources, SANER '22 (CORE A)
Data Quality for Data-Driven Software Security
30. • Identification of Missing Vulnerability Data
• Automatic labeling of silent fixes & latent vulnerabilities (beware of false positives)
• Consideration of Label Noise
• Noisy Label Learning and/or Semi-Supervised Learning (small clean data & large unlabelled data)
• Consideration of Timeliness
• Use up-to-date labelled data & more positive samples; preserve the data sequence for training
• Use of Data Visualization
• Try to achieve better data understandability for non-data-scientists
• Creation and Use of Diverse Language Datasets
• Bug seeding into semantically similar languages
• Use of Data Quality Assessment Criteria
• Determine and use specific data quality assessment approaches
• Better Data Sharing and Governance
• Provide exact details and processes of data preparation
Recommendations for Dealing with Data Quality Issues
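The "small clean data & large unlabelled data" setting above can be sketched with a minimal self-training loop: fit on the clean set, pseudo-label only the unlabelled samples the model is confident about, and repeat. The 1-D "feature", centroid classifier, and margin threshold are toy assumptions for illustration, not a specific published method:

```python
# Self-training sketch: small clean labelled set + large unlabelled pool.

def centroid_classifier(labelled):
    """Fit per-class means of a single numeric feature."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(centroids, x):
    """Predict the nearest class and how decisive the choice was."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin

def self_train(clean, unlabelled, min_margin=2.0, rounds=3):
    labelled = list(clean)
    pool = list(unlabelled)
    for _ in range(rounds):
        centroids = centroid_classifier(labelled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict_with_margin(centroids, x)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:          # nothing decisive left: stop early
            break
        labelled.extend(confident)
        pool = [x for x, _ in rest]
    return labelled

clean = [(1.0, "non-sec"), (9.0, "sec")]   # small clean set
unlabelled = [0.5, 1.5, 8.5, 9.5, 5.0]     # large unlabelled pool
print(len(self_train(clean, unlabelled)))
```

Note how the ambiguous sample (5.0) is never pseudo-labelled: refusing to label low-confidence samples is exactly the "beware of false positives" caveat from the first recommendation.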
31. Acknowledgements
• This talk is based on the research studies carried out by the CREST researchers,
particularly by Roland Croft, Triet Le, Yongzheng Xie, Mehdi Kholoosi.
• Triet Le led the efforts for developing the content for this presentation
• Discussions in the SSI cluster of the CREST provided insights included in this presentation
• Our research is partially funded by the Cybersecurity CRC
• We are grateful to the students, RAs and Open Source data communities
Our R&D is focused on addressing the most pressing challenges of developing and evaluating new knowledge and tools for engineering software intensive systems with/for IoT, Cloud computing, Cyber physical systems.
Our group carries out top quality research on human-centric software security - excellence and industry impact.
We have developed significant capabilities in digital twins, defence systems, and AgFood systems as core domains.
The presentation is based on our studies in one of the key areas of our R&D – Data-Driven Software Security Solutions.
To start with, I'll show you these icons that many of you would recognize. They are real-world mission systems of many large tech companies in the world.
Software underpinning such mission systems has to be trustworthy and secure – security and privacy are key concerns nowadays.
Software underpinning such apps can also contain potential security risks (vulnerabilities); increasingly ML based approaches are being used for detecting software security risks – such solutions are inherently reliant on data.
For example, last year, you may have heard about the Log4J vulnerability, which took the entire Internet by storm.
Log4J vulnerability’s exploitation could have affected millions of systems around the world.
And this’s just one example among the thousands of critical vulnerabilities discovered every single year. So you can see how much damage software vulnerabilities can cause if we don’t detect and address them on time.
Actually, 90% of cyber incidents are caused by software vulnerabilities.
White House EO 14028, “Enhancing software supply chain security”
Can SBOM alone solve this problem?
We are developing tools and techniques and distilling practices to provide early information about software vulnerabilities for both expert and non-expert users.
We leverage various data sources to develop AI-powered techniques to automatically extract insights into the whole vulnerability lifecycle, ranging from early detection of these vulnerabilities, to assessing them in terms of their probability of exploitation and impacts, and then giving recommendations to developers to plan and prioritize the mitigation and fixing.
And currently, our research targets both traditional and contemporary AI-based systems as well as the supporting software infrastructure such as Docker containers for these systems.
So as you can see from the previous slide, data-driven software security is one of the key research areas at CREST. Here’re the A/A* publications we had in the area grouped by relevant tasks.
To develop SVP models, the quality of input data is a very integral requirement.
In reality, it's very challenging to have a perfectly clean dataset of all vulnerabilities as current data collection heuristics are still far from perfect and new vulnerabilities are constantly being discovered, including in existing code.
This means that we're uncertain about the actual labels of many samples. With such noisy and imperfect data, it's very difficult to develop reliable models.
[click-1] Our research activities are focused on understanding and addressing data quality issues in order to make the downstream data-driven models more reliable and trustworthy.
We have also designed/developed robust learning techniques to make the models more tolerant to data noise and imperfections. However, that is not the focus of today’s talk.
We reported the results of this SLR as a TSE journal paper.
Column “Research Domain”: shows the current hot topics in MSR research
Column “Data Sources”: shows the data sources that normally need to be mined in each research domain
Column “Goals”: shows what goals MSR researchers pursue in their research
Column “Characteristics of Solutions”: shows common characteristics of MSR solutions
C/C++, PHP and Java are the most commonly studied languages from an ML4S perspective; low abstraction/web apps
Vulnerability types may play a role in making a model better or worse, e.g., SQL injection or XSS
6 levels of granularity – component, file/class, function/method, program slice, statement and commit.
1.a) Other than using relevant APIs for data extraction, MSR researchers have to implement their pipelines from scratch to extract the knowledge.
1.b) During the MSR research, extracted data needs to be kept up to date to capture the true essence of any change, especially when the source repository is continuously being developed.
1.c) Sometimes the data scraper (code) itself needs to be updated because the owner of the repository might change the data structure (applying changes to the code is very time-consuming)
2.a) Different version control systems follow their unique control parameters; for this reason, pre-processing data from multiple version control systems is very tricky
2.b) Companies and developers use various tools in their environment. These tools may have very similar functionalities, but they have their own unique set of attributes
2.c) the collected data must be checked against quality metrics. For example, the source repository might maintain duplicated data, or the granularity of data is not consistent; sometimes the data has been collected in great detail and sometimes only at a coarse level
In vulnerability analytics research, access to vulnerability data is essential.
1.a) for instance: given the number of data fields available in NVD, the searching capability in the official API needs to be more advanced.
1.b) A researcher might lose vital information to include in their research because new vulnerabilities are being added to the DB every hour.
1.c) No more updated snapshots of the DB are available from this date, so everyone who needs the data is forced to use their API, and all the previously developed data scrapers will be useless.
2.a) While the data on NVD has been produced by security specialists, many inconsistencies have been observed. For example, the provided list of affected product versions is unreliable. The researcher has to check other resources, such as “release notes” of that affected product to find the accurate vulnerable versions.
3.a) Each vulnerable product (if open-source) on NVD might have a corresponding repository on GitHub. But the external links to that particular commit (e.g., fix commit) are usually not provided. The researcher has to search GitHub manually to find the related repository for performing deeper analysis at the source code level.
[click-1] As you recall, the key tasks we’re investigating at CREST in software security involves data. In the literature, there've been many datasets curated to support development and evaluation of AI-based approaches for vulnerability management.
But we also know that "garbage in, garbage out", meaning that if we don't ensure the quality of the input data, we can't guarantee the performance and even validity of the resulting models.
Have you ever wondered about the quality of the data we're using for vulnerability management? This question has also motivated us to investigate the quality of data being used for data-driven vulnerability management.
The short answer is that, unfortunately, the current data still has many problems, many of which are not obvious. These problems can negatively affect the data-driven approaches for vulnerability management and security in general.
[click-1] And in this talk, I'd like to share these challenges with you.
[click-2] And at the end of the presentation, I'd like to share some of the recommendations to avoid or tackle these challenges in the future.
These challenges and recommendations have also been published in the top Transactions on Software Engineering journal.
Through our extensive research, we've encountered 10 key challenges affecting the quality of security/vulnerability data. I'll cover each challenge in the later slides.
For each challenge, I'll focus on these 3 aspects, what, why and so what.
I'll first introduce the symptoms of the challenges, then I'll explain why such challenge commonly exists in the field, and finally highlight the potential negative impacts the challenges can have on the models being used for vulnerability management or security in general.
Data scarcity - vulnerability/security dataset is much smaller compared to other domains.
The number of security issues ranges from only hundreds to thousands - identifying security issues is similar to finding a needle in a haystack.
[click-1] One reason can be that security is not always the key focus of developers. Developers tend to focus on other types of bugs; developers may lack training/awareness of the importance of addressing these issues.
Many of the current data are curated automatically using tools that are far from perfect. An example for this is the tool to search for missing/hidden vulnerability fixing commits using predefined keyword matching in the commit message. Because a word can have multiple meanings, it's hard to balance between precision (the relevance of the retrieval) and recall (the completeness of the retrieval).
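The precision/recall tension in keyword-based commit labelling can be illustrated with a toy sketch. The commit messages, ground-truth flags, and keyword sets below are all hypothetical:

```python
import re

# Toy commit messages with hypothetical ground truth ("is this a security fix?").
commits = [
    ("Fix heap overflow in parser (CVE-2020-0001)", True),
    ("Sanitise user input before SQL query", True),
    ("Fix typo in error message", False),
    ("Overhaul build system", False),
]

def keyword_label(message, keywords):
    return any(re.search(k, message, re.IGNORECASE) for k in keywords)

def precision_recall(keywords):
    tp = sum(1 for m, y in commits if y and keyword_label(m, keywords))
    fp = sum(1 for m, y in commits if not y and keyword_label(m, keywords))
    fn = sum(1 for m, y in commits if y and not keyword_label(m, keywords))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

strict = [r"\bCVE-\d{4}-\d+\b"]              # high precision, misses unreported fixes
loose = [r"overflow", r"sanitise", r"fix"]   # higher recall, pulls in non-security commits
print(precision_recall(strict), precision_recall(loose))
```

The strict pattern labels only explicitly CVE-tagged commits (perfect precision, half the recall here), while the loose list catches every security fix but also a typo fix, so either choice trades labelling effort against one kind of error.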
It's also worth noting that in nature, some types of security issues are just less frequent to occur in real life.
[click-2] Imbalanced and lacking data for models to infer sufficient patterns to perform well. For example, from the lower image here, data on the confidentiality, integrity, and availability of a system are rare.
Again, it's been shown that having more (high-quality) data is usually better than having the most sophisticated model if we can afford it.
Data validity requires avoiding inaccurate or mislabelled samples. Security issues can be mislabelled as non-security ones, as developers may not be aware of security issues.
Non-security issues can also be mislabelled as security ones, but from our observations, this is less likely to happen.
Vulnerabilities may remain hidden for years. For example, a vulnerability introduced in 2011 was only fixed in 2015.
[click-1] If a vulnerability is not detected, no one knows it exists. There is no perfect testing oracle or tool to detect them all.
Lack of reporting may result in vulnerable code being used without updates. Standardised reporting is needed.
When security and non-security issues are fixed together, data mislabelling happens - tangled commits. Without manual validation, it's very hard to separate them and derive the correct labels.
[click-2] Data inaccuracy would make models learn the wrong patterns. For example, a developer is not aware of a certain type of vulnerability, and thus consistently mislabels the issues of this category. We've shown that these mislabelled cases can change the decision boundaries of the vulnerability prediction models. It's also worth noting that these systematic errors can be exploited by attackers. For instance, attackers can draft code containing the usually mislabelled vulnerabilities to bypass the AI-based detection tool and penetrate the system.
Data co-existing with vulnerable code, but not directly related to vulnerability.
[click-1] For example, when labelling security issues from vulnerability-fixing commits, some of the files are actually not for fixing any bugs, but only for updating versions instead, as shown in the figure here. Another case is that vulnerabilities mostly stem from issues in source code, so code comments are unlikely to contribute to vulnerabilities. If comments aren't removed, they can mislead models into learning that code comments can also cause vulnerabilities.
[click-2] AI modellers may not have sufficient domain knowledge to tell what is relevant and what is not. Little effort is spent on exploratory data analysis to first understand the data and the problem.
[click-3] Construct validity issues as the data is supposed to represent vulnerabilities, but irrelevant data would break this assumption. From the model perspective, as mentioned before, model overfitting to wrong patterns would likely happen when learning from irrelevant data.
A way to increase the number of security issues is to consider more branches in a project, or more projects – which leads to data redundancy.
[click-1] When developers create a new branch for developing/testing features, unresolved vulnerabilities in the main/master branch will be carried into the new branch. When collecting vulnerabilities from all the branches of a project, multiple similar copies of vulnerabilities would be included in the dataset. When vulnerabilities are fixed in the original project, they are usually not fixed in the cloned projects.
Data redundancy can make its way to source code through artifact renaming or making cosmetic/non-functional changes such as commenting or adding space/newlines to source code.
[click-2] Data redundancy may change the true data distribution. For example, redundancy increases the frequency of certain vulnerability types, making models overfitting/biased towards those cases as in the case of data irrelevance challenge.
More seriously, if same samples/security issues exist in both training and testing sets, models will likely predict these cases correctly, artificially increasing the performance - model performance can drop up to 13% after removing these redundant samples.
Vulnerability spanning multiple places is data mis-coverage. Different parts/contexts of vulnerabilities are scattered across multiple functions, files or modules as these parts are inter-dependent (code dependency).
Most of the current studies focus on detecting vulnerabilities in single file or function, which fails to capture such code dependencies.
For example, in the top figure here, this function is marked as vulnerable. The root cause of this vulnerability is due to unsanitized/unchecked value variable that comes from user's input. However, looking at this function alone, it'd be very hard to know the origin/context and the potential malicious nature of this variable.
[click-1] The cause of this challenge is that sometimes fixing commits contain only partial fixes for a vulnerability, which doesn't provide a full picture/context of the vulnerability. From the ML perspective, a finer granularity of prediction would allow researchers to have more data samples to train their models. Similarly for developers, a finer granularity would help reduce inspection effort. However, sometimes, they're not aware of the fact that by doing that, the context of vulnerability is discarded/missing.
[click-2] For example, in the lower figure here, a vulnerable module can lead to double and five times more samples if the prediction is performed on the file and function levels, respectively.
Missing context doesn't only come from missing code from other functions/files/modules, but also from truncating code in the same function/file/module. This truncation practice is becoming more popular given the rise of recent Deep Learning models. These DL models usually require the input to be fixed-sized and of same length. And again to reduce computational resources, any code longer than the predefined size would be truncated, potentially omitting vulnerability context.
[click-3] Missing context would make it more challenging for models to detect complex types of vulnerabilities, as the models don't have complete information to identify these vulnerabilities. This is similar to saying a cylinder looks like a circle from the top, but looks like a rectangle from the side.
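The truncation practice described above can be sketched minimally; the token sequence, `max_len`, and helper name below are hypothetical:

```python
def prepare_input(tokens, max_len=8, pad="<pad>"):
    """Pad or truncate a token sequence to a fixed length, as many
    DL models require; anything past max_len is silently dropped."""
    truncated = tokens[:max_len]
    padded = truncated + [pad] * (max_len - len(truncated))
    return padded, len(tokens) > max_len

code_tokens = "char buf [ 16 ] ; strcpy ( buf , user_input ) ;".split()
fixed, lost_context = prepare_input(code_tokens)
print(fixed, lost_context)  # strcpy's arguments are cut off → context lost
```

Here the unsafe `strcpy` call survives truncation but its arguments do not, so the model never sees the user-controlled input that makes the snippet dangerous.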
Synthetic dataset can increase the size of data but may not be representative of real world vulnerabilities. For example, we found that the ratios of vulnerabilities between synthetic data and real-world data are clearly different.
[click-1] Synthetic data is not incorrect, but real-world vulnerabilities are much more complex, involving many dependencies, and thus harder to spot.
Limited by the knowledge of known vulnerable patterns, there's no guarantee that synthetic data would be able to represent the nature and complexity of real-world and even unknown vulnerabilities.
[click-2] A recent study (Deep Learning based Vulnerability Detection: Are We There Yet?) by Chakraborty and co-workers shows that there's indeed a performance drop of up to 6 times from models trained on synthetic data to those trained on real-world vulnerabilities.
The results clearly demonstrate that high-performing models trained on synthetic data tend to lose their generalisability when transferred to real-world scenarios. Thus, the use of synthetic data to increase the size of vulnerability data still requires further research.
The nature and characteristics of vulnerabilities/security issues also likely change, which we refer to as data drift.
[click-1] Evolving threat landscapes, for example, the introduction of new attack vectors and/or new defence mechanisms. The nature of a project can also evolve over time. For instance, the implementation of a project may shift from one component to another depending on the focus/requirements of the application at the time.
[click-2] Changing textual data would lead to out-of-vocabulary or unseen words for models trained on obsolete data.
If the changing nature of the data is not considered when splitting it, for example when using k-fold cross-validation, a model may be trained on future data to predict past data, which is unrealistic in real-world settings. We showed that such future data leakage can inflate model performance by up to 5 times.
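The point above can be illustrated with a time-based split. This is an illustrative sketch, not the authors' exact setup; the commit dates and labels are made up.

```python
# Illustrative sketch of avoiding future data leakage: train only on commits
# strictly before a cutoff date, test on the rest, mirroring real deployment.
# Random k-fold splits instead mix past and future commits across folds.
from datetime import date

# Hypothetical commits: (commit_date, vulnerable_label)
commits = [
    (date(2019, 1, 1), 0), (date(2019, 6, 1), 1),
    (date(2020, 3, 1), 0), (date(2021, 2, 1), 1),
    (date(2022, 5, 1), 0), (date(2022, 9, 1), 1),
]

# Time-based split: the model never sees data from after the cutoff.
cutoff = date(2021, 1, 1)
train = [c for c in commits if c[0] < cutoff]
test = [c for c in commits if c[0] >= cutoff]

# Every training commit predates every test commit, so no future leakage.
assert max(d for d, _ in train) < min(d for d, _ in test)
```

A random shuffle of the same commits would routinely place 2022 commits in the training set and 2019 commits in the test set, which is exactly the unrealistic setting described above.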
Reproducibility depends on access to the data.
[click-1] Ironically, a paper sometimes provides a link to its dataset, but when we tried to access it, this is what we got (see the top figure).
[click-2] Research done in collaboration with a company is usually subject to the requirements of that company, including agreements not to disclose the collected data.
Sometimes the data are too large to be stored, so the authors only provide a reduced version of the original data, for example, sharing only the IDs of posts instead of their full content. In that case, we need to take an extra step to collect the full content ourselves. However, it's worth noting that this is not always possible, since the values of some data change over time, for example, a developer's experience (which increases over time).
[click-3] Data inaccessibility limits the reproducibility of existing results and is inconvenient when we want to compare a new approach with existing ones.
We also observe that this accessibility issue is more common for closed-source data, making the literature overly focused on open-source data and thus less generalisable.
If the data we need for our research is publicly available in its full form, should we be happy already? Probably yes, but that's not the end of the story: reusing existing datasets can itself be a challenge.
[click-1] The size of a dataset can make it huge to download and store. For example, Stack Overflow data with more than 20 million posts amounts to more than 50 GB, and the history data from hundreds of software projects we curated for extracting vulnerability-fixing commits can be up to terabytes.
Reusing existing data can be a problem if the authors don't properly clean their data before sharing it.
[click-2] There is also a lack of consideration of data challenges in existing datasets. If these issues are not detected and mitigated, they can propagate to current and even future studies, affecting the validity of the findings altogether.
Reusing existing data as is can also lead to obsolete and less generalisable results as software data and threat landscape constantly change over time.
Ethical issues arise with data used in the vulnerability/security domain: data can be misused by bad actors.
A scenario that can occur is that researchers develop a sophisticated model to detect zero-day vulnerabilities and then release such vulnerabilities together with other artifacts (code & models).
Open science – responsible science
[click-1] Vulnerability disclosure is not always a requirement for authors, so open science may be counterproductive.
Researchers may not fully comprehend the impact of detected vulnerabilities, as their focus is on detection rather than exploitation, but attackers always have bad intentions, so a small oversight can become a huge disaster very quickly.
[click-2] This violates the ethics of the community. Attackers can exploit the disclosed vulnerabilities and cause damage to both developers and users of the affected systems. It may also reduce the industry community's trust in the research community, which may make it harder to obtain data later on (blocked access).
From our experiments, we found that many real-world vulnerability datasets suffer from various data quality issues, and these issues can severely impact the performance of state-of-the-art vulnerability prediction models.
For example, data inaccuracy can decrease the performance by up to 80%.
As a follow-up study, we also investigated existing automatic techniques to tackle such noise. We have found some success so far, with up to a 20% increase in performance, but we believe there are still plenty of opportunities for further advancing this new yet important direction.
More focus is placed on the positive class (vulnerable, e.g., bug reports) and almost none on the negative class (non-vulnerable), yet fully supervised learning needs both. Additional label sources are needed, e.g., static analysis to uncover mis-labelled vulnerabilities, and differential analysis to increase the confidence of static analysis labels for vulnerability fixes.
Labelling requirements can also be relaxed: semi-supervised methods, such as self-training or Positive and Unlabelled (PU) learning, can be applied to help uncover knowledge from the unlabelled set.
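A minimal self-training loop might look like the sketch below. This is an illustrative toy, not the authors' method: the Gaussian features, the 0.95 confidence threshold, and the three rounds are all assumptions standing in for real code features and tuned parameters.

```python
# Self-training sketch: a classifier trained on a small labelled set
# pseudo-labels high-confidence unlabelled samples, which are then moved
# into the labelled set, growing the training data over a few rounds.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters standing in for non-vulnerable (0)
# and vulnerable (1) code features.
X_lab = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

clf = LogisticRegression()
for _ in range(3):  # a few self-training rounds
    clf.fit(X_lab, y_lab)
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95      # trust only confident pseudo-labels
    if not confident.any():
        break
    # Move the confidently pseudo-labelled samples into the labelled set.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]

print(f"labelled set grew to {len(y_lab)} samples")
```

The confidence threshold is the main knob: too low and label noise is re-amplified, too high and nothing is ever pseudo-labelled, which matters doubly in this domain where the initial labels are already noisy.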
Timeliness – two aspects; the currency of data means the age of the data in use compared with when it was collected. Delays in data collection efforts reduce currency.
Temporal nature of data accumulation – concept drift. We need to preserve the data order for model training, as data and patterns change over time.
Irregular distribution of SVs (software vulnerabilities) – detection through visualisation leads to better understanding and better SVP models.
Datasets normally cover a single language, and semantic and syntactic differences create difficulties for applying models across languages. Bug seeding – importing bugs from one language into a dataset of a semantically similar language – can help.