Synthetic Data Generation for Statistical Testing
Ghanem Soltana, Mehrdad Sabetzadeh, and Lionel C. Briand
SnT Centre for Security, Reliability and Trust
University of Luxembourg, Luxembourg
ASE 2017
Context and motivation
Context
• Collaboration with
the Government of
Luxembourg
- CTIE: Government’s IT Centre
• New tax system under development: operationalization of
the administrative procedures envisaged by the tax law
• System needs to be reliable
Reliability requirements
• Most data-centric systems, such as administration and financial
systems, are subject to reliability requirements
- Example of a reliability requirement for the tax system: the probability
that a failure occurs in tax return calculation shall not exceed 10^-3
Usage-based statistical testing
• Testing driven by the expected usage of a system
• Mimic system behavior in realistic circumstances
• Aimed at assessing system reliability and prioritizing failures that
are more likely to occur during operation
• Uses operational profiles to characterize current or anticipated
system usage
- For example, an operational profile can be a set of operations
and their associated probabilities of occurrence
Statistical testing for data-centric systems
[Figure: test data is input to the system under test; the observed behavior/results are used to detect and analyze failures, and to answer the question "Does the system meet its reliability requirements?"]
• When possible, it is more practical to use real data
Motivation for generating synthetic data
• Real data might have gaps and structural mismatches (when
compared to the data that will be processed by the system)
• Access to real data might be restricted
• Real data might be non-existent!
Synthetic test data requirements
Test data: real or synthetic
• Test data should be “statistically representative” (realistic circumstances)
- Should be representative of the actual or anticipated system usage
• Test data must be logically and structurally well-formed to detect meaningful failures
- Logically invalid data help reasoning about robustness rather than reliability
• Child care tax deduction example (four instance models; the slide marks a subset of them as logically invalid):
- T1: ResidentTaxPayer (receives_child_allowance = true), C1: Child
- T1: ResidentTaxPayer (receives_child_allowance = false), C1: Child
- T1: ResidentTaxPayer (receives_child_allowance = true)
- T1: ResidentTaxPayer (receives_child_allowance = false)
Objective
Test data for statistical testing of data-centric systems
Valid Representative
How to automatically generate test data
that meet both requirements at the same
time in a scalable manner?
Data generation at a glance
Requirements:
• Validity
• Representativeness
Existing techniques:
• Exhaustive search
- Constraint programming [Cabot et al., JSS 2014]
- Alloy [Sen et al., ICMT 2009]
• Metaheuristic search
- Alternating Variable Method (AVM) [Ali et al., TSE 2013], [Ali et al., ESE 2016]
• Heuristics
- Rule-based [Hartmann et al., SmartGridComm 2014]
- Model-based [Soltana et al., SoSyM 2016]
• Sampling
- Boltzmann's random sampling [Mougenot et al., ECMDA-FA 2009]

Approach
Approach overview
1. Define data schema → data schema (class diagram)
2. Define statistical characteristics → annotated data schema (<<s>>, <<p>>, <<m>>)
3. Define data validity constraints → constraints (explicitly defined, OCL)
4. Generate synthetic data → synthetic data sample (test suite)
The statistical characteristics of the data
[Soltana et al., SoSyM 2016]
• We use a UML profile for defining the statistical characteristics of
the data
• The profile is composed of a set of statistical annotations
(stereotypes)
UML profiles are a standard
mechanism to extend UML models
with additional modeling concepts
Relative frequencies [Soltana et al., SoSyM 2016]
60% of income types are Employment, 20% are Pension, and the remaining 20% are Other
Income (abstract), with subclasses:
- Employment «probabilistic type» {frequency: 0.6}
- Pension «probabilistic type» {frequency: 0.2}
- Other «probabilistic type» {frequency: 0.2}
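As a rough illustration of what the relative-frequency annotations mean for a generator, the sketch below draws income types with the annotated frequencies and reports the frequencies observed in the drawn sample. The function names are hypothetical and the code is not the authors' tool; it only shows weighted sampling under the 0.6 / 0.2 / 0.2 split given above.

```python
import random
from collections import Counter

# Frequencies from the «probabilistic type» annotations on the slide.
INCOME_TYPE_FREQUENCIES = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}


def sample_income_types(frequencies, n, seed=0):
    """Draw n income types, each chosen with its annotated frequency."""
    rng = random.Random(seed)
    types = list(frequencies)
    weights = [frequencies[t] for t in types]
    return rng.choices(types, weights=weights, k=n)


def observed_frequencies(values, categories):
    """Relative frequency of each category among the drawn values."""
    counts = Counter(values)
    return {c: counts[c] / len(values) for c in categories}


if __name__ == "__main__":
    drawn = sample_income_types(INCOME_TYPE_FREQUENCIES, 1000)
    print(observed_frequencies(drawn, INCOME_TYPE_FREQUENCIES))
```

The observed frequencies approach the annotated ones as the sample grows, but any finite sample deviates slightly from them.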
Define data validity constraints
• We use OCL to define data validity constraints
• Some of the constraints are derived from the class diagram and
its annotations
• The remaining constraints are explicitly defined by users
context Physical_Person inv implicit_annotations:
self.birth_year <= 2017 and self.birth_year >= 1917
context Physical_Person inv explicite1:
self.children->forAll(c| c.birth_year > self.birth_year + 16)
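To make the two OCL invariants above concrete, here is a minimal Python sketch that evaluates them on a single object. The `PhysicalPerson` dataclass and the helper names are assumptions introduced for illustration; the actual approach evaluates OCL over UML instance models, not Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalPerson:
    """Hypothetical stand-in for the Physical_Person class of the data schema."""
    birth_year: int
    children: List["PhysicalPerson"] = field(default_factory=list)


def inv_implicit_annotations(p: PhysicalPerson) -> bool:
    # self.birth_year <= 2017 and self.birth_year >= 1917
    return 1917 <= p.birth_year <= 2017


def inv_explicite1(p: PhysicalPerson) -> bool:
    # self.children->forAll(c | c.birth_year > self.birth_year + 16)
    return all(c.birth_year > p.birth_year + 16 for c in p.children)


def violated_invariants(p: PhysicalPerson) -> List[str]:
    """Names of the invariants violated by p itself (children are not checked recursively)."""
    checks = {
        "implicit_annotations": inv_implicit_annotations,
        "explicite1": inv_explicite1,
    }
    return [name for name, check in checks.items() if not check(p)]


if __name__ == "__main__":
    parent = PhysicalPerson(1980, [PhysicalPerson(1990)])
    print(violated_invariants(parent))
```

In this usage example the child was born only 10 years after the parent, so the second invariant is reported as violated.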
Our strategy for generating data for statistical testing
Inputs: all validity constraints (OCL) and the data schema annotated with statistical distributions (<<s>>, <<p>>, <<m>>)
Phase 1: Enforce all OCL validity constraints → valid data sample
Phase 2: Attempt to improve representativeness while maintaining validity → final data sample
Phase 1: generate logically valid test data
1. Create seed sample: uses our heuristic data generator to produce a representative (but invalid) data sample → seed sample with potential logical anomalies
2. Create valid sample: uses a customized search-based OCL solver with all validity constraints (OCL) → valid data sample
• The valid sample is unlikely to end up far from being representative,
which in turn makes it easier to fix deviations from representativeness in Phase 2
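The two steps of Phase 1 can be sketched as follows. This is a toy illustration only: the dictionary-based instance models, the 0.5 child frequency, and the naive random repair loop are assumptions standing in for the heuristic generator and the search-based OCL solver, which the slides do not detail.

```python
import random

MIN_YEAR, MAX_YEAR = 1917, 2017  # bounds from the implicit_annotations invariant


def is_valid(person):
    """Toy stand-in for the OCL validity constraints on Physical_Person."""
    years = [person["birth_year"], *person["children"]]
    in_range = all(MIN_YEAR <= y <= MAX_YEAR for y in years)
    old_enough = all(c > person["birth_year"] + 16 for c in person["children"])
    return in_range and old_enough


def create_seed_sample(n, rng):
    """Step 1: draw attributes independently, ignoring the constraints."""
    sample = []
    for _ in range(n):
        has_child = rng.random() < 0.5  # hypothetical frequency
        children = [rng.randint(MIN_YEAR, MAX_YEAR)] if has_child else []
        sample.append({"birth_year": rng.randint(MIN_YEAR, MAX_YEAR),
                       "children": children})
    return sample


def repair(person, rng, max_steps=10_000):
    """Step 2 (naive random search): mutate one birth year at a time until valid."""
    candidate = {"birth_year": person["birth_year"],
                 "children": list(person["children"])}
    for _ in range(max_steps):
        if is_valid(candidate):
            return candidate
        slot = rng.randrange(1 + len(candidate["children"]))
        year = rng.randint(MIN_YEAR, MAX_YEAR)
        if slot == 0:
            candidate["birth_year"] = year
        else:
            candidate["children"][slot - 1] = year
    raise RuntimeError("no valid instance model found within the step budget")


def create_valid_sample(seed_sample, rng):
    """Repair every instance model; already-valid seeds are returned unchanged."""
    return [repair(p, rng) for p in seed_sample]


if __name__ == "__main__":
    rng = random.Random(0)
    seed = create_seed_sample(100, rng)
    valid = create_valid_sample(seed, rng)
    print(sum(is_valid(p) for p in seed), "valid seeds;",
          sum(is_valid(p) for p in valid), "valid after repair")
```

Because the repair starts from the seed and changes only invalid instance models, the valid sample tends to stay close to the seed's distributions, which is the property Phase 2 relies on.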
Phase 2: generate valid and representative test data
Inputs: valid data sample and all validity constraints (OCL)
For each instance model in the sample:
1. Generate corrective constraints: provides cues, via “soft” OCL constraints, on how an instance model should be tweaked so that the representativeness of the whole data sample is improved
2. Propose tweaked instance model: uses the same search-based OCL solver, with the corrective constraints and all validity constraints → tweaked instance model
Output: final data sample
Example of a corrective constraint
[Figure: current distribution of the sample compared with the desired distribution; categories are marked with - or + according to how they deviate]
An instance model:
T1: ResidentTaxPayer
- id = 1
I1: Income
- num = 1
- type = Employment
Corresponding corrective constraint:
Income.allInstances()->
select(num = 1)->forAll(
(type <> IncomeTypes::Employment)
and
(type = IncomeTypes::Pension or
type = IncomeTypes::Other))
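One plausible way to derive a constraint of this shape is sketched below: if the instance model's current value is over-represented in the sample, forbid it and allow the under-represented alternatives. The rule, the function names, and the "current" frequencies in the usage example are assumptions for illustration; the slides do not spell out the exact derivation.

```python
def corrective_constraint(current, desired, value):
    """Return (forbidden, allowed) categories for one attribute value,
    or None when the value is not over-represented in the sample."""
    if current[value] <= desired[value]:
        return None
    allowed = [c for c in desired if current.get(c, 0.0) < desired[c]]
    if not allowed:
        return None
    return value, allowed


def to_ocl_like(attr, enum, constraint):
    """Render the constraint body in the style of the slide's OCL expression."""
    forbidden, allowed = constraint
    alternatives = " or ".join(f"{attr} = {enum}::{c}" for c in allowed)
    return f"({attr} <> {enum}::{forbidden}) and ({alternatives})"


if __name__ == "__main__":
    # Hypothetical current distribution in which Employment is over-represented.
    current = {"Employment": 0.7, "Pension": 0.15, "Other": 0.15}
    desired = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}
    constraint = corrective_constraint(current, desired, "Employment")
    print(to_ocl_like("type", "IncomeTypes", constraint))
```

With these inputs the rendered body matches the `forAll` body of the corrective constraint shown on the slide.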
Quantifying and improving representativeness
• We decide whether a (valid) tweaked instance model should replace the
original one in the sample using the Euclidean distance (ED), by comparing:
- Desired distribution vs. current distribution of the sample
- Desired distribution vs. distribution of the sample if the tweaked
instance model replaces the original one
• The tweaked instance model replaces the original when the second comparison
has a better alignment (thus a lower ED)
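The acceptance rule above can be sketched in a few lines. For brevity, each instance model is reduced here to a single categorical attribute (its income type); that simplification and the function names are assumptions, not part of the original approach.

```python
import math
from collections import Counter


def distribution(sample, categories):
    """Relative frequency of each category in the sample, in a fixed order."""
    counts = Counter(sample)
    return [counts[c] / len(sample) for c in categories]


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def maybe_replace(sample, index, tweaked_value, desired, categories):
    """Return the sample with sample[index] replaced by tweaked_value,
    but only if that lowers the ED to the desired distribution."""
    target = [desired[c] for c in categories]
    current_ed = euclidean_distance(distribution(sample, categories), target)
    candidate = sample[:index] + [tweaked_value] + sample[index + 1:]
    candidate_ed = euclidean_distance(distribution(candidate, categories), target)
    return candidate if candidate_ed < current_ed else sample


if __name__ == "__main__":
    categories = ["Employment", "Pension", "Other"]
    desired = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}
    sample = ["Employment"] * 8 + ["Pension", "Other"]
    print(maybe_replace(sample, 0, "Pension", desired, categories))
```

In the usage example, turning one Employment income into a Pension moves the sample closer to the desired 0.6 / 0.2 / 0.2 split, so the replacement is accepted.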
Case study and evaluation
Case study
• Statistical testing of reliability requirements of a re-designed tax
management system in Luxembourg
• Using actual data is not an option:
- Actual data is sensitive and of a personal nature; sharing the
data with third parties poses complications
- There are structural mismatches between the data schema used
by the system under test and the actual historical data
Case study inputs
• Data schema (UML class diagram)
    Element        Class   Enum.   Association   Generalization   Attribute
    # elements     64      17      53            43               344
• Statistical characteristics of the data items (statistics source: STATEC)
    Annotation     Histogram   Cond. dist.   Other
    #              15          7             13
  - Histogram examples: age, household size, residence status, …
  - Cond. dist. examples: income for pensioners, foreign income types, …
  - Other examples: uniform distribution for the day of the year on which individuals are born, …
• 68 OCL validity constraints
    Construct      Let   OCL op.   Nested if-then-else   Quantifier   Logical op.
    #              64    17        10                    7            212
Research questions
• RQ1: Does our synthetic data generator run in practical time?
• RQ2: Can our approach generate data samples that are both valid
and statistically representative?
RQ1: execution time of the data generator
• We report the average execution time (over 5 runs) of our data generator for
building data samples of different sizes, ranging from 100 to 1000 instance models
• Our data generator could produce data samples with up to 1000 instance
models (representing tax cases) in less than 10 hours
- Execution time for generating initial (invalid) solutions is negligible
- Constraint solving accounts on average for 85% of the execution time
- Generating corrective constraints accounts on average for 14% of the
execution time
RQ1: execution time of the data generator
This execution time is practical in our context since:
• Data generation can be performed overnight.
• For more complex systems, parallelization of search during constraint
solving can be considered.
• Test data generation can be initiated well in advance of the testing
phase, and as soon as the data profile for the system under test has
stabilized.
RQ2: quality of the generated data
• Average distances (over 5 runs) between the statistical distributions in a given
sample and the corresponding desired distributions
• Three distance metrics: Euclidean, Manhattan and Canberra metrics
Data samples created by our data generator are valid and, at the same time,
surpass the state of the art in terms of representativeness
[Figure: Euclidean distance (y-axis, 0 to 0.4) versus number of instance models in the sample (x-axis, 100 to 1000), plotted for the (invalid) seed sample and for the final valid sample]
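For reference, the three distance metrics named for RQ2 can be computed as in the sketch below, using their standard textbook definitions. The example distributions are hypothetical and are not taken from the case study results.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def canberra(p, q):
    """Sum of |a - b| / (|a| + |b|); terms where both values are 0 are skipped."""
    total = 0.0
    for a, b in zip(p, q):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


if __name__ == "__main__":
    observed = [0.7, 0.15, 0.15]  # hypothetical sample distribution
    desired = [0.6, 0.2, 0.2]     # frequencies from the income type example
    for metric in (euclidean, manhattan, canberra):
        print(metric.__name__, round(metric(observed, desired), 4))
```

Canberra weights each difference by the magnitude of the values involved, so it is more sensitive than the other two metrics to deviations in low-frequency categories.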
Summary
• A model-based data generator for supporting statistical testing of
data-centric systems
• The data generator produces high-quality test data in practical
time:
- Produced test data samples are representative of the
actual or anticipated system usage
- Produced test data samples are structurally and logically valid
- The generator is able to produce up to 1000 instance models
(here, tax cases) in less than 10 hours
Ghanem Soltana
ghanem.soltana@uni.lu
Tool available at http://people.svv.lu/tools/SDG/

More Related Content

What's hot

Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUnity Technologies
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun YooJaeJun Yoo
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Dataconomy Media
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networksDing Li
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and ApplicationsEmanuele Ghelfi
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Venkata Reddy Konasani
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNNNoura Hussein
 
GANs and Applications
GANs and ApplicationsGANs and Applications
GANs and ApplicationsHoang Nguyen
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...Simplilearn
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Predicting house price
Predicting house pricePredicting house price
Predicting house priceDivya Tiwari
 
Deep learning for real life applications
Deep learning for real life applicationsDeep learning for real life applications
Deep learning for real life applicationsAnas Arram, Ph.D
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...Neo4j
 

What's hot (20)

Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model training
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and Applications
 
Generative models
Generative modelsGenerative models
Generative models
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
GANs and Applications
GANs and ApplicationsGANs and Applications
GANs and Applications
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Federated Learning
Federated LearningFederated Learning
Federated Learning
 
Predicting house price
Predicting house pricePredicting house price
Predicting house price
 
Deep learning for real life applications
Deep learning for real life applicationsDeep learning for real life applications
Deep learning for real life applications
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Deep learning
Deep learningDeep learning
Deep learning
 
How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...
 

Similar to Synthetic Data Generation for Statistical Testing

A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...Gurdal Ertek
 
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004Alan Walker
 
Chap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptChap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptRosaHildaFlix
 
Test Data Management: The Underestimated Pain
Test Data Management: The Underestimated PainTest Data Management: The Underestimated Pain
Test Data Management: The Underestimated PainChelsea Frischknecht
 
Data quality in decision making - Dr. Philip Woodall, University of Cambridge
Data quality in decision making - Dr. Philip Woodall, University of CambridgeData quality in decision making - Dr. Philip Woodall, University of Cambridge
Data quality in decision making - Dr. Philip Woodall, University of CambridgeBCS Data Management Specialist Group
 
Introduction to System, Simulation and Model
Introduction to System, Simulation and ModelIntroduction to System, Simulation and Model
Introduction to System, Simulation and ModelMd. Hasan Imam Bijoy
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientistMatthew Evans
 
Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...Institute of Contemporary Sciences
 
Discreate Event Simulation_PPT1-R0.ppt
Discreate Event Simulation_PPT1-R0.pptDiscreate Event Simulation_PPT1-R0.ppt
Discreate Event Simulation_PPT1-R0.pptdiklatMSU
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...Boris Glavic
 
Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08 Jeet Das
 
Discrete event simulation
Discrete event simulationDiscrete event simulation
Discrete event simulationssusera970cc
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithmsFarhan Zaki
 
G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsIstituto nazionale di statistica
 
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...ESEM 2014
 

Similar to Synthetic Data Generation for Statistical Testing (20)

A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
 
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
 
Chap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptChap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.ppt
 
Test Data Management: The Underestimated Pain
Test Data Management: The Underestimated PainTest Data Management: The Underestimated Pain
Test Data Management: The Underestimated Pain
 
Manufacturing Data Analytics
Manufacturing Data AnalyticsManufacturing Data Analytics
Manufacturing Data Analytics
 
Data quality in decision making - Dr. Philip Woodall, University of Cambridge
Data quality in decision making - Dr. Philip Woodall, University of CambridgeData quality in decision making - Dr. Philip Woodall, University of Cambridge
Data quality in decision making - Dr. Philip Woodall, University of Cambridge
 
Analyzing Performance Test Data
Analyzing Performance Test DataAnalyzing Performance Test Data
Analyzing Performance Test Data
 
Introduction to System, Simulation and Model
Introduction to System, Simulation and ModelIntroduction to System, Simulation and Model
Introduction to System, Simulation and Model
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientist
 
algo 1.ppt
algo 1.pptalgo 1.ppt
algo 1.ppt
 
Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...
 
Discreate Event Simulation_PPT1-R0.ppt
Discreate Event Simulation_PPT1-R0.pptDiscreate Event Simulation_PPT1-R0.ppt
Discreate Event Simulation_PPT1-R0.ppt
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
 
Simulation
SimulationSimulation
Simulation
 
Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08
 
Discrete event simulation
Discrete event simulationDiscrete event simulation
Discrete event simulation
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithms
 
G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statistics
 
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
 

More from Lionel Briand

Metamorphic Testing for Web System Security
Metamorphic Testing for Web System SecurityMetamorphic Testing for Web System Security
Metamorphic Testing for Web System SecurityLionel Briand
 
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...Lionel Briand
 
Fuzzing for CPS Mutation Testing
Fuzzing for CPS Mutation TestingFuzzing for CPS Mutation Testing
Fuzzing for CPS Mutation TestingLionel Briand
 
Data-driven Mutation Analysis for Cyber-Physical Systems
Data-driven Mutation Analysis for Cyber-Physical SystemsData-driven Mutation Analysis for Cyber-Physical Systems
Data-driven Mutation Analysis for Cyber-Physical SystemsLionel Briand
 
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled SystemsMany-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled SystemsLionel Briand
 
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...Lionel Briand
 
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...Lionel Briand
 
PRINS: Scalable Model Inference for Component-based System Logs
PRINS: Scalable Model Inference for Component-based System LogsPRINS: Scalable Model Inference for Component-based System Logs
PRINS: Scalable Model Inference for Component-based System LogsLionel Briand
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingLionel Briand
 
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Lionel Briand
 
Autonomous Systems: How to Address the Dilemma between Autonomy and Safety
Autonomous Systems: How to Address the Dilemma between Autonomy and SafetyAutonomous Systems: How to Address the Dilemma between Autonomy and Safety
Autonomous Systems: How to Address the Dilemma between Autonomy and SafetyLionel Briand
 
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...Lionel Briand
 
Reinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case PrioritizationReinforcement Learning for Test Case Prioritization
Reinforcement Learning for Test Case PrioritizationLionel Briand
 
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...Lionel Briand
 
On Systematically Building a Controlled Natural Language for Functional Requi...
On Systematically Building a Controlled Natural Language for Functional Requi...On Systematically Building a Controlled Natural Language for Functional Requi...
On Systematically Building a Controlled Natural Language for Functional Requi...Lionel Briand
 
Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...
Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...
Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...Lionel Briand
 
Guidelines for Assessing the Accuracy of Log Message Template Identification ...
Guidelines for Assessing the Accuracy of Log Message Template Identification ...Guidelines for Assessing the Accuracy of Log Message Template Identification ...
Guidelines for Assessing the Accuracy of Log Message Template Identification ...Lionel Briand
 
A Theoretical Framework for Understanding the Relationship between Log Parsin...
A Theoretical Framework for Understanding the Relationship between Log Parsin...A Theoretical Framework for Understanding the Relationship between Log Parsin...
A Theoretical Framework for Understanding the Relationship between Log Parsin...Lionel Briand
 
Requirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsRequirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsLionel Briand
 
Practical Constraint Solving for Generating System Test Data
Practical Constraint Solving for Generating System Test DataPractical Constraint Solving for Generating System Test Data
Practical Constraint Solving for Generating System Test DataLionel Briand
 

Synthetic Data Generation for Statistical Testing

  • 1. Synthetic Data Generation for Statistical Testing. Ghanem Soltana, Mehrdad Sabetzadeh, and Lionel C. Briand. SVV.lu (software verification & validation), SnT Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg. ASE 2017
  • 3. Context • Collaboration with the Government of Luxembourg - CTIE: Government’s IT Centre • New tax system under development: operationalization of the administrative procedures envisaged by the tax law • System needs to be reliable
  • 4. Reliability requirements • Most data-centric systems, such as administration and financial systems, are subject to reliability requirements - Example of a reliability requirement for the tax system: the probability that a failure occurs in tax return calculation shall not exceed 10^-3
  • 5. Usage-based statistical testing • Testing driven by the expected usage of a system • Mimics system behavior in realistic circumstances • Aimed at assessing system reliability and prioritizing failures that are more likely to occur during operation • Uses operational profiles to characterize current or anticipated system usage - For example, an operational profile can be a set of operations and their associated probabilities of occurrence
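As a rough illustration of the operational-profile idea on this slide, the sketch below draws operations according to their probabilities of occurrence. The operation names and probabilities are hypothetical placeholders, not taken from the talk, and the weighted random draw is only one simple way to follow a profile.

```python
import random
from collections import Counter

# Hypothetical operational profile: operation name -> probability of occurrence.
# These names and numbers are illustrative assumptions, not data from the talk.
OPERATIONAL_PROFILE = {
    "file_tax_return": 0.70,
    "amend_tax_return": 0.20,
    "request_deduction": 0.10,
}


def draw_operations(profile, n, seed=0):
    """Draw n operations so that each appears with roughly its profile probability."""
    rng = random.Random(seed)
    names = list(profile)
    weights = [profile[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


operations = draw_operations(OPERATIONAL_PROFILE, n=10_000)
# Observed relative frequency of each operation in the drawn usage scenario.
observed = {name: count / len(operations) for name, count in Counter(operations).items()}
```

With a large enough draw, the observed frequencies should sit close to the profile, which is what makes the resulting test runs usage-based.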
  • 6. Statistical testing for data-centric systems • [Diagram: test data is the input to the system under test; the observed behavior/results are used to detect and analyze failures and to answer the question "Does the system meet its reliability requirements?"] • When possible, it is more practical to use real data
  • 7. Motivation for generating synthetic data • Real data might have gaps and structural mismatches (when compared to the data that will be processed by the system) • Access to real data might be restricted • Real data might be non-existent!
  • 8. Synthetic test data requirements (test data: real vs. synthetic) • Child care tax deduction example • Test data should be “statistically representative” (realistic circumstances): it should be representative of the actual or anticipated system usage
  • 9. Synthetic test data requirements (test data: real vs. synthetic) • Test data must be logically and structurally well-formed to detect meaningful failures • Child care tax deduction example: [Figure: four instance models of T1: ResidentTaxPayer, with receives_child_allowance = true or false and with or without a child C1: Child; a brace marks some of them as logically invalid] • Such (logically invalid) data help reasoning about robustness rather than reliability
  • 10. Objective • Test data for statistical testing of data-centric systems must be valid and representative • How to automatically generate test data that meet both requirements at the same time in a scalable manner?
  • 11. Data generation at a glance • Validity: exhaustive search (constraint programming [Cabot et al., JSS 2014]; Alloy [Sen et al., ICMT 2019]) and metaheuristic search (Alternating Variable Method (AVM) [Ali et al., TSE 2013] [Ali et al., ESE 2016]) • Representativeness: heuristics (rule-based [Hartmann et al., SmartGridComm 2014]; model-based [Soltana et al., SoSyM 2016]) and sampling (Boltzmann's random sampling [Mougenot et al., ECMDA-FA 2009])
  • 13. Approach overview • 1. Define data schema, yielding the data schema (class diagram) • 2. Define statistical characteristics, yielding the annotated data schema (stereotyped elements) • 3. Define data validity constraints, yielding explicitly defined OCL constraints • 4. Generate synthetic data, yielding the synthetic data sample (test suite)
  • 15. The statistical characteristics of the data [Soltana et al., SoSyM 2016] • We use a UML profile for defining the statistical characteristics of the data • The profile is composed of a set of statistical annotations (stereotypes) • UML profiles are a standard mechanism to extend UML models with additional modeling concepts
  • 16. Relative frequencies [Soltana et al., SoSyM 2016] • 60% of income types are Employment, 20% are Pension, and the remaining 20% are Other • Income (abstract) has three subclasses: Employment «probabilistic type» {frequency: 0.6}, Pension «probabilistic type» {frequency: 0.2}, Other «probabilistic type» {frequency: 0.2}
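To make the «probabilistic type» annotation concrete, the following sketch turns the three frequencies from the slide into per-type counts for a sample of a given size. The function name and the largest-remainder rounding rule are assumptions made for illustration; the talk does not say how the authors' generator allocates types.

```python
import math

# Relative frequencies taken from the slide's «probabilistic type» annotations.
INCOME_TYPE_FREQUENCIES = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}


def allocate_counts(frequencies, sample_size):
    """Split sample_size across types proportionally (largest-remainder rounding).

    Hypothetical helper: rounds each share down, then hands the leftover
    instances to the types with the largest fractional remainders.
    """
    exact = {name: freq * sample_size for name, freq in frequencies.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    remaining = sample_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:remaining]:
        counts[name] += 1
    return counts


counts_for_ten = allocate_counts(INCOME_TYPE_FREQUENCIES, 10)
```

For a sample of 10 incomes this yields 6 Employment, 2 Pension, and 2 Other, matching the annotated frequencies exactly.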
  • 18. Define data validity constraints • We use OCL to define data validity constraints • Some of the constraints are derived from the class diagram and its annotations, for example: context Physical_Person inv implicit_annotations: self.birth_year <= 2017 and self.birth_year >= 1917 • The remaining constraints are explicitly defined by users, for example: context Physical_Person inv explicite1: self.children->forAll(c | c.birth_year > self.birth_year + 16)
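The two OCL invariants on this slide can be read as plain predicates over an instance. The sketch below restates them in Python to show what "valid" means for a single person; the `PhysicalPerson` class and the function names are assumptions made for illustration, and the authors' tool evaluates the OCL directly rather than using code like this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalPerson:
    """Hypothetical stand-in for the Physical_Person class of the data schema."""
    birth_year: int
    children: List["PhysicalPerson"] = field(default_factory=list)


def satisfies_implicit_annotations(person):
    # inv implicit_annotations: birth_year <= 2017 and birth_year >= 1917
    return 1917 <= person.birth_year <= 2017


def satisfies_explicite1(person):
    # inv explicite1: every child is born more than 16 years after the parent
    return all(child.birth_year > person.birth_year + 16 for child in person.children)


def is_valid(person):
    """A person is valid only if both invariants hold."""
    return satisfies_implicit_annotations(person) and satisfies_explicite1(person)


valid_parent = PhysicalPerson(1980, [PhysicalPerson(2000)])
invalid_parent = PhysicalPerson(1980, [PhysicalPerson(1990)])  # child born only 10 years later
```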
  • 20. Our strategy for generating data for statistical testing • Inputs: the data schema annotated with statistical distributions and all OCL validity constraints • Phase 1: enforce all OCL validity constraints, yielding a valid data sample • Phase 2: attempt to improve representativeness while maintaining validity, yielding the final data sample
  • 21. Phase 1: generate logically valid test data • Create seed sample: uses our heuristic data generator to produce a representative (but possibly invalid) data sample, i.e., a seed sample with potential logical anomalies • Create valid sample: uses a customized search-based OCL solver, with all OCL validity constraints as input, to turn the seed sample into a valid data sample • The valid sample might not end up too far away from being representative, in turn making it easier to fix deviations from representativeness in phase 2
  • 22. Phase 2: generate valid and representative test data • For each instance model in the valid data sample: generate corrective constraints, then propose a tweaked instance model • Generate corrective constraints: provides cues, via “soft” OCL constraints, on how an instance model should be tweaked so that the representativeness of the whole data sample is improved • Propose tweaked instance model: uses the same search-based OCL solver, with all validity constraints and the corrective constraints as input • Output: final data sample
  • 23. Example of a corrective constraint • An instance model: T1: ResidentTaxPayer (id = 1) with I1: Income (num = 1, type = Employment) • [Figure: current distribution of the sample compared with the desired distribution, with categories marked + or -] • Corresponding corrective constraint: Income.allInstances()->select(num = 1)->forAll((type <> IncomeTypes::Employment) and (type = IncomeTypes::Pension or type = IncomeTypes::Other))
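One plausible reading of this example is that a corrective constraint steers an instance away from an over-represented value and toward under-represented ones. The sketch below builds the OCL text from the slide under that reading; the function, its parameters, and the sample distribution values are hypothetical, and the slides do not describe how the authors actually derive corrective constraints.

```python
def corrective_constraint(current, desired, income_num, income_type):
    """Return an OCL soft constraint nudging one Income away from an over-represented type.

    Hypothetical helper: `current` and `desired` map income types to relative
    frequencies. Returns None when the income's type is not over-represented
    or when no type is under-represented.
    """
    over = [t for t in desired if current.get(t, 0.0) > desired[t]]
    under = [t for t in desired if current.get(t, 0.0) < desired[t]]
    if income_type not in over or not under:
        return None
    alternatives = " or ".join(f"type = IncomeTypes::{t}" for t in under)
    return (
        f"Income.allInstances()->select(num = {income_num})->forAll("
        f"(type <> IncomeTypes::{income_type}) and ({alternatives}))"
    )


# Assumed distributions: Employment is over-represented in the current sample.
desired = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}
current = {"Employment": 0.8, "Pension": 0.1, "Other": 0.1}
constraint = corrective_constraint(current, desired, income_num=1, income_type="Employment")
```

Because the constraint is "soft", a solver may fail to satisfy it without invalidating the instance model; only the validity constraints are mandatory.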
  • 24. Quantifying and improving representativeness • We decide whether a (valid) tweaked instance model should replace the original one in the sample using the Euclidean distance (ED) • Compare (a) the ED between the desired distribution and the current distribution of the sample with (b) the ED between the desired distribution and the distribution of the sample if the tweaked instance model replaces the original one • The tweaked instance model replaces the original when (b) shows a better alignment (thus a lower ED)
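The replacement decision on this slide can be sketched as follows. For brevity each instance model is reduced to a single categorical value (its income type), which is a simplifying assumption; the function names are hypothetical and the real generator works on full instance models and several distributions.

```python
import math
from collections import Counter


def distribution(sample):
    """Relative frequency of each value in the sample."""
    counts = Counter(sample)
    return {value: count / len(sample) for value, count in counts.items()}


def euclidean_distance(current, desired):
    """Euclidean distance between two distributions over the same categories."""
    keys = set(current) | set(desired)
    return math.sqrt(sum((current.get(k, 0.0) - desired.get(k, 0.0)) ** 2 for k in keys))


def accept_tweak(sample, index, tweaked_value, desired):
    """Keep the tweaked instance only if it lowers the distance to the desired distribution."""
    candidate = sample[:index] + [tweaked_value] + sample[index + 1:]
    before = euclidean_distance(distribution(sample), desired)
    after = euclidean_distance(distribution(candidate), desired)
    return (candidate, after) if after < before else (sample, before)


desired = {"Employment": 0.6, "Pension": 0.2, "Other": 0.2}
sample = ["Employment"] * 8 + ["Pension", "Other"]
improved, improved_distance = accept_tweak(sample, 0, "Pension", desired)
rejected, rejected_distance = accept_tweak(sample, 8, "Employment", desired)
```

In this toy run the first tweak is accepted because it moves the sample toward 60/20/20, while the second is rejected because it would add yet another Employment income.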
  • 26. Case study • Statistical testing of reliability requirements of a re-designed tax management system in Luxembourg • Using actual data is not an option: - Actual data is sensitive and of a personal nature; sharing the data with third parties poses complications - There are structural mismatches between the data schema used by the system under test and the actual historical data
  • 27. Case study inputs • Data schema (UML class diagram), number of elements: Class 64; Enumeration 17; Association 53; Generalization 43; Attribute 344 • Statistical characteristics of the data items (source: STATEC): Histogram 15 (age, household size, residence status, …); Conditional distribution 7 (income for pensioners, foreign income types, …); Other 13 (uniform distribution for the day of the year on which individuals are born, …) • 68 OCL validity constraints, number of constructs: Let 64; OCL operation 17; Nested if-then-else 10; Quantifier 7; Logical operator 212
  • 28. Research questions • RQ1: Does our synthetic data generator run in practical time? • RQ2: Can our approach generate data samples that are both valid and statistically representative?
  • 29. RQ1: execution time of the data generator • We report the average execution time (over 5 runs) of our data generator for building data samples of different sizes, ranging from 100 to 1000 instance models • Our data generator could produce data samples with up to 1000 instance models (representing tax cases) in less than 10 hours - Execution time for generating initial (invalid) solutions is negligible - Constraint solving accounts on average for 85% of the execution time - Generating corrective constraints accounts on average for 14% of the execution time
  • 30. RQ1: execution time of the data generator • This execution time is practical in our context since: • Data generation can be performed overnight • For more complex systems, parallelization of search during constraint solving can be considered • Test data generation can be initiated well in advance of the testing phase, and as soon as the data profile for the system under test has stabilized
  • 32. RQ2: quality of the generated data • Average (over 5 runs) distances between the statistical distributions in a given sample and the corresponding desired distributions • Three distance metrics: Euclidean, Manhattan, and Canberra • Data samples created by our data generator are valid and, at the same time, surpass the state of the art in terms of representativeness • [Figure: Euclidean distance (y-axis, 0 to 0.4) versus number of instance models in the sample (x-axis, 100 to 1000), comparing the distance for the (invalid) seed sample with the distance for the final valid sample]
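For reference, the three distance metrics named on this slide can be computed as below for two distributions given as equal-length lists of relative frequencies. This is a minimal sketch using the standard definitions; the example vectors are made up, and skipping zero-denominator terms in the Canberra distance is a common convention assumed here rather than something stated in the talk.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def canberra(p, q):
    """Sum of |a - b| / (|a| + |b|), skipping terms where both values are zero."""
    total = 0.0
    for a, b in zip(p, q):
        denominator = abs(a) + abs(b)
        if denominator > 0:
            total += abs(a - b) / denominator
    return total


# Illustrative (made-up) distributions over Employment, Pension, Other.
observed = [0.8, 0.1, 0.1]
desired = [0.6, 0.2, 0.2]
distances = {
    "euclidean": euclidean(observed, desired),
    "manhattan": manhattan(observed, desired),
    "canberra": canberra(observed, desired),
}
```

The Canberra distance weights each difference by the magnitude of the values involved, so it is more sensitive than the other two to deviations in rare categories.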
  • 33. Summary • A model-based data generator for supporting statistical testing of data-centric systems • The data generator produces high-quality test data in practical time: - Produced test data samples are representative of the actual or anticipated system usage - Produced test data samples are structurally and logically valid - The generator is able to produce up to 1000 instance models (here, tax cases) in less than 10 hours
  • 34. Synthetic Data Generation for Statistical Testing. Ghanem Soltana, ghanem.soltana@uni.lu. Tool available at http://people.svv.lu/tools/SDG/. SnT Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg. ASE 2017