PhD Defense Slides - Automatic Non-functional Testing and Tuning of Configurable Generators

Automatic Non-functional Testing and Tuning of Configurable Generators
Mohamed BOUSSAA
PhD Defense
September 6, 2017
University of Rennes 1/INRIA Rennes
Jury:
Hélène WAESELYNCK, LAAS-CNRS Toulouse (Reviewer)
Philippe MERLE, INRIA Lille (Reviewer)
Erven ROHOU, INRIA Rennes (Examiner)
Franck FLEUREY, SINTEF Oslo (Examiner)
Jean-Marie MOTTU, Université de Nantes (Examiner)
Gerson SUNYÉ, Université de Nantes (Examiner)
Benoit BAUDRY, INRIA Rennes (Advisor)
Olivier BARAIS, Université de Rennes 1 (Advisor)

Context
Software Innovation
Software Diversity
-New programming languages
-Software platforms
-Execution environments
Hardware Innovation
Hardware Heterogeneity
-New CPU architecture, ISAs
-2x faster, smaller
-Cheaper, more capable, etc.
-Arduino UNO: ATmega328 8-bit 16MHz
-Intel Edison: Atom 2 500MHz 1GB
-Espressif ESP8266: 32-bit, 80MHz, WiFi
-Raspberry Pi (B+/2): ARM v7 – 700MHz ->

Building software in a heterogeneous environment is complex

4
Generative software development
Code
generation
DSLs
GPLs
GUIs Models
SPECs
Design
Runtime
-  Develop the code easily and rapidly
-  Handle the heterogeneity of target software/hardware platforms by automatically generating code

2
1
Automatic code generation
5
Machine Code
Generators
GPLs
High-level program specification
GPLs
Templates
Configurations
Files Flags
a highly configurable process

2
1
6
Machine Code
GPLs
HAXE programs
GPLs
GPLs
Variants
Target Platform

2
1
7
Machine Code
C Code
Variants
Optimisation flags
(e.g., CFLAGS)
HAXE programs
GPLsGPLs
Variants
-Os
-ftree-vectorize
-Og
-O3
-Ofast
-O2

Generated code must be effectively tested

Tests pass: no bugs in generators
"I will never write code again!"

Tests fail: bugs must be fixed!
"I will never use generators again!"

Tests pass, but what about the non-functional properties (quality) of generated code?
"It is too slow!"
"I am running out of memory"

12
Non-functional testing and tuning of generators
Generator experts (build and maintain):
-  I need to evaluate the quality of generated code
-  How can I automatically detect the non-functional issues?
Generator users (use and configure):
-  I need to produce more efficient code. Turn on optimisations?
-  Which configuration should I select?

13
Challenges
Generator experts (build and maintain):
-  I need to automatically evaluate the quality of generated code
-  How can I detect non-functional issues?
Generator users (use and configure):
-  I need to produce more efficient code. Turn on optimisations?
-  Which configuration should I select?
Non-functional testing of generators
•  Oracle problem
Auto-tuning configurable generators
•  Huge configuration space
Collecting the non-functional metrics
•  Handling the diversity of software and hardware platforms

Related work
¤  Testing generators:
•  Functional testing: executable models, differential testing [Conrad et al. ’10, Stuermer et al. ’07]
Do not address the NF properties
¤  Auto-tuning generators:
•  Auto-tuning: a mono-objective optimization [Bashkansky et al. ’07, Stephenson et al. ’03]
•  Phase ordering problem [Kulkarni et al. ’06, Cooper et al. ’99]
•  Predicting optimizations: a machine learning optimization [Fursin et al. ’11]
•  Conflicting objectives: a multi-objective optimization [Hoste et al. ’08, Martinez et al. ’14]
Do not exploit recent advances in SBSE (e.g., diversity-based exploration)

Thesis statement
Generator experts: it is crucial to verify the correct (non-functional) behavior of code generators in order to preserve software quality and reliability
Generator users: generator configurations must be properly selected to satisfy the user (non-functional) requirements
Do this automatically, efficiently, and without the use of a priori knowledge of the generator technology and its configurations

Contribution I: An automatic non-functional testing approach
-  Generator experts (build/maintain): identify code generator issues
Contribution II: An auto-tuning approach
-  Generator users (use/configure): select the best configuration
Contribution III: A lightweight environment for monitoring and testing the generated code

Contribution I:
Automatic non-functional testing of code
generators
https://testingcodegenerators.wordpress.com/
17

2
1
Non-functional testing of source code generators
18
Machine Code
GPLs
High-level
GPLs
Templates
Configurations
Files Flags
Generators

Challenges
The oracle problem when automating the non-functional tests:
-  No clear definition of how the oracle should be defined
Diagram labels: test cases, code generator, execution speed, resource usage

Code generator families
Definition (Code generator family):
A set of code generators that take the same language/model as input and generate code for different target software platforms
Compare the non-functional behavior of programs generated from the same code generator family

Leveraging metamorphic testing to automatically detect inconsistencies
Metamorphic testing [1] (MT):
-  Oracles can be derived from properties of the system under test
-  Exploit the relation between the inputs and outputs of special test cases of the system under test to derive metamorphic relations (MRs) defined as test oracles for new test cases
Diagram: original test cases and their outputs are used to derive a metamorphic relation, which is then verified on new test cases and their outputs
[1] Chen et al., Metamorphic testing: a new approach for generating next test cases, University of Science and Technology, Hong Kong, 1998.

Metamorphic Relation (MR)
This MR is equivalent to saying:
If a set of functionally equivalent programs (P1(ti), P2(ti), ..., Pn(ti)) is generated using the same code generator family and run with the same input test suite ti, then the variation of their non-functional outputs (e.g., execution time, memory usage) should not exceed a specific threshold value T.
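The relation above can be sketched in a few lines of Python. This is only an illustration: the variation is taken here to be the range (max minus min) of the measured outputs, the function names are hypothetical, and the numbers in the usage note are made up rather than taken from the experiments.

```python
def variation(outputs):
    """Variation of the non-functional outputs f(P1(ti)), ..., f(Pn(ti)).

    Assumption: the variation is measured as the range (max - min).
    """
    return max(outputs) - min(outputs)


def mr_holds(outputs, threshold):
    """The MR holds when the variation does not exceed the threshold T."""
    return variation(outputs) <= threshold


def check_test_suites(measurements, threshold):
    """Return the names of test suites whose outputs violate the MR.

    `measurements` maps a test suite name to the list of outputs measured
    for each generated program of the family (hypothetical data layout).
    """
    return [name for name, outputs in measurements.items()
            if not mr_holds(outputs, threshold)]
```

For example, `check_test_suites({"TS1": [1.0, 1.2, 1.1], "TS2": [1.0, 1.3, 9.0]}, threshold=0.5)` returns `["TS2"]`, since only the second suite shows a variation above the threshold.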

Statistical methods
o  We propose two variation analysis approaches to define the threshold value T
PCA (Principal Component Analysis)
Ø  A multivariate statistical approach
Ø  Reduces the dimensionality of the original data to two dimensions (PC1 & PC2)
Ø  A score distance SD measures how far an observation lies from the rest of the data within the PCA subspace
Ø  Observations whose SD exceeds the cutoff value, based on the 97.5% quantile Q of the chi-square distribution, are detected as outliers
R-Chart (Range Chart)
Ø  Evaluates the variation as a range R (Max - Min)
Ø  Control limits (LCL and UCL) represent the limits of variation that should be expected from a process

LCL < R < UCL T <
23
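A minimal sketch of the R-chart check follows, assuming each subgroup holds the outputs measured for one test suite across the generated programs. The control limits use the textbook formulas LCL = D3 × R̄ and UCL = D4 × R̄ with constants from standard statistical process control tables; the slides do not state which constants were used, so treat those values and the function names as assumptions.

```python
# Control chart constants for subgroup sizes 2..6 (standard SPC tables).
D3 = {2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}
D4 = {2: 3.267, 3: 2.574, 4: 2.282, 5: 2.114, 6: 2.004}


def r_chart_limits(subgroups):
    """Return (LCL, mean range, UCL) for equally sized subgroups."""
    n = len(subgroups[0])
    ranges = [max(group) - min(group) for group in subgroups]
    r_bar = sum(ranges) / len(ranges)
    return D3[n] * r_bar, r_bar, D4[n] * r_bar


def out_of_control(subgroups):
    """Indices of subgroups whose range R falls outside [LCL, UCL]."""
    lcl, _, ucl = r_chart_limits(subgroups)
    flagged = []
    for index, group in enumerate(subgroups):
        r = max(group) - min(group)
        if r < lcl or r > ucl:
            flagged.append(index)
    return flagged
```

With four hypothetical test suites measured on five targets, `out_of_control([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [1, 2, 3, 4, 5], [1, 1, 1, 1, 50]])` flags only the last one, whose range of 49 exceeds the upper control limit.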

Metamorphic testing process
1. Input program + test cases: I(ti), given to the code generator family
2. Generated programs + test cases: P1(ti), P2(ti), ..., Pn(ti)
3. Execution: f(P1(ti)), f(P2(ti)), ..., f(Pn(ti)), where the non-functional output is execution time or memory usage
4. MR verification: detect inconsistencies
5. Repeat

Evaluation
RQ: “How effective is our metamorphic testing approach to automatically detect inconsistencies in code generator families?”
Experimental setup:
Ø  1 code generator family:
–  Haxe code generators
–  Five target programming languages: JAVA, JS, C++, C#, and PHP
Ø  2 non-functional metrics:
–  Performance and memory usage
Ø  7 benchmark libraries

R-chart results: performance variations
26
11 performance inconsistencies are identified

R-chart results: memory variations
27
15 memory usage inconsistencies are identified

PCA results: performance variations
4 performance inconsistencies are identified
PCA detected outliers (plot axes: PC1, PC2; points: test suites), with the values shown for each outlier:
-  JAVA: 4.1, JS: 1, CPP: 5.4, C#: 5.7, PHP: 481.6
-  JAVA: 1.28, JS: 2.9, CPP: 1, C#: 3.3, PHP: 261.3
-  JAVA: 1.1, JS: 2.7, CPP: 1, C#: 3.6, PHP: 258.9
-  JAVA: 1, JS: 12, CPP: 3.1, C#: 4, PHP: 80

PCA results: memory variations
4 memory usage inconsistencies are identified
PCA detected outliers (plot axes: PC1, PC2; points: test suites), with the values shown for each outlier:
-  JAVA: 250.7, JS: 71.7, CPP: 1, C#: 69.9, PHP: 454.1
-  JAVA: 11.9, JS: 1, CPP: 14.6, C#: 36.1, PHP: 620.2
-  JAVA: 214.5, JS: 92.4, CPP: 1, C#: 57.6, PHP: 224.4
-  JAVA: 1.2, JS: 1, CPP: 1.7, C#: 3.6, PHP: 675

Analysis
v  For Core_TS4 in PHP:
•  We observe the intensive use of « arrays »
•  Arrays in PHP are allocated dynamically, leading to a slower writing speed
•  We replace « arrays » by « SplFixedArray »
⇒  Speedup x5
⇒  Memory usage reduction x2
⇒  Issue fixed by the Haxe community
Key findings:
-  Not using the specific types available in the standard library has a real impact on the non-functional behavior of generated code.

Conclusion
§  A non-functional metamorphic relation is used to detect code generator
inconsistencies
−  Two statistical methods are applied to find the right MR definition
§  The evaluation results show that:
−  11 performance and 15 memory usage inconsistencies violate the metamorphic relation for Haxe code generators
−  The analysis of test suites triggering the inconsistencies shows that there
exist potential issues in some code generators, affecting the quality of
delivered software
31

Contribution II:
NOTICE: An approach for auto-tuning generators
https://noticegcc.wordpress.com
32

2
1
Auto-tuning compilers
33
GPLs
High-level
GPLs
Generators
Machine Code
Optimisation flags (e.g., CFLAGS)

Motivating example
¤  GCC 4.8.4:
- 78 optimizations
- 2^78 combinations
Objectives: speedup, memory, etc., under resource constraints
- BOSS: Clients complain about the high memory consumption
- BOSS: Is it possible to consume less CPU? We don’t have enough resources/money
- BOSS: Please, can we optimize even more?
- "Good luck, son!!"
- "WHY ALWAYS ME!!"
- Testing each optimization configuration is impossible
- Heuristics are needed
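A quick back-of-the-envelope computation shows why exhaustive testing is ruled out. The sketch below assumes each of the 78 optimizations is a simple on/off switch and that one configuration can be compiled and measured in one second; the one-second cost is a hypothetical figure chosen only for illustration.

```python
N_OPTIMIZATIONS = 78
SECONDS_PER_YEAR = 365 * 24 * 3600


def configuration_count(n_flags):
    """Number of on/off combinations for n_flags independent switches."""
    return 2 ** n_flags


def exhaustive_years(n_flags, seconds_per_config=1.0):
    """Years needed to evaluate every combination one after the other."""
    return configuration_count(n_flags) * seconds_per_config / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(configuration_count(N_OPTIMIZATIONS))
    print(f"{exhaustive_years(N_OPTIMIZATIONS):.2e} years")
```

Even under this optimistic cost assumption the search would take on the order of 10^15 years, which is why the following slides turn to heuristics.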

Compiler auto-tuning is complex
¤ Constructing a good set of optimization levels (-Ox) is hard
•  Conflicting objectives
•  Complex interactions
•  Unknown effect of some optimizations

Contribution
We propose:
¤ Novel formulation of the compiler optimization problem using Novelty Search [1]
ü  Diverse optimization sequences
ü  Explore the large search space by considering novelty as the main objective
[1] Lehman et al., Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In ALIFE 2008

Diversity-based exploration
Solution representation: a bit vector such as 0 0 1 0 …, corresponding to a command such as gcc -c test.c -fno-dce -fno-dse -fno-align-loops …
Step 1: Random generation
Step 2: Evaluation
-  Novelty metric: calculate the distance of one solution from its K nearest neighbors in the current population and in the archive.
-  Archive: saves solutions that get a novelty metric value higher than a specific novelty threshold value.
Step 3: Selection
-  Tournament selection: select solutions to evolve based on novelty scores.
Step 4: Evolutionary operators
-  Mutation and crossover, applied to individuals such as 0 1 1 1 0 … and 1 0 0 1 1 …
Go to Step 2
Best solution: solution with the best non-functional improvement
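The four steps can be sketched as a small Python loop. This is not the NOTICE implementation: the Hamming distance, the parameter values, the bit-to-flag encoding in `to_gcc_flags`, and the `evaluate` callback (which in practice would compile and measure a program) are all assumptions made for illustration.

```python
import random

FLAG_NAMES = ["dce", "dse", "align-loops"]  # example GCC flag names


def to_gcc_flags(bits, names=FLAG_NAMES):
    """Assumed encoding: bit 1 enables -f<name>, bit 0 emits -fno-<name>."""
    return [("-f" if bit else "-fno-") + name for bit, name in zip(bits, names)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def novelty(candidate, others, k=3):
    """Mean distance to the k nearest neighbours (population + archive)."""
    nearest = sorted(hamming(candidate, other) for other in others)[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


def tournament(population, scores, rng, size=2):
    contestants = rng.sample(range(len(population)), size)
    return population[max(contestants, key=lambda i: scores[i])]


def crossover(a, b, rng):
    point = rng.randrange(1, len(a))  # requires at least two flags
    return a[:point] + b[point:]


def mutate(bits, rng, rate=0.1):
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def novelty_search(evaluate, n_flags, pop_size=10, generations=20,
                   threshold=2.0, k=3, seed=0):
    rng = random.Random(seed)
    # Step 1: random generation
    population = [[rng.randint(0, 1) for _ in range(n_flags)]
                  for _ in range(pop_size)]
    archive = []
    best = max(population, key=evaluate)
    for _ in range(generations):
        # Step 2: evaluation (novelty scores, archive update)
        scores = [novelty(ind, population[:i] + population[i + 1:] + archive, k)
                  for i, ind in enumerate(population)]
        for ind, score in zip(population, scores):
            if score > threshold and ind not in archive:
                archive.append(ind)
        # Steps 3 and 4: tournament selection, crossover, mutation
        parents = population
        population = [mutate(crossover(tournament(parents, scores, rng),
                                       tournament(parents, scores, rng), rng), rng)
                      for _ in range(pop_size)]
        # Track the solution with the best non-functional improvement
        best = max(population + [best], key=evaluate)
    return best, archive
```

Selection is driven only by novelty scores, while `evaluate` is used solely to remember the best solution seen so far; that separation is the point of treating novelty as the main objective.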

Evaluation
Ø  Programs under test
–  Csmith code generator
–  111 Csmith generated programs
–  6 Cbench benchmark programs
Ø  Evolutionary algorithms
–  Mono-objective: Novelty Search (NS), Genetic Algorithm (GA), Random Search (RS)
–  Multi-objective: Novelty Search (NS-II), NSGA-II
Ø  Compiler under test
–  GCC 4.8.4
Ø  Evaluation metrics (over -O0)
–  Speedup (S)
–  Memory/CPU Consumption Reduction (MR and CR)
–  Tradeoff <Speedup – Memory usage>
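The slides name the metrics and state that they are measured over -O0, but do not give formulas. One plausible reading, sketched below, expresses each metric as a percentage improvement relative to the -O0 baseline; the formula and function name are assumptions, not a definition taken from the slides.

```python
def improvement_percent(baseline, optimized):
    """Relative improvement over the -O0 baseline, in percent.

    Assumed formula: (baseline - optimized) / baseline * 100.
    Positive values mean the optimized build uses less time/memory/CPU;
    negative values mean it uses more.
    """
    if baseline <= 0:
        raise ValueError("baseline measurement must be positive")
    return (baseline - optimized) / baseline * 100.0
```

Under this reading, a program that runs in 8 s at -O0 and in 4 s with a discovered sequence has an improvement of 50 percent, and the same function would apply to memory and CPU measurements.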

Research Questions
RQ1: Mono-objective SBSE validation.
-  Optimizations + training set programs: mono-objective search on a non-functional metric yields the best sequence
RQ2: Sensitivity of input programs to optimization sequences.
-  Best sequence in RQ1 applied to unseen programs: non-functional improvement
RQ3: Impact of speedup on resource consumption.
-  Best speedup sequence in RQ1 applied to training set programs: impact on resource consumption
RQ4: Trade-offs between non-functional properties.
-  Optimizations + input program: multi-objective search on the non-functional trade-off <time-memory> yields Pareto front solutions

RQ1- Results
RQ1: Mono-objective SBSE validation.
-  Training set: 10 Csmith programs
-  Average S, MR, and CR
-  Comparison: Ox, RS, GA and NS
Key findings for RQ1:
– Best discovered optimization sequences using mono-objective search techniques always provide better results than standard GCC optimization levels.
– Novelty Search is a good candidate to improve code in terms of non-functional properties since it is able to discover optimization combinations that outperform RS and GA.

RQ2- Results
RQ2: Sensitivity.
-  100 unseen Csmith programs
-  O2 vs O3 vs NS
– It is possible to build general optimization sequences that perform better than standard optimization levels
– Best discovered sequences in RQ1 can be mostly used to improve the memory and CPU consumption of Csmith programs. To answer RQ2, Csmith programs are sensitive to compiler optimizations.

RQ3- Results
RQ3: Impact of optimizations on resource consumption.
-  Ox vs RS vs GA vs NS
-  Best speedup sequence in RQ1 applied to training set programs: impact on CPU and memory
– Optimizing software performance can induce undesirable effects on system resources.
– A trade-off is needed to find a correlation between software performance and resource usage.
Figure panels: memory reduction, CPU reduction; annotation: increase of resource usage

RQ4- Results
RQ4: Trade-offs between non-functional properties.
-  1 Csmith program
-  Trade-off <execution time-memory usage>
– NOTICE is able to construct optimization levels that represent optimal trade-offs between non-functional properties.
– NSGA-II performs better than our NS adaptation for multi-objective optimization. However, NS-II performs clearly better than standard GCC optimizations and previously discovered sequences in RQ1.
Figure legend: Pareto front NS-II (multi-objective), Pareto front NSGA-II (multi-objective), Ofast, O3, O2, O1, best CPU reduction (mono-objective), best memory reduction (mono-objective)
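To make the notion of a Pareto front concrete, here is a minimal sketch that filters non-dominated <execution time, memory usage> pairs, assuming both objectives are to be minimized. The function names and the sample points in the usage note are illustrative only and are not measurements from the experiments.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every objective
    and strictly better on at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```

For the hypothetical measurements `[(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 3.0)]`, the front is `[(1.0, 9.0), (2.0, 5.0), (4.0, 2.0)]`: the other two points are each beaten on both time and memory by some alternative.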

Conclusion
§  Novel formulation of the compiler optimization problem based on Novelty
Search
§  Novelty Search is able to generate effective optimizations
−  Generated sequences perform better than standard levels
−  Our approach outperforms classical approaches (GA and RS)
§  Trade-offs between non-functional properties are constructed
−  NSGA-II performs better than NS and mono-objective approaches
44

Contribution III:
A lightweight environment for monitoring and
testing the generated code
45

2
1
Challenges
46
Machine Code
Generators
GPLs
High-level
GPLs
Templates
Configurations
Files Flags
Heterogeneity
-Monitoring the resource usage of generated code
-Control and limit resources
Reproducibility
-Reproduce tests across different environment settings
Efficiency
-Deploy automatically the generated code without
affecting the system performance

Infrastructure Overview
¤  We propose:
•  A micro-service infrastructure, based on system containers (Docker) as execution platforms, that allows generator experts/users to evaluate the non-functional properties of generated code
Code generation: code generators A, B, and C each generate a SUT
Code execution: the SUTs run in containers A, B, and C
Runtime monitoring engine: containers A’, B’, C’ with footprints A’, B’, C’; resource usage extraction through HTTP requests; resource usage DB

Microservice-based infrastructure
¤ Execute and monitor the generated code using system containers
ü  Different configurations, instances, images, machines, etc.
ü  Resource isolation and management
ü  Less performance overhead

Monitoring environment
-  Software tester: code generation + compilation
-  Component under test: running
-  Monitoring component: monitoring records from the cgroup file systems
-  Back-end database component: time-series database (8086)
-  Front-end visualization component: HTTP requests (CPU, memory, …)
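As a rough illustration of what collecting monitoring records from cgroup file systems involves, the sketch below reads raw counters and turns two CPU samples into a utilisation percentage. It assumes the cgroup v1 layout, where `memory.usage_in_bytes` holds a byte count and `cpuacct.usage` holds cumulative CPU time in nanoseconds; the directory passed in and the helper names are hypothetical, and this is not the monitoring component used in the thesis.

```python
from pathlib import Path


def parse_counter(text):
    """Parse the single integer stored in a cgroup counter file."""
    return int(text.strip())


def read_counter(cgroup_dir, filename):
    """Read one counter, e.g. read_counter(d, "memory.usage_in_bytes").

    `cgroup_dir` is the container's cgroup directory (path is an assumption
    that depends on the host's cgroup setup).
    """
    return parse_counter(Path(cgroup_dir, filename).read_text())


def cpu_percent(prev_ns, curr_ns, interval_s):
    """CPU utilisation between two cpuacct.usage samples, in percent of one core."""
    return (curr_ns - prev_ns) / (interval_s * 1e9) * 100.0
```

A monitoring loop would sample these counters periodically and push the values to the time-series database, which the visualization front-end then queries.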

Conclusion
§  Effective support for automatically deploying, executing, and testing the
generated code in different environment settings
§  The conducted experiments showed the usefulness of this infrastructure for
tuning and testing generators
50

Conclusion and Perspectives
51

Conclusion
Generator experts (build and maintain): "I am now able to automatically test my code generator family in terms of NFP"
Generator users (use and configure): "I can now easily determine the best configuration settings for my generator"
Effective support for testing and resource usage monitoring

Perspectives
v  Combine the proposed black-box approach with traceability tools:
•  Tracking the source of code generator inconsistencies
v  Reduce the time required to tune and test generators:
•  Deploy tests on many nodes in the cloud using multiple containers in parallel
v  Automatic test case generation:
•  Test amplification
•  Evaluate the quality of executed tests (e.g., code coverage)
v  Improve the auto-tuning approach:
•  Evaluate other compilers (e.g., LLVM, Clang)
•  Explore more tradeoffs among resource usage metrics
•  Evaluate different hardware settings

Publications
•  Mohamed Boussaa, Olivier Barais, Benoit Baudry, Gerson Sunyé: Automatic Non-functional Testing of Code Generators Families. In The 15th International Conference on Generative Programming: Concepts & Experiences (GPCE 2016), Amsterdam, Netherlands, October 2016.
•  Mohamed Boussaa, Olivier Barais, Benoit Baudry, Gerson Sunyé: NOTICE: A Framework for Non-functional Testing of Compilers. In 2016 IEEE International Conference on Software Quality, Reliability & Security (QRS 2016), Vienna, Austria, August 2016.
•  Mohamed Boussaa, Olivier Barais, Benoit Baudry, Gerson Sunyé: A Novelty Search-based Test Data Generator for Object-oriented Programs. In Genetic and Evolutionary Computation Conference Companion (GECCO 2015), Madrid, Spain, July 2015.
•  Mohamed Boussaa, Olivier Barais, Benoit Baudry, Gerson Sunyé: A Novelty Search Approach for Automatic Test Data Generation. In 8th International Workshop on Search-Based Software Testing (SBST@ICSE 2015), Florence, Italy, May 2015.
Under review:
•  Mohamed Boussaa, Olivier Barais, Benoit Baudry, Gerson Sunyé: Leveraging Metamorphic Testing to Automatically Detect Inconsistencies in Code Generator Families. IEEE Transactions on Reliability, August 2017.

Thank you for your attention
55

ThingML case study
56
§  3 targets: C, JAVA, JS
§  120 test cases
§  Memory usage
§  2 identified inconsistencies

NSGA-II overview (I)
58
•  NSGA-II: Non-dominated Sorting Genetic Algorithm (K. Deb et al., ’02)
Parent
Population
Offspring
Population
Non-dominated
sorting
F1
F2
F3
F4
Crowding distance
sorting
Population in
next
generation
MOEA Framework http://moeaframework.org/
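The crowding distance step in the overview above can be sketched as follows. This is the standard textbook formulation of the measure, written as a minimal illustration with hypothetical names; it is not code from the MOEA Framework or from the thesis.

```python
import math


def crowding_distance(front):
    """Crowding distance of each solution in one non-dominated front.

    `front` is a list of objective tuples. Boundary solutions on each
    objective get an infinite distance; interior solutions accumulate the
    normalized gap between their two neighbours on every objective.
    """
    n = len(front)
    if n == 0:
        return []
    distance = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        low, high = front[order[0]][m], front[order[-1]][m]
        distance[order[0]] = distance[order[-1]] = math.inf
        if high == low:
            continue  # all solutions share this objective value
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            distance[order[pos]] += gap / (high - low)
    return distance
```

When the last front does not fit entirely into the next generation, the solutions with the largest crowding distance are kept first, which favours a well-spread front.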

SOTA: Functional Testing of Compilers
PLDI’11
PLDI’14

SOTA: Auto-tuning Compilers
CGO’08
ACSAC’08
PLDI’04
61

Runtime Monitoring in Grafana
62
