Applying Machine Learning Techniques to Select Variables Responsible for Compiler Performance Variation
Paulo Renato de Faria (Institute of Computing, University of Campinas (Unicamp); paulo.faria@gmail.com)
Anderson Rocha (Institute of Computing, University of Campinas (Unicamp); anderson.rocha@ic.unicamp.br)
1. Introduction
This report presents the results of applying Machine Learning techniques to discover new ways of optimizing compiler code. A researcher in the compilers area performed a series of experiments with the LLVM compiler, enabling and disabling optimizations independently for each test program (discrete variables). The experiments involve 45 different optimizations (the input parameters) and one target variable, the program runtime (a continuous variable). The dataset comprises 46,945 examples (with noisy data) divided into 19 different programs (around 2,400 instances per program).
2. Activities
Classification Trees, as described by Breiman et al. [1], perform inductive inference, that is, they reach general conclusions from specific examples. This technique seems appropriate here because the input variables are discrete. It was also observed that the original data contain several repetitions of the same inputs (noisy data), which plays to another strength of Classification Trees: their robustness to noise.
3. Proposed Solutions
One algorithm was implemented to deal with the problem, developed in R using the rpart function with method="class".
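As a minimal sketch of this setup (assuming a data frame train whose columns are the 45 optimization flags plus a factor column class holding the performance label; these names are illustrative, not from the original experiments), the call would be:

library(rpart)

# Fit a classification tree on the optimization flags.
fit <- rpart(class ~ ., data = train, method = "class")

printcp(fit) # complexity-parameter table, useful later for pruning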
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental step is to find the root node, the attribute that best splits the data. One measure used for this is entropy (H), which quantifies the homogeneity of the examples and is calculated as below:
H(S) = \sum_{i=1}^{c} (-p_i \log_2 p_i)    (1)
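As a small illustration, Equation (1) can be computed in R as follows (the helper name entropy is ours):

# Entropy of a vector of class labels: H(S) = sum_i (-p_i * log2 p_i)
entropy <- function(labels) {
  p <- table(labels) / length(labels) # class proportions p_i
  -sum(p * log2(p))
}

entropy(c("Good", "Good", "Bad", "Bad")) # 1 bit: maximally mixed
entropy(rep("Good", 4))                  # 0 bits: homogeneous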
The split function used to choose non-leaf nodes will be Information Gain, which measures the reduction in entropy, as follows:
IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)    (2)
where S_v is the subset of S for which A has value v.
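Equation (2) can be sketched on top of the entropy helper above (names are illustrative):

# Information gain of splitting the labels on one attribute.
information_gain <- function(labels, attribute) {
  weighted <- sapply(split(labels, attribute), function(s) {
    (length(s) / length(labels)) * entropy(s) # (|Sv|/|S|) * H(Sv)
  })
  entropy(labels) - sum(weighted)
}

# e.g., gain of splitting the performance class on the sroa flag:
# information_gain(train$class, train$sroa)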
3.2. Quality measures
To assess the quality of the results, we use Precision, Accuracy, and AUC (the area under the ROC curve). Precision is defined as the proportion of true positives among all positive results (here the positive class is the good-performance items). Accuracy is the proportion of correctly identified results (both true positives and true negatives) in the classification. AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
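The three measures can be sketched in base R as below, assuming truth and pred are the true and predicted labels and score is the predicted probability of the positive class (all names are illustrative; the AUC uses its rank formulation):

quality <- function(truth, pred, score, positive = "Good-performance") {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  precision <- tp / (tp + fp)     # true positives over all predicted positives
  accuracy <- mean(pred == truth) # proportion of correctly identified results
  pos <- score[truth == positive]
  neg <- score[truth != positive]
  # probability a random positive is ranked above a random negative (ties count 1/2)
  auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
  c(precision = precision, accuracy = accuracy, auc = auc)
}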
4. Experiments and Discussion
4.1. Data preprocessing
To deal with the large number of repetitions in the data, we used the function unique to keep only the distinct cases for analysis.
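In R this is a one-line operation (raw_data is an illustrative name for one program's examples):

clean <- unique(raw_data) # drops duplicated rows, keeping the first occurrence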
4.1.1 Data splitting
The data was split into 3 partitions for each program under analysis, using the following proportions: 60% for training, 20% for validation, and 20% for testing. This was implemented in R as below:
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and 20% for testing
  validationIndex <- sample(otherIndex,
                            trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
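Example usage, with a fixed seed so the partition is reproducible (program_data is an illustrative name):

splits <- splitdf(program_data, seed = 42)
trainset <- splits$trainset
validationset <- splits$validationset
testset <- splits$testset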
Table 1 summarizes the number of instances after the preprocessing phase.
Prog.  Noisy Data  Unique Data  Train (60%)  Valid (20%)  Test (20%)
1      2468        172          103          34           35
2      2473        222          133          44           45
3      2470        228          137          46           45
4      2475        217          130          43           44
5      2468        218          131          44           43
6      2479        250          150          50           50
7      2451        219          131          44           44
8      2476        201          121          40           40
9      2468        224          134          45           45
10     2472        197          118          39           40
11     2476        214          128          43           43
12     2472        191          115          38           38
13     2473        228          137          46           45
14     2467        210          126          42           42
15     2468        217          130          43           44
16     2470        168          101          34           33
17     2473        211          127          42           42
18     2478        245          147          49           49
19     2468        199          119          40           40
All    46945       4031         2418         806          807
Table 1. Dividing the data into training/validation/testing sets.
4.2. Runtime classification
To discretize the runtime values of each instance as Positive (good performance) or Negative (not so good performance), we used the function scale to apply z-normalization (centering on the mean and dividing by the standard deviation σ).
The first rule applied was the following partition around the mean (sketched in R after the list):
• if z-norm-runtime < 0 ⇒ "Good-performance"
• if z-norm-runtime >= 0 ⇒ "Bad-performance"
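A minimal sketch of this rule (data frame and column names are illustrative):

z <- as.numeric(scale(program_data$runtime)) # z-normalized runtime
program_data$class <- factor(
  ifelse(z < 0, "Good-performance", "Bad-performance"))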
This approach is based on the histogram of the items (see Figure 1 for program 1 as an example). Boxplots were also plotted (Figure 2) to check whether there are outliers and to understand the distributions.
Figure 1. Histogram of program 1 runtimes.
Figure 2. Box plot of the program 1 runtime distribution.
The second rule applied was the following partition around the quartiles, applying each rule only when the previous ones do not hold (sketched in R after the list):
• if z-norm-runtime < 25% quartile ⇒ "VeryGood-performance"
• if z-norm-runtime < 50% quartile ⇒ "Good-performance"
• if z-norm-runtime < 75% quartile ⇒ "Bad-performance"
• if z-norm-runtime >= 75% quartile ⇒ "VeryBad-performance"
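A sketch of this rule using cut() at the sample quartiles (names are illustrative):

z <- as.numeric(scale(program_data$runtime))
program_data$class <- cut(z,
  breaks = quantile(z, probs = c(0, 0.25, 0.50, 0.75, 1)),
  labels = c("VeryGood-performance", "Good-performance",
             "Bad-performance", "VeryBad-performance"),
  include.lowest = TRUE) # lowest runtimes map to VeryGood-performance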
4.3. Results
4.3.1 Partition around the mean
The classification tree for the entire dataset is shown in Figure 3. To summarize the individual trees found using the mean separation, a table with the first 5 parameters found was created (see Table 2).
4.3.2 Partition using quartiles
The classification tree for the entire dataset is shown in Figure 4.
Prog.  DT Height  Pruned  Par 1            Par 2            Par 3          Par 4            Par 5
1      1          1       basicaa=1
2      7          2       sroa=0           loop.rotate=0
3      10         10      simplifycfg=0    sroa=0           gvn=0          memcpyopt=0      jump.threading=1
4      8          3       instcombine=0    sroa=0           loop.rotate=1  adce=1           functionattrs=1
5      2          2       licm=0           loop.rotate=0
6      4          3       sroa=0           simplifycfg=0    instcombine=0  basicaa=0
7      1          1       sroa=1
8      1          1       tailcallelim=0
9      4          3       sroa=0           inline=0         loop.rotate=1  loop.deletion=0
10     4          1       functionattrs=0  loop.simplify=0  verify=0       simplifycfg=0
11     1          1       sroa=0
12     1          1       sroa=0
13     6          2       sroa=0           loop.rotate=1    globalopt=0    loop.deletion=1  lcssa=1
14     3          3       sroa=0           inlinecost=0     gvn=0
15     10         2       loop.rotate=0    tailcallelim=1   deadargelim=0  instcombine=1
16     2          2       inline=0         loop.rotate=0
17     4          3       simplifycfg=0    sroa=0           basicaa=0      instcombine=0
18     5          5       loop.rotate=0    sccp=1           indvars=1      ipsccp=1         early.cse=1
19     1          1       sroa=0
All    2          2       sroa=0           inline=0
Table 2. Tree size and top 5 parameters to find good performance for the mean partition.
Figure 3. Classification tree for all programs using the mean partition.
To summarize the individual trees found using the quartile separation, a table with the first 5 parameters found was created (see Table 3).
4.3.3 Classification trees quality measures
The quality measures obtained for each tree are reported in Table 4 and discussed in the next section.
5. Conclusions and Future Work
Analysing Figure 3 and Figure 4, it is possible to find the set of features that matter most for optimizing the code when using all the programs at the same time. In both cases (mean and quartile separation), sroa=0 (a few times as 1) and inline=0 were the first parameters in common. For the quartile separation, it was also possible to use simplifycfg=0, gvn=0, basicaa=0, jump.threading=0 to classify into the first quartile (the best runtime values).
Figure 4. Classification tree for all programs using the quartile partition.
Table 2 presents how the solutions vary for each program when applying the mean partition. We used a cross-validation technique (using the test set) to find the pruned tree height, as sketched below. Not all programs need sroa=0 and inline=0, as might have been expected.
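One common way to choose the pruning point with rpart is to take the complexity parameter that minimizes the cross-validated error in the cp table; the sketch below shows that idea and is not necessarily the exact procedure used in this report:

library(rpart)

fit <- rpart(class ~ ., data = trainset, method = "class")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp) # prune to the best cross-validated size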
Prog.  Height  Pruned  Par 1           Par 2                    Par 3                    Par 4                     Par 5
1      8       4       basicaa=1       licm=0                   strip.dead.prototypes=1  preverify=1
2      10      10      sroa=0          strip.dead.prototypes=0  basiccg=1                basicaa=1                 scalar.evolution=1
3      7       7       sroa=0          simplifycfg=0            globalopt=1              memdep=1                  loop.deletion=1
4      3       7       instcombine=0   sroa=0                   loop.rotate=1
5      6       6       licm=0          loop.rotate=0            loop.idiom=1             instcombine=0             strip.dead.prototypes=0
6      7       2       sroa=0          loop.rotate=1
7      9       3       sroa=1          simplifycfg=0            basicaa=1
8      8       6       tailcallelim=0  basicaa=0                memdep=1                 early.cse=1               loop.unroll=1
9      8       4       sroa=0          inline=0                 loop.rotate=1            instcombine=1
10     9       9       loop.rotate=1   memdep=0                 simplifycfg=0            basicaa=0                 preverify=1
11     10      7       sroa=0          strip.dead.prototypes=0  basiccg=1                deadargelim=0             domtree=0
12     7       7       sroa=0          instcombine=1            loop.rotate=0            targetlibinfo=0           prune.eh=1
13     10      1       loop.rotate=1
14     7       2       sroa=0          inline.cost=0
15     6       5       loop.rotate=0   tailcallelim=0           prune.eh=0               correlated.propagation=0  preverify=0
16     8       7       inline=0        loop.rotate=0            jump.threading=0         targetlibinfo=1           notti=1
17     11      5       sroa=0          simplifycfg=0            basicaa=0                instcombine=0             deadargelim=0
18     8       3       sroa=0          indvars=0                constmerge=0
19     7       1       loop.rotate=1
All    6       6       sroa=0          inline=0                 simplifycfg=0            gvn=0                     basicaa=0
Table 3. Tree size and top 5 parameters for the quartile partition to find good performance.
Prog.  Prec. Mean  Prec. Quartiles  Acc. Mean  Acc. Quartiles  AUC Mean  AUC Quartiles
1      0.50        0.65             0.74       0.59            0.54      0.59
2      0.69        0.57             0.64       0.57            0.61      0.42
3      0.31        0.54             0.72       0.61            0.52      0.64
4      0.76        0.71             0.74       0.70            0.68      0.70
5      0.82        0.50             0.84       0.59            0.78      0.59
6      0.91        0.74             0.80       0.72            0.80      0.72
7      0.80        0.63             0.57       0.61            0.55      0.52
8      0.75        0.50             0.75       0.58            0.45      0.61
9      0.91        0.91             0.87       0.80            0.82      0.81
10     0.62        0.50             0.62       0.56            0.49      0.58
11     0.68        0.44             0.70       0.60            0.61      0.53
12     0.94        0.76             0.95       0.66            0.83      0.67
13     0.71        0.37             0.70       0.63            0.65      0.41
14     0.79        0.68             0.76       0.69            0.68      0.68
15     0.76        0.61             0.70       0.60            0.61      0.50
16     0.74        0.57             0.79       0.62            0.75      0.63
17     0.85        0.63             0.81       0.69            0.80      0.70
18     0.82        0.74             0.73       0.84            0.66      0.85
19     0.80        0.67             0.80       0.65            0.50      0.65
All    0.67        0.57             0.68       0.60            0.59      0.60
Table 4. Comparison of quality measures for each kind of partition.
Some parameters are common to both partitions (mean and quartile), such as loop.rotate (ambiguous, sometimes 0 and sometimes 1), adce=1, instcombine=0, licm=0, simplifycfg=0, and tailcallelim (sometimes 1 and sometimes 0). Other variables, such as functionattrs=1, loop.deletion=1, lcssa=1, gvn=0, sccp=1, indvars=1, ipsccp=1, early.cse=1, also helped to classify these specific programs using the mean partition. Table 3 presents how the solutions vary for each program when using the quartile partition. The main difference is that the tree height is greater for the quartile partition because the number of classes is also greater. Some new variables appeared (specifically when applying quartiles), such as strip.dead.prototypes=1, globalopt=1, memdep=1, loop.deletion=1, prune.eh=1, early.cse=1, loop.unroll=1. Regarding the quality of the trees, Table 4 summarizes what was found. The results for the individual programs used the training versus the validation set, while the run over the entire dataset used the training versus the test set, to avoid data collision and to improve the confidence in the analysis. From a general perspective (using all programs), the mean partition gave the best precision (67%) and accuracy (68%), against 57% and 60%, respectively, for the quartile partition. The AUC was not so high (around 60%) in either case. In Table 4, the cases worth highlighting are those equal to or higher than 80% and those equal to or lower than 50%. Program 9 was the easiest one on which to reach good quality levels, while programs 1 and 3 had the worst results (50% and 31% precision). For programs 1 and 3, the quartile separation gave better, though modest, results (65% and 54% precision); a possible explanation is that both have few good examples to train on. Programs 5, 6, 12, 17, and 18 individually had good classification results for the mean separation. Programs 10 and 11 had few bad examples to train on and presented intermediate classification quality results.
References
[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
