Applying Machine Learning techniques to select variables responsible for
compiler performance variation
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This report presents the results of applying Machine
Learning techniques to discover new ways of optimising
compiler codes. A researcher in the field of compilers
performed a series of experiments with the LLVM compiler,
enabling and disabling optimizations independently
for each test program (discrete variables). The experiments
involve 45 different optimizations (input parameters) and
one target variable, the program runtime (a continuous
variable). The dataset comprises 46,945 examples (with
noisy data) divided into 19 different programs (around 2,400
instances per program).
2. Activities
Regarding the use of Classification Trees for inductive
inference, that is, reaching general conclusions from
specific examples, Breiman et al. [1] can be cited. This
technique seems appropriate here because the input
variables are discrete. It was also observed that the original
data contains several repetitions of the same input data,
and robustness to noisy data is another advantage of
Classification Trees.
3. Proposed Solutions
One algorithm was implemented to deal with the problem,
developed in the R language using the rpart function with
method="class".
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental step
is to find the root node (the attribute that best splits the
data). One of the measures used is Entropy (H), which
measures the homogeneity of the examples, calculated as
below:
∗Is with the Institute of Computing, University of Campinas (Uni-
camp). Contact: paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Uni-
camp). Contact: anderson.rocha@ic.unicamp.br
H(S) = \sum_{i=1}^{c} (-p_i \log_2 p_i)   (1)
The tree split function used to find non-leaf nodes is In-
formation Gain, which measures the reduction in Entropy
as follows:
IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)   (2)
where Sv is the subset of S for which A has value v.
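For illustration, Equations 1 and 2 can be computed directly. The sketch below uses Python with a toy dataset (the report's implementation is in R, so the function names here are hypothetical):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = sum over classes of -p_i * log2(p_i), Equation 1."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label):
    """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v), Equation 2.

    `examples` is a list of dicts; `attr` and `label` are key names.
    """
    labels = [e[label] for e in examples]
    ig = entropy(labels)
    for v in set(e[attr] for e in examples):
        subset = [e[label] for e in examples if e[attr] == v]
        ig -= (len(subset) / len(examples)) * entropy(subset)
    return ig

# Toy data: one binary optimization flag and a performance label.
data = [
    {"sroa": 0, "perf": "good"}, {"sroa": 0, "perf": "good"},
    {"sroa": 1, "perf": "bad"},  {"sroa": 1, "perf": "good"},
]
print(entropy([e["perf"] for e in data]))         # ~0.811
print(information_gain(data, "sroa", "perf"))     # ~0.311
```

An attribute with higher Information Gain yields purer subsets and is therefore preferred as a split.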
3.2. Quality measures
To assess the quality of the results, Precision, Accu-
racy and AUC (based on the ROC curve) will be used. Precision
is defined as the proportion of true positives among all
positive results (here the positive class is the good-per-
formance items). Accuracy is the proportion of correctly
identified results (both true positives and true neg-
atives) in the classification. AUC is the probability that
a classifier will rank a randomly chosen positive instance
higher than a randomly chosen negative one.
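These three measures can be sketched as follows (a Python illustration of the standard definitions; the function names are our own, not part of the experiments):

```python
def precision(y_true, y_pred, pos=1):
    """True positives over all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == pos and t == pos)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == pos and t != pos)
    return tp / (tp + fp)

def accuracy(y_true, y_pred):
    """Correctly identified results (positives and negatives) over all."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores, pos=1):
    """P(random positive scores higher than random negative); ties count 1/2."""
    pos_s = [s for t, s in zip(y_true, scores) if t == pos]
    neg_s = [s for t, s in zip(y_true, scores) if t != pos]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_s for n in neg_s)
    return wins / (len(pos_s) * len(neg_s))

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
scores = [0.9, 0.4, 0.3, 0.6]
print(precision(y_true, y_pred))  # 0.5
print(accuracy(y_true, y_pred))   # 0.5
print(auc(y_true, scores))        # 0.75
```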
4. Experiments and Discussion
4.1. Data preprocessing
To deal with the large number of repetitions in the data, we
used the function unique to keep only the distinct cases for
analysis.
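The effect of R's unique can be sketched as follows (a hypothetical Python equivalent for illustration; the actual preprocessing used R):

```python
def unique_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence,
    analogous to R's unique() on a data frame."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# Toy rows: two flag settings plus a runtime, with one exact repeat.
rows = [(0, 1, 12.3), (0, 1, 12.3), (1, 0, 9.8)]
print(unique_rows(rows))  # [(0, 1, 12.3), (1, 0, 9.8)]
```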
4.1.1 Data splitting
The data was split into 3 partitions for each program under
analysis, using the following proportions: 60% for training,
20% for validation and 20% for testing. This was imple-
mented in R as below:
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and 20% for testing
  validationIndex <- sample(otherIndex,
                            trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
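For readers unfamiliar with R, the same 60/20/20 split can be sketched in Python (a hypothetical translation for illustration, not the code used in the experiments):

```python
import random

def split_df(rows, seed=None):
    """60% train / 20% validation / 20% test split of a list of rows."""
    if seed is not None:
        random.seed(seed)
    idx = list(range(len(rows)))
    random.shuffle(idx)
    n_train = int(len(idx) * 0.6)
    n_valid = (len(idx) - n_train) // 2
    train = [rows[i] for i in idx[:n_train]]
    valid = [rows[i] for i in idx[n_train:n_train + n_valid]]
    test = [rows[i] for i in idx[n_train + n_valid:]]
    return train, valid, test

train, valid, test = split_df(list(range(100)), seed=42)
print(len(train), len(valid), len(test))  # 60 20 20
```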
Table 1 summarizes the number of instances after the pre-
processing phase.
Prog.  Noisy Data  Unique Data  Train (60%)  Valid (20%)  Test (20%)
1 2468 172 103 34 35
2 2473 222 133 44 45
3 2470 228 137 46 45
4 2475 217 130 43 44
5 2468 218 131 44 43
6 2479 250 150 50 50
7 2451 219 131 44 44
8 2476 201 121 40 40
9 2468 224 134 45 45
10 2472 197 118 39 40
11 2476 214 128 43 43
12 2472 191 115 38 38
13 2473 228 137 46 45
14 2467 210 126 42 42
15 2468 217 130 43 44
16 2470 168 101 34 33
17 2473 211 127 42 42
18 2478 245 147 49 49
19 2468 199 119 40 40
All 46945 4031 2418 806 807
Table 1. Dividing data in training/validation/testing
4.2. Runtime classification
To discretize the runtime value of each instance as Pos-
itive (good performance) or Negative (not so good perfor-
mance), we used the function scale to apply z-normalization
(centering on the mean and dividing by the standard devia-
tion σ).
The first rule applied was the following partition around
the mean:
• if z-norm-runtime < 0 ⇒ "Good performance"
• if z-norm-runtime ≥ 0 ⇒ "Bad performance"
This approach is based on the histogram of the items (as
the example at Figure 1 for program 1).
Boxplots were also plotted (Figure 2) to check for outliers
and understand their distributions.
Figure 1. Histogram of the program 1.
Figure 2. Box plot of the program 1 runtime distribution.
The second rule applied was the following partition
around the quartiles:
• if z-norm-runtime < 25% quartile ⇒ "Very good performance"
• if 25% quartile ≤ z-norm-runtime < 50% quartile ⇒ "Good performance"
• if 50% quartile ≤ z-norm-runtime < 75% quartile ⇒ "Bad performance"
• if z-norm-runtime ≥ 75% quartile ⇒ "Very bad performance"
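Both discretization rules can be sketched as follows (a Python illustration; the report used R's scale, so the helper names are hypothetical and the quartile lookup is a crude sorted-index approximation):

```python
from statistics import mean, stdev

def z_normalize(xs):
    """Center on the mean and divide by the standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def label_mean(z):
    # Runtimes below the mean are fast, i.e. good performance.
    return "Good" if z < 0 else "Bad"

def label_quartile(z, zs):
    """Four classes from the 25%/50%/75% quartiles of the z-scores."""
    qs = sorted(zs)
    q1, q2, q3 = qs[len(qs) // 4], qs[len(qs) // 2], qs[3 * len(qs) // 4]
    if z < q1:
        return "VeryGood"
    if z < q2:
        return "Good"
    if z < q3:
        return "Bad"
    return "VeryBad"

runtimes = [1.2, 0.9, 1.5, 2.0, 0.8, 1.1, 1.3, 1.7]
zs = z_normalize(runtimes)
print([label_mean(z) for z in zs])
# ['Good', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad']
print(label_quartile(zs[3], zs))  # VeryBad (slowest runtime, 2.0)
```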
4.3. Results
4.3.1 Partition around the mean
The classification tree for the entire dataset is shown in Figure 3.
To summarize the individual trees found using the mean
separation, a table with the first 5 parameters found was cre-
ated (see Table 2).
4.3.2 Partition using quartiles
The classification tree for the entire dataset is shown in Figure 4.
Prog. DT Height Pruned Par 1 Par 2 Par 3 Par 4 Par 5
1 1 1 basicaa=1
2 7 2 sroa=0 loop.rotate=0
3 10 10 simplifycfg=0 sroa=0 gvn=0 memcpyopt=0 jump.threading=1
4 8 3 instcombine=0 sroa=0 loop.rotate=1 adce=1 functionattrs=1
5 2 2 licm=0 loop.rotate=0
6 4 3 sroa=0 simplifycfg=0 instcombine=0 basicaa=0
7 1 1 sroa=1
8 1 1 tailcallelim=0
9 4 3 sroa=0 inline=0 loop.rotate=1 loop.deletion=0
10 4 1 functionattrs=0 loop.simplify=0 verify=0 simplifycfg=0
11 1 1 sroa=0
12 1 1 sroa=0
13 6 2 sroa=0 loop.rotate=1 globalopt=0 loop.deletion=1 lcssa=1
14 3 3 sroa=0 inlinecost=0 gvn=0
15 10 2 loop.rotate=0 tailcallelim=1 deadargelim=0 instcombine=1
16 2 2 inline=0 loop.rotate=0
17 4 3 simplifycfg=0 sroa=0 basicaa=0 instcombine=0
18 5 5 loop.rotate=0 sccp=1 indvars=1 ipscco=1 early.cse=1
19 1 1 sroa=0
All 2 2 sroa=0 inline=0
Table 2. Tree size and top 5 parameters to find good performance for the mean partition
Figure 3. Classification tree for all programs using mean.
To summarize the individual trees found using the quar-
tile separation, a table with the first 5 parameters found was
created (see Table 3).
4.3.3 Classification tree quality measures
5. Conclusions and Future Work
Analysing Figure 3 and Figure 4, it is possible to find
the set of features that are most important for optimiz-
ing the code when using all the programs at the same time.
In both cases (mean and quartile separation), sroa=0 (a few
times as 1) and inline=0 were the first parameters in com-
mon. For the quartile separation it was also possible to
Figure 4. Classification tree for all programs using quartile.
use simplifycfg=0, gvn=0, basicaa=0, jump.threading=0 as
a way to classify into the first quartile (which contains the best run-
time values). Table 2 presents how the solutions vary
for each program when applying the mean partition. We used
a cross-validation technique (using the test set) to find the
pruned height size. Not all programs need sroa=0 and in-
line=0, as would be expected. Some are common for
Prog. Height Pruned Par 1 Par 2 Par 3 Par 4 Par 5
1 8 4 basicaa=1 licm=0 strip.dead.prototypes=1 preverify=1
2 10 10 sroa=0 strip.dead.prototypes=0 basiccg=1 basicaa=1 scalar.evolution=1
3 7 7 sroa=0 simplifycfg=0 globalopt=1 memcep=1 loop.deletion=1
4 3 7 instcombine=0 sroa=0 loop.rotate=1
5 6 6 licm=0 loop.rotate=0 loop.idiom=1 instcombine=0 strip.dead.prototypes=0
6 7 2 sroa=0 loop.rotate=1
7 9 3 sroa=1 simplifycfg=0 basicaa=1
8 8 6 tailcallelim=0 basicaa=0 memdep=1 early.cse=1 loop.unroll=1
9 8 4 sroa=0 inline=0 loop.rotate=1 instcombine=1
10 9 9 loop.rotate=1 memdep=0 simplifycfg=0 basicaa=0 preverify=1
11 10 7 sroa=0 strip.dead.prototypes=0 basiccg=1 deadargelim=0 domtree=0
12 7 7 sroa=0 instcombine=1 loop.rotate=0 targetlibinfo=0 prune.eh=1
13 10 1 loop.rotate=1
14 7 2 sroa=0 inline.cost=0
15 6 5 loop.rotate=0 tailcallelim=0 prune.eh=0 correlated.propagation=0 preverify=0
16 8 7 inline=0 loop.rotate=0 jump.threading=0 targetlibinfo=1 notti=1
17 11 5 sroa=0 simplifycfg=0 basicaa=0 instcombine=0 deadargelim=0
18 8 3 sroa=0 indvars=0 constmerge=0
19 7 1 loop.rotate=1
All 6 6 sroa=0 inline=0 simplifycfg=0 gvn=0 basicaa=0
Table 3. Tree size and top 5 parameters for the quartile partition to find good performance
Prog.  Prec. (Mean)  Prec. (Quartiles)  Acc. (Mean)  Acc. (Quartiles)  AUC (Mean)  AUC (Quartiles)
1 0.50 0.65 0.74 0.59 0.54 0.59
2 0.69 0.57 0.64 0.57 0.61 0.42
3 0.31 0.54 0.72 0.61 0.52 0.64
4 0.76 0.71 0.74 0.70 0.68 0.70
5 0.82 0.50 0.84 0.59 0.78 0.59
6 0.91 0.74 0.80 0.72 0.80 0.72
7 0.80 0.63 0.57 0.61 0.55 0.52
8 0.75 0.50 0.75 0.58 0.45 0.61
9 0.91 0.91 0.87 0.80 0.82 0.81
10 0.62 0.50 0.62 0.56 0.49 0.58
11 0.68 0.44 0.70 0.60 0.61 0.53
12 0.94 0.76 0.95 0.66 0.83 0.67
13 0.71 0.37 0.70 0.63 0.65 0.41
14 0.79 0.68 0.76 0.69 0.68 0.68
15 0.76 0.61 0.70 0.60 0.61 0.50
16 0.74 0.57 0.79 0.62 0.75 0.63
17 0.85 0.63 0.81 0.69 0.80 0.70
18 0.82 0.74 0.73 0.84 0.66 0.85
19 0.80 0.67 0.80 0.65 0.50 0.65
All 0.67 0.57 0.68 0.60 0.59 0.60
Table 4. Comparison of quality measures for each kind of partition
both partitions (mean and quartile), such as loop.rotate (am-
biguous, sometimes 0 and sometimes 1), adce=1, in-
stcombine=0, licm=0, simplifycfg=0, and tailcallelim (some-
times 1 and sometimes 0). There are other variables,
such as functionattrs=1, loop.deletion=1, lcssa=1, gvn=0,
sccp=1, indvars=1, ipscco=1 and early.cse=1, that also helped
to classify these specific programs using the mean parti-
tion. Table 3 presents how the solutions vary for each pro-
gram when using the quartile partition. The main difference
is that the height of the tree is greater for quartiles because
the number of classes is also greater. But some
new variables were noticed (specifically by applying quartiles),
such as strip.dead.prototypes=1, globalopt=1, memdep=1,
loop.deletion=1, prune.eh=1, early.cse=1 and loop.unroll=1.
Regarding the quality of the trees, Table 4 summarizes what
was found. The results for the individual programs used
training versus validation set, while the entire program set used
training versus test set, to avoid data collision and improve
the confidence in the analysis. From a general perspective
(using all programs), the mean partition gave the best Pre-
cision (67%) and Accuracy (68%), against 57% and 60%
respectively for the quartile partition. The AUC was not so high
(around 60%) in both cases. In Table 4, the cases equal to or
higher than 80% were highlighted in bold and the cases
equal to or lower than 50% were underlined. Program 9 was the easiest
one to reach good quality levels, while programs 1 and 3 had
the worst results (50% and 31% precision). For 1 and 3, the
quartile separation gave better results (though modest:
65% and 54% precision); a possible explanation is that
both have few good examples to train on. Programs 5, 6, 12,
17 and 18 individually had good classification results for the mean
separation. Programs 10 and 11 had few bad examples to train on
and presented an intermediate classification quality result.
References
[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.
Classification and Regression Trees. Wadsworth, 1984.