2014-mo444-practical-assignment-04-paulo_faria
Applying Machine Learning techniques to select variables responsible for compiler performance variation

Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This report presents the results found after applying Machine Learning techniques to discover new ways of optimising compiler codes. A researcher in the area of compilers performed a series of experiments with the LLVM compiler, enabling and disabling optimizations independently for each test program (discrete variables). The experiments involve 45 different optimizations (input parameters) and one target variable, the program runtime (a continuous variable). The dataset comprises 46,945 examples (with noisy data) divided into 19 different programs (around 2,400 instances per program).
2. Activities
Classification Trees were used to perform inductive inference, that is, to reach general conclusions from specific examples; see Breiman et al. [1]. This technique seems appropriate because the input variables are discrete. It was also observed that the original data contain several repetitions of the same input data, which is another advantage in favour of Classification Trees: their robustness to noisy data.
3. Proposed Solutions
One algorithm was implemented to deal with the problem, developed in the R language using the rpart function with method = "class".
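As a sketch of this setup (the flag and class column names below are hypothetical, since the report does not list the dataset's actual column names), such a tree can be fit as follows:

```r
# Illustrative sketch only: flag1/flag2/perf are hypothetical columns
# standing in for the 45 optimization flags and the runtime class.
library(rpart)

df <- data.frame(
  flag1 = c(0, 0, 1, 1, 0, 1, 0, 1),
  flag2 = c(0, 1, 0, 1, 0, 1, 1, 0),
  perf  = factor(c("Bad", "Bad", "Good", "Good",
                   "Bad", "Good", "Bad", "Good"))
)

# method = "class" grows a classification tree; splitting on
# "information" (entropy) matches Section 3.1's Information Gain.
fit <- rpart(perf ~ ., data = df, method = "class",
             parms = list(split = "information"),
             control = rpart.control(minsplit = 2, cp = 0))
pred <- predict(fit, df, type = "class")
```

On this toy data flag1 separates the classes perfectly, so the fitted tree reproduces the labels.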
3.1. Classification Trees and Information Gain
To build the classification tree, one fundamental concept is to find the root node (the attribute that best splits the data). One of the measures used is Entropy (H), which measures the homogeneity of the examples, calculated as below:

H(S) = \sum_{i=1}^{c} (-p_i \log_2 p_i)    (1)

The tree split function used to find non-leaf nodes is Information Gain, which measures the reduction in Entropy, as follows:

IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, H(S_v)    (2)

where S_v is the subset of S for which attribute A has value v.

∗ Paulo Renato de Faria is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
† Anderson Rocha is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
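As a sanity check, Equations (1) and (2) can be sketched directly in R (toy labels and attributes, not the report's data):

```r
# Entropy H(S), Equation (1): sum over classes of -p_i * log2(p_i)
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]              # skip empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Information Gain IG(S, A), Equation (2): entropy of S minus the
# weighted entropy of the subsets S_v induced by attribute A
info_gain <- function(labels, attribute) {
  ig <- entropy(labels)
  for (v in unique(attribute)) {
    sv <- labels[attribute == v]
    ig <- ig - (length(sv) / length(labels)) * entropy(sv)
  }
  ig
}

labels <- c("Good", "Good", "Bad", "Bad")
a1 <- c(1, 1, 0, 0)   # perfectly separates the classes: IG = 1 bit
a2 <- c(1, 0, 1, 0)   # uninformative split: IG = 0
```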
3.2. Quality measures
To assess the quality of the results, Precision, Accuracy and AUC (based on the ROC curve) are used. Precision is defined as the proportion of true positives among all positive results (here the reference class is the good-performance items). Accuracy is the proportion of true (correctly identified) results, both true positives and true negatives, in the classification. AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
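A minimal sketch of Precision and Accuracy from predicted vs. actual labels (illustrative vectors, not the report's results; AUC would additionally need ranked scores, e.g. via the pROC package):

```r
# Illustrative labels; "Good" is the positive (good-performance) class.
actual    <- c("Good", "Good", "Good", "Bad", "Bad", "Bad")
predicted <- c("Good", "Good", "Bad",  "Bad", "Bad", "Good")

tp <- sum(predicted == "Good" & actual == "Good")   # true positives
fp <- sum(predicted == "Good" & actual == "Bad")    # false positives
tn <- sum(predicted == "Bad"  & actual == "Bad")    # true negatives

precision <- tp / (tp + fp)               # TP among all predicted positives
accuracy  <- (tp + tn) / length(actual)   # correct results among all cases
```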
4. Experiments and Discussion
4.1. Data preprocessing
To deal with the large number of repetitions in the data, we used the function unique to keep only the distinct cases for analysis.
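On toy data (illustrative columns, not the real ones), the deduplication step looks like this:

```r
# unique() drops repeated rows, keeping one copy of each
df <- data.frame(flag1 = c(0, 0, 1, 1),
                 flag2 = c(1, 1, 0, 0),
                 class = c("Bad", "Bad", "Good", "Good"))
dedup <- unique(df)   # rows 1-2 and rows 3-4 collapse to one row each
nrow(df)      # 4
nrow(dedup)   # 2
```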
4.1.1 Data splitting
The data was split into 3 partitions for each program under analysis, using the following proportions: 60% for training, 20% for validation and 20% for testing. This was implemented in R as below:

splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # 60% for training
  trainindex <- sample(index, trunc(length(index) * 0.6))
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  # 20% for validation and 20% for testing
  validationIndex <- sample(otherIndex,
                            trunc(length(otherIndex) / 2))
  validationset <- otherset[validationIndex, ]
  testset <- otherset[-validationIndex, ]
  list(trainset = trainset,
       validationset = validationset,
       testset = testset)
}
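A quick check of the resulting proportions (splitdf is restated here so the snippet runs on its own; df is an illustrative 100-row data frame, not the report's data):

```r
# Same 60/20/20 splitting scheme as in the text, restated for a
# self-contained run.
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index) * 0.6))   # 60% training
  trainset <- dataframe[trainindex, ]
  otherset <- dataframe[-trainindex, ]
  otherIndex <- 1:nrow(otherset)
  validationIndex <- sample(otherIndex, trunc(length(otherIndex) / 2))
  list(trainset = trainset,
       validationset = otherset[validationIndex, ],   # 20% validation
       testset = otherset[-validationIndex, ])        # 20% testing
}

df <- data.frame(x = 1:100, y = rnorm(100))
parts <- splitdf(df, seed = 42)
sapply(parts, nrow)   # trainset 60, validationset 20, testset 20
```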
The table summarizes number of instances after prepro-
cessing phase.
Prog.  Noisy Data  Unique Data  Train (60%)  Valid (20%)  Test (20%)
1 2468 172 103 34 35
2 2473 222 133 44 45
3 2470 228 137 46 45
4 2475 217 130 43 44
5 2468 218 131 44 43
6 2479 250 150 50 50
7 2451 219 131 44 44
8 2476 201 121 40 40
9 2468 224 134 45 45
10 2472 197 118 39 40
11 2476 214 128 43 43
12 2472 191 115 38 38
13 2473 228 137 46 45
14 2467 210 126 42 42
15 2468 217 130 43 44
16 2470 168 101 34 33
17 2473 211 127 42 42
18 2478 245 147 49 49
19 2468 199 119 40 40
All 46945 4031 2418 806 807
Table 1. Division of the data into training/validation/testing sets
4.2. Runtime classification
To discretize the runtime values of each instance as Positive
(good performance) or Negative (not-so-good performance), we
used the scale function to apply z-normalization (centering at
the mean and dividing by the standard deviation σ).
The first rule applied was the following partition around
the mean:
• if (z-norm-runtime < 0) ⇒ "Good-performance"
• if (z-norm-runtime >= 0) ⇒ "Bad-performance"
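The mean-partition rule can be sketched as the following minimal Python example; the runtimes are hypothetical, and the paper's R scale function likewise subtracts the mean and divides by the sample standard deviation.

```python
# Hypothetical runtimes for one program (not from the paper's data).
runtimes = [1.2, 0.8, 1.5, 0.9, 1.1, 2.0]

n = len(runtimes)
mean = sum(runtimes) / n
# Sample standard deviation (denominator n - 1), matching R's scale().
sd = (sum((x - mean) ** 2 for x in runtimes) / (n - 1)) ** 0.5

# z-normalize, then partition around 0 (the mean of the z-scores).
z = [(x - mean) / sd for x in runtimes]
labels = ["Good-performance" if zi < 0 else "Bad-performance" for zi in z]
```

Because sd is positive, partitioning the z-scores at 0 is equivalent to partitioning the raw runtimes at their mean.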
This approach is based on the histogram of the items (see the
example in Figure 1 for program 1).
Box plots were also produced (Figure 2) to check for outliers
and inspect the distributions.
Figure 1. Histogram of program 1.
Figure 2. Box plot of program 1's runtime distribution.
The second rule applied was the following partition
around the quartiles:
• if (z-norm-runtime < 25% quartile) ⇒ "VeryGood-performance"
• if (z-norm-runtime < 50% quartile) ⇒ "Good-performance"
• if (z-norm-runtime < 75% quartile) ⇒ "Bad-performance"
• if (z-norm-runtime >= 75% quartile) ⇒ "VeryBad-performance"
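The quartile rule can be sketched similarly (a minimal Python example; the z-scores are hypothetical, and statistics.quantiles stands in for R's quantile function).

```python
import statistics

# Hypothetical z-normalized runtimes (not from the paper's data).
z = [-1.4, -0.6, -0.1, 0.3, 0.9, 1.6, -0.9, 0.5]

# 25%, 50% and 75% cut points.
q1, q2, q3 = statistics.quantiles(z, n=4)

def label(x):
    """First matching rule wins, mirroring the sequential rule list."""
    if x < q1:
        return "VeryGood-performance"
    if x < q2:
        return "Good-performance"
    if x < q3:
        return "Bad-performance"
    return "VeryBad-performance"

labels = [label(x) for x in z]
```

By construction the four classes are roughly balanced, with about a quarter of the instances falling into each.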
4.3. Results
4.3.1 Partition around the mean
The classification tree for the entire dataset is shown in Figure 3.
To summarize the individual trees found using the mean
separation, a table with the first 5 parameters found was
created (see Table 2).
4.3.2 Partition using quartiles
The classification tree for the entire dataset is shown in Figure 4.
Prog.  DT Height  Pruned  Par 1  Par 2  Par 3  Par 4  Par 5
1 1 1 basicaa=1
2 7 2 sroa=0 loop.rotate=0
3 10 10 simplifycfg=0 sroa=0 gvn=0 memcpyopt=0 jump.threading=1
4 8 3 instcombine=0 sroa=0 loop.rotate=1 adce=1 functionattrs=1
5 2 2 licm=0 loop.rotate=0
6 4 3 sroa=0 simplycfg=0 instcombine=0 basicaa=0
7 1 1 sroa=1
8 1 1 tailcallelim=0
9 4 3 sroa=0 inline=0 loop.rotate=1 loop.deletion=0
10 4 1 functionattrs=0 loop.simplify=0 verify=0 simplifycfg=0
11 1 1 sroa=0
12 1 1 sroa=0
13 6 2 sroa=0 loop.rotate=1 globalopt=0 loop.deletion=1 lcssa=1
14 3 3 sroa=0 inlinecost=0 gvn=0
15 10 2 loop.rotate=0 tailcallelim=1 deadargelim=0 instcombine=1
16 2 2 inline=0 loop.rotate=0
17 4 3 simplifycfg=0 sroa=0 basicaa=0 instcombine=0
18 5 5 loop.rotate=0 sccp=1 indvars=1 ipscco=1 early.cse=1
19 1 1 sroa=0
All 2 2 sroa=0 inline=0
Table 2. Tree size and top 5 parameters to find good performance for the mean partition
Figure 3. Classification tree for all programs using mean.
To summarize the individual trees found using the quartile
separation, a table with the first 5 parameters found was
created (see Table 3).
4.3.3 Classification tree quality measures
5. Conclusions and Future Work
Analysing Figure 3 and Figure 4, it is possible to identify
the set of features that are most important for optimizing
the code when using all the programs at the same time.
In both cases (mean and quartile separation), sroa=0 (occasionally
1) and inline=0 were the first parameters in common.
For the quartile separation it was also possible to
Figure 4. Classification tree for all programs using quartile.
use simplifycfg=0, gvn=0, basicaa=0, jump.threading=0 to
classify into the first quartile (which holds the best runtime
values). Table 2 presents how the solutions vary for each
program when the mean partition is applied. We used
cross-validation (with the test set) to find the pruned tree
height. Not all programs need sroa=0 and inline=0, as one
might expect. Some are common to
Prog.  Height  Pruned  Par 1  Par 2  Par 3  Par 4  Par 5
1 8 4 basicaa=1 licm=0 strip.dead.prototypes=1 preverify=1
2 10 10 sroa=0 strip.dead.prototypes=0 basiccg=1 basicaa=1 scalar.evolution=1
3 7 7 sroa=0 simplifycfg=0 globalopt=1 memcep=1 loop.deletion=1
4 3 7 instcombine=0 sroa=0 loop.rotate=1
5 6 6 licm=0 loop.rotate=0 loop.idiom=1 instcombine=0 strip.dead.prototypes=0
6 7 2 sroa=0 loop.rotate=1
7 9 3 sroa=1 simplifycfg=0 basicaa=1
8 8 6 tailcallelim=0 basicaa=0 memdep=1 early.cse=1 loop.unroll=1
9 8 4 sroa=0 inline=0 loop.rotate=1 instcombine=1
10 9 9 loop.rotate=1 memdep=0 simplifycfg=0 basicaa=0 preverify=1
11 10 7 sroa=0 strip.dead.prototypes=0 basiccg=1 deadargelim=0 domtree=0
12 7 7 sroa=0 instcombine=1 loop.rotate=0 targetlibinfo=0 prune.eh=1
13 10 1 loop.rotate=1
14 7 2 sroa=0 inline.cost=0
15 6 5 loop.rotate=0 tailcallelim=0 prune.eh=0 correlated.propagation=0 preverify=0
16 8 7 inline=0 loop.rotate=0 jump.threading=0 targetlibinfo=1 notti=1
17 11 5 sroa=0 simplifycfg=0 basicaa=0 instcombine=0 deadargelim=0
18 8 3 sroa=0 indvars=0 constmerge=0
19 7 1 loop.rotate=1
All 6 6 sroa=0 inline=0 simplifycfg=0 gvn=0 basicaa=0
Table 3. Tree size and top 5 parameters for the quartile partition to find good performance
Prog.  Prec. (Mean)  Prec. (Quartiles)  Acc. (Mean)  Acc. (Quartiles)  AUC (Mean)  AUC (Quartiles)
1 0.50 0.65 0.74 0.59 0.54 0.59
2 0.69 0.57 0.64 0.57 0.61 0.42
3 0.31 0.54 0.72 0.61 0.52 0.64
4 0.76 0.71 0.74 0.70 0.68 0.70
5 0.82 0.50 0.84 0.59 0.78 0.59
6 0.91 0.74 0.80 0.72 0.80 0.72
7 0.80 0.63 0.57 0.61 0.55 0.52
8 0.75 0.50 0.75 0.58 0.45 0.61
9 0.91 0.91 0.87 0.80 0.82 0.81
10 0.62 0.50 0.62 0.56 0.49 0.58
11 0.68 0.44 0.70 0.60 0.61 0.53
12 0.94 0.76 0.95 0.66 0.83 0.67
13 0.71 0.37 0.70 0.63 0.65 0.41
14 0.79 0.68 0.76 0.69 0.68 0.68
15 0.76 0.61 0.70 0.60 0.61 0.50
16 0.74 0.57 0.79 0.62 0.75 0.63
17 0.85 0.63 0.81 0.69 0.80 0.70
18 0.82 0.74 0.73 0.84 0.66 0.85
19 0.80 0.67 0.80 0.65 0.50 0.65
All 0.67 0.57 0.68 0.60 0.59 0.60
Table 4. Comparison of the quality measures for each kind of partition
both partitions (mean and quartile), such as loop.rotate
(ambiguous: sometimes 0, sometimes 1), adce=1, instcombine=0,
licm=0, simplifycfg=0, and tailcallelim (sometimes 1, sometimes
0). Other variables, such as functionattrs=1, loop.deletion=1,
lcssa=1, gvn=0, sccp=1, indvars=1, ipscco=1 and early.cse=1,
also helped to classify these specific programs using the mean
partition. Table 3 presents how the solutions vary for each
program when the quartile partition is used. The main difference
is that the tree height is greater for the quartile partition
because the number of classes is also greater. But some new
variables appeared (specifically when applying quartiles),
such as strip.dead.prototypes=1, globalopt=1, memdep=1,
loop.deletion=1, prune.eh=1, early.cse=1 and loop.unroll=1.
Regarding the quality of the trees, Table 4 summarizes what
was found. The results for the individual programs used the
training versus validation sets, while the run over all programs
used the training versus test sets to avoid data leakage and
improve confidence in the analysis. From a general perspective
(using all programs), the mean partition gave the best precision
(67%) and accuracy (68%), against 57% and 60% respectively
for the quartile partition. The AUC was not very high (around
60%) in either case. In Table 4, cases equal to or higher than
80% are highlighted in bold, and cases equal to or lower than
50% are underlined. Program 9 was the easiest one on which to
reach good quality levels, while programs 1 and 3 had the worst
results (50% and 31% precision). For programs 1 and 3, the
quartile separation gave better results (though modest, at 65%
and 54% precision); a possible explanation is that both have
few good examples to train on. Programs 5, 6, 12, 17 and 18
individually had good classification results for the mean
separation. Programs 10 and 11 had few bad examples to train
on and presented an intermediate classification quality.