ISEN 613_Team3_Final Project Report

ISEN 613- Engineering Data Analysis
Naman Kapoor
Vinayak Nair
Rahul Garg
Omkar Deshpande
Adriana De La Cruz
Multi-Attribute Classification of
Steel Plate Defects
Team 3

Executive Summary
Anomaly detection is vital in the industry and can be the difference between success and bankruptcy. Manufacturing
processes need to be continuously monitored so that any change in the process can be quickly identified and controlled
so that there is no production loss.
This project deals with the prediction of faults that can occur in the manufacturing of the steel plates by taking into
consideration the available historical data. The main objective of this project is to compare the working of different
classification models and decide one final model that will have the least misclassification rate (high prediction accuracy).
Many studies have been done by researchers on the comparative performances of multiclass classification techniques.
This project adds a new dimension by drawing comparisons between error rates of multiclass techniques and individual
classification techniques for each class. Although modelling for individual defects gives a very high accuracy rate, the
combined practical hierarchical model will not be as efficient due to the fact that its accuracy is a product of the
individual accuracies of the models used in the hierarchical model.
In this project, performances of techniques such as Linear Discriminant Analysis, Logistic Regression (individual and
multivariate), Random Forests (individual and multivariate), Single Decision Trees, Bagging, Support Vector Machines,
and Artificial Neural Networks have been compared and analyzed. Principal Component Analysis was also used to reduce
the dimensions of the given data.
The challenge was to decide whether to consider different models for all seven defects or to build a model for all defects
combined. The dataset also had many attributes and hence, it was difficult to select the most significant predictors and
avoid over-fitting. As there were more than two responses the coding that was used for binary classification wasn’t
applicable and had to deal with multi classification and new coding techniques had to be explored. Methods like
Artificial Neural Networks and C5.0 were used. These methods were completely new and efforts were required in terms
of literature review and coding to implement them.
The following table gives the misclassification error rate and area under curve for different modeling techniques used to
model the problem.
Modeling Technique Misclassification Error Rate
(percentage)
Area Under Curve
LDA 32.4 0.790
Decision Tree 36.0 0.784
Bagging 20.8 0.824
Random Forest 22.8 0.797
SVM 27.6 0.804
Neural Network Analysis 53.9 0.605
C5.0 19.4 0.831
From the above table it is clear that C5.0 modeling technique gives the least misclassification error which is 19.4 % and
also highest area under the ROC curve.
The main objective of this project was to compare the working of different classification models built using different
modeling techniques and propose one final model that will have the least misclassification rate (high prediction
accuracy). Thus, the results show the successful completion of the objectives. The final model proposed to be
implemented to predict the faults is C5.0. This method has an 81.6% prediction accuracy.

INTRODUCTION
Importance of the problem:
Present era is the era of quality and in today’s world of cut-throat competition and large scale production, only those
manufacturers survive who can provide good quality products and services that meet or exceed the expectations of the
customers. There is a need to continuously monitor the ongoing manufacturing processes so that any change in the
process can be identified quickly and rectified to prevent production loss. In manufacturing, operations managers can
use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among
discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many
global manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and
the capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets,
aggregating them, and analyzing them to reveal important insights. In the steel industry, specifically alloy steel, creating
different defective products can impose a high cost for steel product manufacturer. One common fault out of all others
in producing low carbon steel grades is Pits & Blister defect. To remove this drawback, we need to grind the surface of
the steel product. Grinding cause waste of time and involved cost of the production will be increased. Incidence of
defects analysis is related to numerous factors including material analysis, production processes etc. So, if we are able to
correctly predict these defects based on the important parameters then in a way we know which of the parameters to
be controlled with high level of accuracy to minimize the defects and hence the defects. Thus the problem at hand in
this project deals with the data from a steel industry and the results obtained from this project can be used to predict
the faults and implement necessary changes.
Objective:
This project deals with the prediction of faults that can occur in the manufacturing of the steel plates by taking into
consideration the available historical data. The main objective of this project is to compare the working of different
classification models built using different classification techniques and propose one final model that will have the least
misclassification rate (high prediction accuracy). Various data mining techniques can be used to predict the steel
plate faults from the given data. In this project, the results of classification techniques such as Linear Discriminant
Analysis, Logistic Regression (individual and multivariate), Random Forests (individual and multivariate), Single Decision
Trees, Bagging, Support Vector Machines, and Artificial Neural Networks have been compared and the best model is
proposed. The model building also uses Principal Component Analysis to reduce the dimensions of the given data.
Scope of Work:
13th-15th 16th-19th 20th-23rd 24th-27th 28th-31st 1st-4th 5th-8th 9th-12th 12th-15th
1 Retrievingdataandunderstandingitsdetails
2 LiteratureReview&selectingsuitablesupervisedneuralnetworkmethod
3 Modelbuildingusingclassificationtechniqueslearnedinclass
4 Modelbuildingusingselectedneuralnetworkmethod:
5 Predictingresultsandconcludingthebestmodelingmethod
6 Reportmaking&documentation
SNo Actvity
November'2015
GANTChart
December'2015

LITERATURE REVIEWS
Following were the papers selected:
1. Steel Plates Faults Diagnosis with Data Mining Models. Fakhr,M., Elsayad, A. M. (2012). (Reviewed by Naman Kapoor)
2. Machine Learning Techniques for Anomaly Detection: An Overview. Omar,S. Ngadi,A. and Jebur, H. H. (2013).
(Reviewed by Naman Kapoor)
3. Neuralnet: Training of neural networks. Frauke Günther and Stefan Fritsch (2008). (Reviewed by Omkar
Deshpande)
4. An Empirical Comparison of Supervised Learning Algorithms. Caruana, R., & Niculescu-Mizil, A. (2006). (Reviewed
by Omkar Deshpande)
5. A SVM-based pipeline leakage detection and pre-warning system. Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin.
(2010). (Reviewed by Rahul Garg)
6. Steel faults diagnosis under predictive analysis. Sanjay Jain1, Chandreshekhar Azad2, Vijay Kumar Jha3, (2013).
(Reviewed by Rahul Garg)
7. Classification of EEG signals using neural network and logistic regression. A. Subasi and E. Erçelebi. (2005).
(Reviewed by Rahul Garg)
8. A study of decision tree ensembles and feature selection for steel plates faults detection. Halawani, M. (2014).
(Reviewed by Vinayak Nair)
9. Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Tsoi, A.C., Pearson, R.A.
(1991). (Reviewed by Vinayak Nair)
10. Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Pohar, M., Blas, M., &
Turk, S. (2004). (Reviewed by Vinayak Nair)
Combined Takeaways
 Advanced decision trees are extremely efficient modeling techniques for multiclass classification problems.
 Artificial Neural Networks are a very powerful and complex algorithms but they have certain issues of
convergence and variable selection which need to be addressed.
 Supervised machine learning techniques significantly outperform the unsupervised ones when it comes to multi
classification problems.
 LDA is advisable in comparison to logistic regression, when the variables are normally distributed.

Reviewed by: Naman Kapoor
Steel Faults Diagnosis with Data Mining Models
-Mahmoud Fakhr, Alaa M.Elsayad "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science, vol. 8, no. 4,
pp. 506-514, 2012.
Objective:
The key problem this paper addresses is the formation of an appropriate intelligent data mining model for anomaly
detection in the manufacturing industry on a particular dataset. Addressing this problem is important due to the need to
create intelligent fault diagnostic models with the help of data mining to enhance the quality of manufacturing and to
lessen the cost of product testing. It not only helps to keep away product quality problems but also facilitates
precautionary maintenance. The key objective of this paper is to use predictive analytics to select the best classification
model for the selected steel plate faults detection dataset by comparing different models using certain statistical
measures. The authors have addressed this problem by evaluating the performances of three of the popular and
effective data mining models (using supervised learning techniques) on the selected dataset and have presented their
views and outcomes on these. From their approach the authors found that the C5.0 decision tree with boosting achieved
the best results on the dataset which implies that decision trees have a greater impact on fault diagnosis than fellow
supervised learning techniques.
Approach:
The authors approached the problem by performing three multi classification techniques namely C5.0 decision tree
(C5.0 DT) with boosting, Multi Perception Neural Network (MLPNN) with pruning and Logistic Regression (LR) with step
forward on the steel plates fault dataset obtained from the University of California at Irvine (UCI) machine learning
repository. These models were formulated to diagnose seven commonly occurring faults of steel plate namely: Pastry,
Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps and other faults. A brief description of the techniques used is presented
below:
I. C5.0 decision tree
The C5.0 DT algorithm is an improved version of the C4.5 and ID3 algorithms. C5.0 uses information gain as a measure of
purity, which is based on the notion of entropy. This method proved to be a major take out for us for our project.
The three methods used in the C5.0 tree construction are boosting, pruning and winnowing. While boosting and pruning
were known to us we were introduced to the concept of winnowing which preselects a subset of the attributes that are
selected to construct the tree. Winnowing ensures that the attributes which are irrelevant are excluded from the tree
building process. The authors used only 13 attributes (from 27) from the dataset to build the C5.0 tree.
II. Multilayer Perceptron Neural Network (MLPNN)
Artificial Neural Networks (ANNs) are biologically motivated and highly sophisticated analytical techniques capable of
modelling extremely complex nonlinear functions. The technique used to address this problem: MLPNN is considered a
powerful function approximate for prediction and classification problems, its structure is organized into layers of
neurons input, output and hidden layer. The MLPNN was trained using the Back Propagation (BP) training technique.
In this study the network was trained using the pruning approach which starts with a large network and removes the
weakest neurons (prunes) in the hidden and input layers as the training proceeds.

III. Logistic Regression
Logistic regression is a nonlinear regression technique for the prediction of dichotomous (binary) class attribute in terms
of the predictive ones. This algorithm does not predict the class attribute but predicts the odds of its occurrence using
the log likelihood (logit) function.
Results
The performance of each model was evaluated using three statistical measures: classification accuracy (compliment of
misclassification error rate), sensitivity and specificity. These measures are define using the values of True Positive (TP),
True Negative (TN), False Positive (FP) and False Negative (FN).
These charts depict that the performances of the C5.0 learning algorithm is the best model for training and test subsets.
Neural network model is the second best and finally the logistic regression is the worst one.
Summary
The major takeaways from this study were as follows:
 Advanced Decision Trees (C5.0 DT) are a very powerful data mining tool to use in predictive analytics of
multiclass anomaly detection with very high accuracy.
 Multilayer Perceptron Neural Networks with back propagation is a complex algorithm which standard and
simple to implement, it still suffers from convergence issues and requires initialization and adjustment of many
individual parameters to optimize its performance.
 Logistic Regression while although a very powerful modeling tool, assumes that the class attribute (the log odds,
not the event itself) is linear in the coefficients of the predictive attributes. The right inputs must be chosen with
their functional relationship to the class attribute.
 Amount, quality and the measuring process of data are key components of diagnostic accuracy.

Reviewed by: Naman Kapoor
Machine Learning Techniques for Anomaly Detection: An Overview
-S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International
Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
Objective:
The key problem this paper addresses is the issue of anomaly detection in the industry. Through this paper the authors
try to aid anomaly detection with the aid of machine learning techniques. The key reason why addressing this problem is
important because even though after many years of research the anomaly detection community is still confronting
difficult problems, the authors aim to further research this issue. The key objective of his paper is to present an overview
of research directions for applying supervised and unsupervised methods for managing the problem of anomaly
detection. The authors address this problem by providing a general architecture of anomaly intrusion detection systems
and conduct detailed discussions on the various machine learning techniques that come under supervised and
unsupervised learning and discussing their strengths and weaknesses in handling anomaly detection.
Approach:
The authors approached the problem by comparing different techniques under supervised and unsupervised machine
learning techniques and bringing out their strengths and weaknesses on anomaly detection. An overview of the two
approaches is given below:
I. Supervised Anomaly Detection
Supervised methods (also known as classification methods) required a labelled training set containing both normal and
anomalous samples to construct the predictive model. Theoretically, supervised methods provide better detection rate
than semi-supervised and unsupervised methods, since they have access to more information. However, there exist
some technical issues, which make these methods seem not accurate as they are supposed to be.
II. Unsupervised Anomaly Detection
These techniques do not need training data. As alternative, they based on two basic assumptions. First, they presume
that most of the network connections are normal traffic and only a very small traffic percentage is abnormal. Second,
they anticipate that malicious traffic is statistically various from normal traffic. According to these two assumptions, data
groups of similar instances which appear frequently are assumed to be normal traffic, while infrequently instances which
considerably various from the majority of the instances are regarded to be malicious.
The different techniques compared are shown in the table below:
Supervised Macine Learning Unsupervised Macine Learning
K-Nearest Neighbours Self Organising Maps
Neural Networks K-means Clustering
Decision Trees Fuzzy C-means Clustering
Support Vevtor Machines Expectation-Maximization Meta
Machine Learning Techniques for Anomaly Detection

Results
The results are shown in the table below:
Summary
The major takeaways from this review were:
 Machine learning techniques have received considerable attention among the anomaly detection researchers.
 Anomaly detection comprises supervised techniques and unsupervised techniques.
 The experiments demonstrated that the supervised learning methods significantly outperform the unsupervised
ones if the test data contains no unknown attacks.
 Among the supervised methods, the best performance is achieved by the non-linear methods, such as SVM,
multi-layer perceptron and the rule-based methods.
 Techniques for unsupervised such as K-Means, SOM, and one class SVM achieved better performance over the
other techniques although they differ in their capabilities of detecting all attacks classes efficiently.

An Empirical Comparison of Supervised Learning Algorithms
Reviewed by: Omkar Deshpande
Reference: Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning Algorithms.
Proceedings Of The 23Rd International Conference On Machine Learning - ICML '06.
http://dx.doi.org/10.1145/1143844.1143865.
Objective:
The objective of this paper is to give an empirical comparison between supervised learning algorithms such as SVMs, neural nets,
logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted
stumps.
The main reason behind this paper is to publish comparison between these newly developed algorithms as the last comprehensive
empirical evaluation of supervised learning was the Statlog Project in the early 90’s.
The key objective of this paper is to provide the comparison between the algorithms based on variety of performance criteria such
as Precision/Recall, ROC, Lift, Accuracy, F-score, squared error, etc.
As a part of results after empirical comparison it was found that boosted trees were the best learning algorithm overall. Random
forests are close second, followed by un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that
performed poorest were naive bayes, logistic regression, decision trees, and boosted stumps. This implies that if a model is trained
using boosted trees it will give the best performance in predicting values as compared to other methods like random forest, SVM
etc.
Approach:
The model used by the authors for this paper is ADULT, COV TYPE and LETTER are from the UCI Repository (Blake & Merz, 1998).
COV TYPE has been converted to a binary problem by treating the largest class as the positive and the rest as negative. Random
5000 cases were taken as training data set and rest as testing. Also from those 5000 cases 4000 were used for training and 1000
cases to calibrate the model. Now various parameters like ROC, accuracy, lift are calculated for different algorithms and then a
column is obtained which gives mean normalized score for the eight metrics when model selection is done by cheating and looking
at the final test sets. The means in this column represent the best performance that could be achieved with each learning method if
model selection were done optimally.
Results:
From comparison it is seen that the models which perform the best can perform poorly than the average performing models. For
example, the best models on ADULT are calibrated boosted stumps, random forests and bagged trees. Boosted trees perform much
worse. Bagged trees and random forests also perform very well on MG and SLAC. On MEDIS, the best models are random forests,
neural nets and logistic regression. The only models that never exhibit excellent performance on any problem are naive bayes and
memory-based learning. that boosted trees were the best learning algorithm overall. Random forests are close second, followed by
un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that performed poorest were naive bayes,
logistic regression, decision trees, and boosted stumps. This implies that if a model is trained using boosted trees it will give the best
performance in predicting values as compared to other methods like random forest, SVM etc. The table below gives the results that
were used for the comparison of the techniques.

Summary:
It can be seen that boosted trees were the best learning algorithm overall. Random forests are close second, followed by un-
calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that performed poorest were naive bayes,
logistic regression, decision trees, and boosted stumps. This implies that if a model is trained using boosted trees it will give the best
performance in predicting values as compared to other methods like random forest, SVM etc. But this is not always the case. We
need to select the parameters carefully and then select the technique which best works for that parameter. For example
Precision/Recall measures are used in information retrieval; medicine prefers ROC area; Lift is appropriate for some marketing tasks,
etc. So it can be said that for medicinal area the model which gives high performance when it comes to ROC will be the best model.

Reviewed By: Omkar Deshpande
Training of Neural Networks
Reference: Frauke Günther and Stefan Fritsch (2008) neuralnet: Training of Neural Networks. The R Journal Vol. 2/1,
June 2010
Objective:
The objective of this paper is to discuss the algorithm used in the neuralnet package and give its application in R. Also
discuss the advantages of the neuralnet package over other generalized linear models.
The main reason behind publishing this paper is to give the details of the neuralnet package developed by the authors
and also give the working example of it by using the infert dataset in R.
Artificial neural networks can be applied to approximate any complex functional relationship between input and output
variables. Unlike generalized linear models it is not necessary to pre specify the type of relationship between covariates
and response variables as for instance as linear combination. This makes artificial neural networks a valuable statistical
tool. They are in particular direct extensions of GLMs and can be applied in a similar manner.
Approach:
In this paper the authors first discuss the algorithm used in building the neuralnet package. Then the training of the
neuralnet model in R is discussed. Infert dataset is used for this purpose. The number of hidden neurons is determined
in relation to the needed complexity. A neural network with for example two hidden neurons is trained. Then the results
of backprop nnet and neuralnet are compared. Then the paper discusses the additional features such as the compute
function, confidence.interval function that come loaded in the neuralnet package.
Results:
This being an informative paper, discusses various functions available in neuralnet package and how to implement them
in R. A few comparative results provided in the paper include the comparison with the nnet package. For comparison,
neural networks are trained with the same parameter setting as above using neuralnet with algorithm="backprop" and
the package nnet. nn.bp and nn.nnet show equal results. Both training processes last only a very few iteration steps and
the error is approximately 158. Thus in this little comparison, the model fit is less satisfying than that achieved by
resilient backpropagation.

Summary:
This paper introduced multi layer perceptron and supervised learning. It also took into consideration the use of the
package neuralnet available in R for modeling functional relationships between covariates and response variables.
neuralnet contains a very flexible function that trains multilayer perceptrons to a given data set in the context of
regression analyses. It is a very flexible package since most parameters can be easily adapted. For example, the
activation function and the error function can be arbitrarily chosen and can be defined by the usual definition of
functions in R.

Reviewed By: Rahul Garg
A SVM-BASED PIPELINE LEAKAGE DETECTION AND PRE-WARNING SYSTEM
Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system",
Measurement, vol. 43, no. 4, pp. 513-519, 2010.
Objective:
This paper addresses the detection problem of pipeline leakages which may occur due to various reasons like manual
digging and illegal construction. This paper indicates the effectiveness of SVM over traditional machine learning
techniques which is based on the assumption of availability of infinite training data. The problem of gas leakages are a
concern for industries as they lead not only to huge monetary losses but also may have very tragic outcomes like
outbreak of diseases and even deaths. The timely detection of any suspected leakages can be very beneficial to the
industries as well as to general masses. The objective of this paper is to monitor and locate the possible abnormal events
(e.g. manual digging above a pipeline and illegal constructions, etc., which might cause a pipeline leakage) along pipeline
before a leakage takes place, a new pipeline leakage detection and pre-warning system. The authors of this paper have
employed SVM as the classifier to recognize these abnormal events. Three cases gas leakage, manual digging and human
walk above the pipeline were created and a series of experimental trials were used to train the model. Next, this model
was used to detect any abnormal events for classification and it provided quite accurate results. The authors found that
SVM can prove to be a lot better and accurate technique for predicting gas leakages along pipelines as compared to the
empirical risk minimization method. This implies that although SVM is comparatively a new technique but it is quite
accurate for predictive analytics in case of multi classification problems.
Approach:
The authors have followed the multi-classification approach of predictive data analytics. Since, there was no historical
data available for, the authors collected data for training the model by conducting various trials. Basically two types of
trials were done for this. One was the abnormal events identification and other abnormal events location trial. Three
cases namely gas leakage; manual digging and human walk above the pipeline were created and number of columns or
prediction terms is eight. Twenty samples were collected from each case randomly for training and ten samples were
chosen from every case to test the trained SVM model. The misclassification rate on the test data will tell, how
accurately the model has performed and whether it can be deployed in actual practices or not. For, training process
“one-against-one” method is employed. The multi-class SVM trained classifier which the authors obtained is shown in
below figure. The two axes are the first two predictors out of eight and the circled data points show the support vectors.
Results:

The detection results from SVM recognized correct cause of leakage more than 95 % of the times and locate abnormal
events quite accurately. Below photo shows the prediction results, where 1, 2 and 3 are three different categories of
abnormality. Out of the below results only sample 12 has been recognized incorrectly.
Summary:
This paper represented the problem of pipeline leakage detection which was a problem of multi classification predictive
analysis. The major take away from this review is that SVM can work quite accurately for multi classification especially in
cases where the training data is not very large like in this case of leakage detection. This technique is far better as
compared to the traditional machine learning methods like ERM. Among several methods available for multi-class
classification ‘‘one-against-one” SVM method is more suitable for practical use than others.

STEEL FAULTS DIAGNOSIS USING PREDICTIVE ANALYSIS
Sanjay Jain1, Chandreshekhar Azad2, Vijay Kumar Jha3, “Steel faults diagnosis under predictive analysis”, International
Journal of Computer Engineering and Applications, Volume IV, Issue II/III, Oct.13.
Objective:
The key problem which this paper discusses is the generation of various types of defects in manufactured steel plates
especially made of alloyed steel in steel industry. It is quite imperative to address this problem because rectifying these
defects by grinding or milling causes waste of time and augments the cost of production which could be prevented. This
paper aims at performing steel fault diagnosis using predictive analytics, so that the defect generation rate can be
minimized by finely tweaking the factors responsible for it. To address this problem the authors of this paper have used
classification modeling techniques namely Decision trees, Multilayer perceptron neural networks and Logistic regression
to develop a model which will diagnose the faults as accurately as possible. After developing different models it was
found that decision trees provide the best results as it has lowest misclassification rate. This implies that using Decision
Trees model could be a good option for steel fault diagnosis using data mining techniques.
Approach:
The data set used in this review has been taken from UCI repository and it classifies steel plate faults into seven different
types which makes this a case of multi classification predictive analysis. The authors of this paper have tried various
methods of classification and then selected the best method based on the misclassification rate. The methods used are
decision trees, multilayer perceptron neural networks and logistic regression. The C4.5 boosting algorithm with 10 trials
was used for decision trees. After all the models were built, and one out of the three was selected, the genetic algorithm
was used to find the best optimal solution, which works as per the following:
1. Initialize random population of n chromosomes
2. Evaluate the fitness value f(x) of each chromosome x in population
3. Create a new population by repeating following steps
 Select two parent chromosomes from given population according to fitness. (Chromosomes having better fitness
value, the bigger chance to be chosen)
 Cross over the parents to form a new offspring. If no crossover then offspring is an exact copy of the parents.
 Mutate new offspring at each locus.
 Place new offspring in the new population.
4. Use new generated population for a further execution.
5. If the end condition is fulfilled, stop, and return best solution in the current population.
6. Go to step 2
The best optimal solution chosen in this case was the solution number seven based on the output results shown by this
genetic algorithm.

Results:
The results from this review have been shown in the below table:
S No. Method Classification Accuracy Classification Error
1. Decision Tree 94.38 % 5.62 %
2. Multilayer Perceptron 83.87 % 16.13 %
3. Logistic Regression 72.64 % 27.36 %
The above table shows that out of the three classification techniques used, decision trees gave the best results as the
misclassification rate with decision trees is the least.
Summary:
This review helped me in gaining insights on the methods we can try for a multi classification predictive analytics
problem and various ways that can be used to improve those models. C4.5 algorithm can be used to improve decision
trees and pruning algorithm can be used to improve the multilayer perceptron model. Another important take away
from this review was the fact that boosted Decision Trees with C4.5 package performed the best in classifying various
steel defects out of the three models especially when the results have to be interpreted by humans.

CLASSIFICATION OF EEG SIGNALS USING NEURAL NETWORK AND LOGISTIC REGRESSION
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods
and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
Objective:
This paper is about the detection of epileptiform discharges in the EEG using logistic regression and artificial neural
network models. Epileptic seizure can occur in many different ways and EEG signals carry a lot of information and
accurate classification and evaluation of these signals may turn out be a breakthrough in medical science domain. This
paper aims to compare the traditional method of logistic regression to the more advanced neural network techniques, as
mathematical tools for developing classifiers for the detection of epileptic seizure in multi-channel EEG. The authors have
developed two different models using logistic regression and artificial neural networks. Multilayer perceptron neural
network (MLPNN) will be used with back propagation and Levenberg—Marquardt training algorithm. After this a comparison has
been done in both the methods. After comparing the results from both the papers, the authors concluded that the neural
network analysis proved to be a better model than the logistic regression. This implies that MLPNN is more accurate and
easier to build, as for developing logistic regression equations we start with no knowledge as to the best combination of the
parameters or the shape and degree of nonlinearity required to produce an optimal model.
Approach:
The EEG data used in this study was downloaded from 24-h EEG recorded from both epileptic patients and normal subjects. In order
to assess the performance of the classifier, 500 EEG segments were selected containing spike and wave complex, artifacts and
background normal EEG. Twenty absence seizures (petit mal) from five epileptic patients admitted for video-EEG monitoring were
analyzed. Next each of the signals was inspected by experienced neurologists to score epileptic and normal signals. After this
wavelet transform analysis was done as it captures transient features and localizes them in both time and frequency content
accurately. Next logistic regression and neural network classifiers were developed randomly selecting 300 examples out of 500
available as the training set and remaining 200 were kept for testing and validating the developed models. The selection of the
optimal network was based on monitoring the variation of error and some accuracy parameters as the network was expanded in the
hidden layer size and for each training cycle. The sum of square errors was used for choosing the optimal model and the optimum
number of nodes in the hidden layer was found to be 21. Finally after testing both the developed models, the best one was
chosen based on the misclassification error rate and sensitivity-specificity analysis. Below table shows the division of the
collected data as training and testing data.

Results:
Below table provides the results from comparing the two models on the basis of accuracy of classification and sensitivity and
specificity analysis on test data. We can clearly see that the MLPNN has more accuracy and larger area under the ROC curve.
Summary:
This paper helped in having a better understanding of the neural network analysis which is a new technique apart from the
techniques learnt in class. It provided insights on the procedure of choosing optimal number of hidden layers in the model and
limitations of logistic regression model. Another major take away from this is the evaluation and comparison of traditional logistic
regression model used for classification with a much newer multilayer perceptron neural network analysis. At last but not the least,
this paper made me aware of the wavelet transform analysis which is very effective in capturing transient features and localizing
them in both time and frequency domains.

Reviewed by: Vinayak Nair
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAULTS
DETECTION
Halawani, M. (2014). A study of decision tree ensembles and feature selection for steel plates faults detection.
International Journal of Technical Research and Applications, 2(4), 127-131.
1. Objective:
Detection of steel plate defects is a serious problem in the industry and often, they’re performed by human operators,
which is expensive and slow. This can be tackled if the process is automated. The paper shows the application of
decision tree ensembles for fault detection. Many decision tree ensembles random subspace, bagging, adaBoost.M1 and
random forests are used to perform the steel plate fault detection and the best method for this problem is found out.
The effect of removing insignificant features is also studied. The results suggest that random AdaBoost.M1 and random
subspace are the best ensemble methods with a prediction accuracy greater than 80%.
2. Approach:
Random subspaces, bagging, adaBoost.M1 and random forests classifier ensembles were performed on the UCI dataset
and the prediction accuracies were calculated. Different selection of predictors were also tried out.
3. Results:
The classification errors for the methods with different selection of predictors are tabulated below:
The classification error with all the predictors included:
The classification error with 20 most important predictors included:
The classification error with 15 most important predictors included:

Random Subspace performed the best for the first and third method. AdaBoost.M1 came first for the second method.
When the best 20 predictors were selected, results of all the methods improved except Random Subspaces. In the third
method, the performance of models reduced, which indicated some important predictors were missed out.
4. Summary:
The single decision tree model always gave bad results in comparison to Random Subspace, Adaboost.M1, Bagging and
Random Forests, which means that we’ll have to use decision tree ensembles in the project, as well. Feature selection is
also very important as selecting the most important predictors reduces the error rate.

COMPARISON OF THREE CLASSIFICATION TECHNIQUES, CART, C4.5 AND MULTI-LAYER
PERCEPTRONS
Tsoi, A.C., Pearson, R.A. (1991) Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons.
Advances in Neural Information Processing Systems 3 R. Morgan Kaufmann Publishers, San Mateo, CA. 963-969.
1. Objective:
There are many popular algorithms such as CART (Classification and regression tree), MLP (multilayer perceptron) and
C4.5. There is a need to know how these methods compare against each other. By comparing different methods on
constrained data, we can make qualitative statements about the methods. Hence, addressing this problem can help
individuals in making less mistakes while applying a particular method to practical problems.
The key objective is to compare 3 algorithms, CART, MLP and C4.5 on classification and generalization capabilities. The
algorithms are carried out on a version of the Penzias example and the results are summarized.
It was found that generally, the MLP has better classification and generalization accuracies compared with the other two
algorithms.
2. Approach:
For comparing the classification performance, data known as the clump example ( 8th order Penzias) was used. All 256
examples were used as both training and testing data sets.
For comparing the generalization performance, the same data is used and the first 200 example are set as training set
and the rest as testing.
Parameters used: In the MLP, both the learning rate and the momentum are set at 0.1. The architecture used is: 8 input
neurons, 5 hidden layer neurons, and 4 output neurons. In CART, the prior probability is set to be equi-probable. The
pruning is performed when the probability of the leaf node is equal 0.5. In C4.5, all the default values are used.
3. Results:
Following are the classification results:
Where mlp1 and mlp2 are the values related to the MLP when it has run for 10000 iterations and 100000 iterations
respectively. The MLP accuracies improve with the number of iterations (till about 20000 iterations).
Following are the generalization results on the same data:

The generalization accuracy of the MLP is observed to be better than CART, and is comparable to C4.5.
4. Summary:
It is found that the MLP once it is converged, in general, has a better classification and generalization accuracies
compared with CART, or C4.5. On the other hand it is also noted that the prediction errors made by each algorithm are
different. This indicates that there may be a possibility of combining these algorithms in such a way that their prediction
accuracies could be improved. This is presented as a challenge for future research.

COMPARISON OF LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS: A
SIMULATION STUDY
Pohar, M., Blas, M., & Turk, S. (2004). Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation
Study. Metodoloski zvezki, 1(1), 143-161.
1. Objective
Linear Discriminant Analysis (LDA) and Logistic Regression (LR) are two widely used statistical methods. Though both of
them can be used to develop linear classification models, we need to have a set of guidelines for proper selection. While
LR makes no assumptions on the distribution of the explanatory data, LDA has been developed for normally distributed
explanatory variables. The appropriate method to a problem would always give better results. The objective of the
paper is understand when to choose LDA and logistic regression. The two methods are compared and performance is
studied using simulations. The results of LDA and LR were found to be close whenever the normality assumptions are
not too badly violated, and some guidelines were set for recognizing these situations. The inappropriateness of LDA in all
other cases is discussed.
2. Approach:
The simplest and the most frequently used criterion for comparison between the two methods is classification error
(percent of incorrectly classified objects; CE). However, classification error is a very insensitive and statistically inefficient
measure (Harrell, 1997). Harrell and Lee (1985) proposed four different measures of comparing predictive accuracy of
the two methods. These measures are indexes A, B, C and Q. They are better and more efficient criteria for comparisons
and they tell us how well the models discriminate between the groups and/or how good the prediction is.
Where, Pk denotes an estimate of P(Yk=1|Xk) from (2.1) and I is an indicator function ,Pi is a probability of classification
into group i, Yi is the actual group membership (1 or 0), and n is the sample size of both populations.
Random samples of size n and m from two multivariate normal populations with different mean vectors, but equal
covariance matrix Σ. The mean vector of one group is always set at (0,0). The distance to the other one is measured
using Mahalanobis distance, while the direction is set as the angle (denoted by υ) to the direction of the eigenvector of
the covariance matrix.
Each sample is then randomly divided into two parts, a training and a test sample. The coefficients of LDA and LR are
computed using the first sample and then predictions are made in the second one. The sampling experiment is
replicated 50 times. Each time the indexes for both methods are computed. Finally, the average value of indexes and the
proportion of simulations in which LR performs better are recorded.
After sampling, the normally distributed variables can be categorized, either only one or both of them. The minimum
and maximum value are computed, then the whole interval is divided into a certain number of categories of equal size.
3. Results:
The sample size has the most obvious impact on the difference between methods. LDA assumes normality and the
errors it makes in prediction are only due to the errors in estimation of the mean and variance on the sample. On the
contrary, LR adapts itself to distribution and assumes nothing about it. Therefore, in the case of small samples, the
difference between the distribution of the training sample and that of the test sample can be substantial. But, as the
sample size increases, the sampling distributions become more stable which leads to better results for the LR.

Consequently, the results of the two methods are getting closer because the populations are normally distributed.
The results from Table 1 confirm this consideration. As the sample size increases, the LDA coefficient estimations
become more accurate and therefore all four indexes are improving. The LR indexes are increasing even faster, thus
approaching those of LDA. Decreasing difference between the two methods is best presented with the Q index, which is
the most sensitive one. As the differences between index means are negligible, it is also interesting to look at the
proportion of simulations where LR performs better. It can be seen that the value of rates to which we pay special
attention that of B index and of Q index, is constantly increasing.
In the case of other changes, the results of the two methods were found to remain very close; in fact LDA is only a little
bit better than LR.
Simulations are carried out to study the effects of categorization and non-linearity, but are not presented in this
literature review due to a lack of space. However, the major takeaways from the results have been summarized in the
next section.
4. Summary:
LDA is a more appropriate method when the explanatory variables are normally distributed. In the case of categorized
variables, LDA remains preferable and fails only when the number of categories is really small (2 or 3). The results of LR,

however, are in all these cases constantly close and a little worse than those of LDA. But whenever the assumptions of
LDA are not met, the usage of LDA is not justified, while LR gives good results regardless of the distribution. As the
estimates for LR are obtained by the maximum likelihood method, they have a number of nice asymptotic properties as
well.

Project Approach
Analysis Flow Chart:
Problem Description
Propose the best model with highest prediction accuracy that can be implemented in the steel plate manufacturing
process to detect faults during the process and thus help in reducing them by taking proper preventive measures. The
assumptions are
 The data available is the exact data that is taken from the production line and has no manipulations.

 The data is not biased and is randomly selected data from different production lines (if present) and collected over a
period of time.
Given Data
The data used for this project is taken from the UCI library. This dataset consists of 7 different steel plate faults and 27
attributes which contain the features of the steel plate manufactured and also the manufacturing process.
Data Set Information:
Type of dependent variables (7 Types of Steel Plates Faults):
1.Pastry
2.Z_Scratch
3.K_Scatch
4.Stains
5.Dirtiness
6.Bumps
7.Other_Faults
Attribute Information:
27 independent variables:
X_Minimum
X_Maximum
Y_Minimum
Y_Maximum
Pixels_Areas
X_Perimeter
Y_Perimeter
Sum_of_Luminosity
Minimum_of_Luminosity
Maximum_of_Luminosity
Length_of_Conveyer
TypeOfSteel_A300
TypeOfSteel_A400
Steel_Plate_Thickness
Edges_Index
Empty_Index
Square_Index
Outside_X_Index
Edges_X_Index
Edges_Y_Index
Outside_Global_Index
LogOfAreas
Log_X_Index
Log_Y_Index
Orientation_Index

Luminosity_Index
SigmoidOfAreas
Preliminary analysis of the data:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas
1 42 50 270900 270944 267
2 645 651 2538079 2538108 108
3 829 835 1553913 1553931 71
4 853 860 369370 369415 176
5 1289 1306 498078 498335 2409
6 430 441 100250 100337 630
X_Perimeter Y_Perimeter Sum_of_Luminosity
1 17 44 24220
2 10 30 11397
3 8 19 7972
4 13 45 18996
5 60 260 246930
6 20 87 62357
Minimum_of_Luminosity Maximum_of_Luminosity
1 76 108
2 84 123
3 99 125
4 99 126
5 37 126
6 64 127
Length_of_Conveyor TypeOfSteel_A300 TypeOfSteel_A400
1 1687 1 0
2 1687 1 0
3 1623 1 0
4 1353 0 1
5 1353 0 1
6 1387 0 1
Steel_Plate_Thickness Edges_Index Empty_Index
1 80 0.0498 0.2415
2 80 0.7647 0.3793
3 100 0.9710 0.3426
4 290 0.7287 0.4413
5 185 0.0695 0.4486
6 40 0.6200 0.3417
Square_Index Outside_X_Index Edges_X_Index Edges_Y_Index
1 0.1818 0.0047 0.4706 1.0000
2 0.2069 0.0036 0.6000 0.9667
3 0.3333 0.0037 0.7500 0.9474
4 0.1556 0.0052 0.5385 1.0000
5 0.0662 0.0126 0.2833 0.9885
6 0.1264 0.0079 0.5500 1.0000
Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index
1 1 2.4265 0.9031 1.6435
2 1 2.0334 0.7782 1.4624
3 1 1.8513 0.7782 1.2553
4 1 2.2455 0.8451 1.6532
5 1 3.3818 1.2305 2.4099
6 1 2.7993 1.0414 1.9395
Orientation_Index Luminosity_Index SigmoidOfAreas Pastry
1 0.8182 -0.2913 0.5822 1
2 0.7931 -0.1756 0.2984 1
3 0.6667 -0.1228 0.2150 1
4 0.8444 -0.1568 0.5212 1

5 0.9338 -0.1992 1.0000 1
6 0.8736 -0.2267 0.9874 1
Z_Scratch K_Scratch Stains Dirtiness Bumps Other_Faults
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
The preliminary data analysis shows that
 There is at least one defect associated with every row of attributes.
 There are 1941 entries of defects in the whole dataset which is actually equal to the total number of rows in the
dataset.
 Other_Faults account for the majority of the defects; almost 35% of the defects recorded are Other_Faults. Thus it
can be fairly predicted that the misclassification error may go high for this defect.
 No two defects are associated with a single row of input. That is only one defect occurs for a particular row of
attributes in the data.
Description of new techniques used:
The new techniques chosen for this project are neural network analysis and C5.0 Decision trees.
1) Artificial Neural Networks
What is a Neural Network?
An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous
systems, such as the brain, process information.
Type of Defect Number of Occurrences
Pastry (1) 158
Z_Scratch (2) 190
K_Scratch (3) 391
Stains (4) 72
Dirtiness (5) 55
Bumps (6) 402
Other_Faults (7) 673

Components of a neuron The synapse
The figure above shows the structure of the human neural system. In the human brain, a neuron gets signals from all the
parts of the body through huge number of Dendrites. The neuron then sends signals as electrical activity through Axon.
Learning occurs by change in the energy levels of the neurons.
What is Artificial Neural Networks?
Artificial neural network is a computing system made up of a number of simple, highly interconnected processing
elements, which process information by their dynamic state response to external inputs.
The structure of a neural-network algorithm has three layers:
 The input layer feeds past data values into the next (hidden) layer. The black circles represent nodes of the
neural network.
 The hidden layer encapsulates several complex functions that create predictors; often those functions are
hidden from the user. A set of nodes (black circles) at the hidden layer represents mathematical functions that
modify the input data; these functions are called neurons.
 The output layer collects the predictions made in the hidden layer and produces the final result: the model’s
prediction.
 Neurons in a neural network can use sigmoid functions to match inputs to outputs. When used that way, a
sigmoid function is called a logistic functionand its formula looks like this:
 f(input) = 1/(1+eoutput
)
Artificial Neural Network Representation
2) C5.0 Decision Trees
Decision tree can be considered as a system that allows organizing a huge amount of information graphically.

A decision tree consists of internal nodes that represent the decisions corresponding to the hyper-planes or split points
(i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which
are labeled with the majority class. A region is characterized by the subset of data points that lie in that region.
One of the advantages of decision trees is that they produce models that are relatively easy to interpret. In particular, a
tree can be read as set of decision rules, with each rule’s antecedent comprising the decisions on the internal nodes
along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint
and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.
An example of the decision tree is seen in the following figure.
C5.0 algorithm acts similar to ID3 but improves a few of ID3 behaviors:
The new features (versus ID3) are:
1) Accepts both continuous and discrete features
2) Handles incomplete data points
3) Pruning is already included in the package and thus the results are after pruning.
4) Ability to use attributes with different weights.
5) Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or
cores

IMPLEMENTATION
Linear Discriminant Analysis
The modified multiclass dataset was modeled using Linear Discriminant Analysis to obtain the confusion matrix
and misclassification error rate for the test dataset. K-fold cross validation was also performed on the test data to
confirm our results.
>##LDA
>lda.model= lda(alldefects~., data = train2)
>lda_pred= predict(lda.model, test2)
>table(lda_pred$class, test.alldefects)
test.alldefects
A B C D E F G
A 25 0 0 0 0 1 15
B 5 50 0 0 0 4 10
C 2 0 91 0 0 0 4
D 0 0 1 26 0 0 2
E 3 0 0 0 18 0 6
F 4 2 4 0 1 67 40
G 9 5 27 1 4 39 117
>mean(lda_pred$class!= test.alldefects)
[1] 0.3241852
>##CrossValidation
> lda.cv=lda(alldefects~.,test2, CV=TRUE)
>table(lda.cv$class,test.alldefects)
test.alldefects
A B C D E F G
A 24 0 1 0 1 1 15
B 5 48 0 0 0 5 10
C 0 0 104 0 0 0 2
D 1 0 1 24 0 0 1
E 3 0 0 0 16 0 9
F 4 3 1 0 1 61 42
G 11 6 16 3 5 44 115
>mean(lda.cv$class!= test.alldefects)
[1] 0.3276158
The misclassification and cross validation error were 32.42% and 32.76% respectively.
Decision Tree
A single decision tree was then modelled on the modified dataset. The tree was also pruned to reduce the number of
branches and simplify the tree.
> ##tree
>library(tree)
> tree1=tree(train2$alldefects~.,data=train2)
>plot(tree1)
>text(tree1 ,pretty =0)

>tree.pred=predict(tree1,test2,type="class")
>table(tree.pred ,test.alldefects)
test.alldefects
tree.pred A B C D E F G
A 0 0 0 0 0 0 0
B 3 51 0 0 0 2 5
C 0 0 98 0 0 0 2
D 0 0 0 23 0 0 1
E 0 0 0 0 0 0 0
F 6 0 0 1 7 80 56
G 39 6 25 3 16 29 130
>mean(tree.pred!=test.alldefects)
[1] 0.3447684
> ##pruning
>set.seed (1)
>cv.data =cv.tree(tree1 ,FUN=prune.misclass )
>plot(cv.data$size ,cv.data$dev ,type="b")
>plot(cv.data$k ,cv.data$dev ,type="b")
>prune.data = prune.misclass(tree1 ,best =9)
>plot(prune.data)
>text(prune.data,pretty =0)
> tree.pred2=predict(prune.data , test2 ,type="class")
>table(tree.pred2 ,test.alldefects)
test.alldefects
tree.pred2 A B C D E F G
A 0 0 0 0 0 0 0
B 3 51 0 0 0 2 5
C 0 0 88 0 0 1 8
D 0 0 0 23 0 0 1
E 0 0 0 0 0 0 0
F 2 0 0 0 1 59 28
G 43 6 35 4 22 49 152
>mean(tree.pred2!=test.alldefects)
[1] 0.3602058

The misclassification error rates obtained from the original and the pruned tree were 34.5% and 36% respectively. The
error rate did not increase a lot thus justifying pruning to make the decision tree more readable.
Bagging
Bagging was used on the dataset to reduce the variance obtained in a decision tree model by averaging the
observations and also effectively increasing the training datasets via bootstrapping.
> ## bAGGING
>set.seed (1)
>library(randomForest)
>bag.data =randomForest(alldefects~.,data=train2 , mtry=27,importance =TRUE)
>yhat.bag = predict (bag.data ,test2)
>plot(yhat.bag , test.alldefects)
>abline (0,1)
>table(yhat.bag, test.alldefects)
test.alldefects
yhat.bag A B C D E F G
A 30 0 0 0 0 5 7
B 0 50 0 0 0 1 0
C 0 0 112 0 0 0 1
D 0 0 0 24 0 0 1
E 1 0 0 0 19 1 2
F 4 0 0 1 2 76 32
G 13 7 11 2 2 28 151
>mean(yhat.bag!=test.alldefects)
[1] 0.2075472
The misclassification error rate obtained was 20.75%.
Random Forest
Random Forest bootstrapping method was applied on the modified dataset to de-correlate the bagged trees
thus further reducing the variance.

> #randomforest
>set.seed (1)
>rf =randomForest(alldefects~.,data=train2 , importance =TRUE)
>yhat.rf = predict (rf ,test2)
>table(yhat.rf, test.alldefects)
test.alldefects
yhat.rf A B C D E F G
A 25 0 0 0 0 6 5
B 1 50 0 0 0 0 4
C 0 0 112 0 0 0 1
D 0 0 0 24 0 0 1
E 0 0 0 0 19 0 3
F 3 1 0 1 2 73 33
G 19 6 11 2 2 32 147
>mean(yhat.rf !=test.alldefects)
[1] 0.2281304
The misclassification error rate obtained was 22.81% a bit higher than bagging but it ensures a lower variance which is generally
better to implement on future data points.
C5.0
An advanced decision tree technique known as C5.0 was also used to model the modified dataset.
> #C50
>crx<- data[ sample( nrow( data ) ), ]
> X <- crx[,1:27]
> y <- crx[,35]
>trainx<- X[1:1358,]
>trainy<- y[1:1358]
>testx<- X[1358:1941,]
>testy<- y[1358:1941]
>model<- C5.0( trainx, trainy, trials=75 )
> p <- predict( model, testx, type="class" )
>table(p, testy)
testy
p A B C D E F G
A 32 0 0 0 1 2 6
B 1 39 0 0 0 1 4
C 0 0 111 0 0 1 3
D 0 0 0 26 0 2 0
E 1 0 0 0 14 2 2
F 11 0 0 1 0 96 19
G 15 1 7 1 1 31 153
>mean(p != testy)
[1] 0.1934932
Support Vector Machines
SVM was tried on the training data set for different values of ‘C’ and the best results came out with C=15.
> svm.fit=svm(alldefects~.,data=train2,type="C",kernel="polynomial",degree=3, cost=15)
>summary(svm.fit)
Call:
svm(formula = alldefects ~ ., data = train2, type = "C", kernel = "polynomial",degree =
3, cost = 15)

Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 15
degree: 3
gamma: 0.03703703704
coef.0: 0
Number of Support Vectors: 819
( 43 225 347 87 74 18 25 )
Number of Classes: 7
Levels:
A B C D E F G
The plot for SVM on the training data has been shown in below figure. It is a 2-D plot with its axes as Edges_X _Index
and Edges_Y_Index. The circular symbols show the data points and the triangles show support vectors.
> predicted=predict(svm.fit,test2)
>table(predicted,test.alldefects)
test.alldefects
predicted A B C D E F G
A 28 1 0 0 0 6 7
B 0 49 0 0 0 3 6
C 0 1 113 0 0 0 4
D 0 0 0 25 0 0 0
E 0 0 0 0 17 0 3

F 6 0 3 1 5 69 53
G 14 6 7 1 1 33 121
>mean(predicted!=test.alldefects)
[1] 0.2761578045
The misclassification rate for SVM on testing data is about 27.61%.
Artificial Neural Networks
For this project artificial neural network model is developed by using different methods. First a model is tried using nnet
function in the nnet library in R. The second model is created using the multi level perceptron. This code uses the mlp
function provided in the RSNNS library in R.
1st
Method
In this method the model was built by using artificial neural network using the nnet function from the nnet library in R.
Many attributions were tried by changing the number of hidden layers expresses as size in the code. Also rang, decay
and matix were changed to get a lower misclassification rate. Finally the best model had 20 hidden layers and other
attributes as seen in the code.
train.nnet<-
nnet(alldefects~.,data=train2,size=20,rang=0.1,Hess=FALSE,decay=0.001,maxit=10000)
# weights: 707
initial value 2736.596055
iter 10 value 2082.470121
iter 20 value 2006.626658
iter 30 value 1963.775429
iter 40 value 1907.670254
iter 50 value 1901.104216
iter 60 value 1841.389091
iter 70 value 1815.725249
iter 80 value 1804.856698
iter 90 value 1801.382263
iter 100 value 1801.021638
iter 110 value 1797.549455
iter 120 value 1797.305184
iter 130 value 1797.183004
iter 140 value 1796.918336
iter 150 value 1795.256115
iter 160 value 1793.025804
final value 1792.714314
converged
test.nnet<-predict(train.nnet,test2,type=("class"))
table(test2$alldefects,test.nnet)
test.nnet
1 3 7
1 0 5 43
2 0 7 50
3 0 98 25
4 0 0 27
5 0 1 22
6 0 6 105
7 1 22 171
mean(test.nnet!=test2$alldefects)
[1] 0.5385935

The misclassification rate for ANN on testing data is about 53.8%.
2nd
method
In this method the artificial neural network model was built by using the mlp function available in RSNNS library in R.
This multi layer perceptron takes the predictors, responses, number of hidden layers as input.
> model=mlp(x[samp,], y[samp,], size=c(10,10,5),linOut=F)
> test.cl(y[-samp,], predict(model, x[-samp,]))
cres
true 3 7
1 3 42
2 2 52
3 50 63
4 0 25
5 0 16
6 3 110
7 16 200
> test.cl(y[samp,],fitted.values(model))
cres
true 3 7
1 5 108
2 6 130
3 119 159
4 0 47
5 4 35
6 4 285
7 33 424
The misclassification rate for ANN on testing data is about 60.01%.
It is seen that the misclassification rate of the model built by using artificial neural networks is very high. The least error
rate achieved is by using the nnet package, which is about 52%.
Logistic regression
We developed 7 different logistic models for classification of each type of defect as we noticed that each defect were
reliant on a different set of predictors. We aim to develop a hierarchical model in which we detect if a defect is
present or not; if present, we end the program (Dataset implies that each steel plate has only one kind of defect); and
if absent, we continue and check the presence of the next type of defect. And so on.
Logistic regression model for the first type of defect: Pastry
>train_pastry=train[,-c(29,30,31,32,33,34,35)]
>fix(train_pastry)
>test_pastry= test[,-c(29,30,31,32,33,34,35)]
>log_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orienta
tion_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minim
um_of_Luminosity,data=train_pastry,family = "binomial")
>summary(log_pastry)
Call:
glm(formula = Pastry ~ LogOfAreas + TypeOfSteel_A300 + Sum_of_Luminosity +
Log_X_Index + Square_Index + Orientation_Index + Log_Y_Index +
Maximum_of_Luminosity + X_Maximum + X_Minimum + Length_of_Conveyor +
Minimum_of_Luminosity, family = "binomial", data = train_pastry)
Deviance Residuals:

Min 1Q Median 3Q Max
-2.01159 -0.26886 -0.05515 0.00000 3.08897
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.178e+01 3.366e+00 -3.500 0.000465 ***
LogOfAreas 4.551e+00 2.015e+00 2.258 0.023940 *
TypeOfSteel_A300 -5.827e-01 3.077e-01 -1.894 0.058260 .
Sum_of_Luminosity 7.952e-06 2.324e-06 3.422 0.000621 ***
Log_X_Index 1.377e+01 6.238e+00 2.207 0.027308 *
Square_Index -4.378e+00 1.186e+00 -3.692 0.000223 ***
Orientation_Index 4.450e+00 1.267e+00 3.512 0.000445 ***
Log_Y_Index -9.260e+00 2.309e+00 -4.010 6.07e-05 ***
Maximum_of_Luminosity 3.396e-02 9.677e-03 3.510 0.000449 ***
X_Maximum -7.516e-01 2.283e-01 -3.292 0.000995 ***
X_Minimum 7.521e-01 2.283e-01 3.294 0.000987 ***
Length_of_Conveyor 3.830e-03 8.421e-04 4.548 5.41e-06 ***
Minimum_of_Luminosity -4.196e-02 8.145e-03 -5.151 2.59e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 427.59 on 1345 degrees of freedom
AIC: 453.59
Number of Fisher Scoring iterations: 14
>log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
>log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment
>log_pastry_pred_y[log_pastry_pred> 0.5]= 1
>table(log_pastry_pred_y, test_pastry[,28])
log_pastry_pred_y 0 1
0 528 33
1 7 15
>mean(log_pastry_pred_y != test_pastry[,28])
[1] 0.06861063
We see that the misclassification error rate is less than 7% which is acceptable for this
individual model. We have also cross-validated these results using K-Fold cross validation
Technique.
># cross validation
>cv_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orienta
tion_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minim
um_of_Luminosity,data=train_pastry,family = "binomial")
>cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
1] 0.06969064
Similarly, individual classification models were developed for rest of the defects and the results obtained are
tabulated below:

The combined accuracy of the hierarchical model would be = (1-0.068)*(1- 0.041)*(1-0.045)*(1-0.0189)*(1-0.031)*(1-
0.16)*(1-0.252) = 0.51= 51%
* Since the defects are independent of each other the probabilities of the individual models being right will be
multiplied.
Random Forest
Individual responses were then modeled with random forest to get the respective error rates.
> ##randomforest with Pastry only
>train_pastry$Pastry=factor(train_pastry$Pastry)
>test_pastry$Pastry=factor(test_pastry$Pastry)
>rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
>yhat.rf_pastry = predict (rf_pastry ,test_pastry)
>table(yhat.rf_pastry, test_pastry[,28])
yhat.rf_pastry 0 1
0 529 32
1 6 16
>mean(yhat.rf_pastry!=test_pastry[,28])
[1] 0.0651801
Defect Confusion matrix Error rate CV error
0 528 33
1 7 15
log_zs_pred_y 0 1
0 511 9
1 15 48
log_ks_pred_y 0 1
0 453 19
1 7 104
log_stains_pred_y 0 1
0 554 9
1 2 18
log_dirt_pred_y 0 1
0 554 12
1 6 11
log_bumps_pred_y 0 1
0 447 68
1 25 43
log_of_pred_y 0 1
0 346 104
1 43 90
0.176
0.069
0.030
0.018
0.020
0.015
0.124
All Faults
0.068
0.041
0.045
0.0189
0.031
0.16
0.252
Pastry
Z_Scratch
K_Scratch
Stains
Dirtiness
Bumps

Similarly, random forest models were developed for other individual defects with the
results tabulated below:
The combined accuracy of the hierarchical model would be = (1-0.065)*(1- 0.022)*(1-0.0257)*(1-0.005)*(1-0.0189)*(1-
0.111)*(1-0.17) = 0.6417= 64.17%
multiplied.
Principal Component Analysis
A dimensional reduction technique was conducted on the dataset to apply the 80/20 rule and extract the “Vital Few”
from the “Trivial Many” prediction terms.
>#PCA on complete data set
>datap=data[,-(28:35)]
>fit <- princomp(datap, cor=TRUE)
>summary(fit) # print variance accounted for
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.7 Comp.8 Comp.9
Standard deviation 2.8815442 1.8493487 1.6443472 1.49596665 1.40409954 1.27423486
1.17387303 0.99985628 0.96006830
Defect Confusion matrix Error rate
yhat.rf_pastry 0 1
0 529 32
1 6 16
yhat.rf_zs 0 1
0 522 9
1 4 48
yhat.rf_ks 0 1
0 459 14
1 1 109
yhat.rf_stains 0 1
0 556 3
1 0 24
yhat.rf_dirt 0 1
0 558 9
1 2 14
yhat.rf_bumps 0 1
0 460 53
1 12 58
yhat.rf_of 0 1
0 363 73
1 26 121
Other Faults
0.065
0.022
0.0257
0.0051
0.0189
0.111
0.17
Pastry
Z_Scratch
K_Scratch
Stains
Dirtiness
Bumps

Proportion of Variance 0.3075295 0.1266700 0.1001436 0.08288579 0.07301835 0.06013609
0.05103622 0.03702639 0.03413819
Cumulative Proportion 0.3075295 0.4341995 0.5343431 0.61722894 0.69024729 0.75038338
0.80141960 0.83844599 0.87258418
Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Comp.16 Comp.17
Standard deviation 0.88369299 0.84586524 0.73974691 0.62701635 0.54299087 0.489154397
0.434472983 0.316363504
Proportion of Variance 0.02892271 0.02649956 0.02026761 0.01456109 0.01091997 0.008861927
0.006991362 0.003706884
Cumulative Proportion 0.90150689 0.92800645 0.94827406 0.96283515 0.97375512 0.982617047
0.989608409 0.993315293
Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
Comp.23 Comp.24 Comp.25
Standard deviation 0.243539500 0.235660669 0.211674618 0.1093589373 0.0837014355
3.693381e-02 2.217081e-02 3.543139e-03
Proportion of Variance 0.002196722 0.002056887 0.001659487 0.0004429399 0.0002594789
5.052246e-05 1.820536e-05 4.649569e-07
Cumulative Proportion 0.995512015 0.997568902 0.999228388 0.9996713283 0.9999308072
9.999813e-01 9.999995e-01 1.000000e+00
Comp.26 Comp.27
Standard deviation 5.417823e-06 1.218954e-08
Proportion of Variance 1.087141e-12 5.503141e-18
Cumulative Proportion 1.000000e+00 1.000000e+00
>plot(fit,type="lines") # scree plot
>biplot(fit)
From the analysis and the scree plot first 7 principal components were selected as they seemed to explain just over
80% of the variability in the sample space.
The principal components were then extracted and were stored in two different files:
 One with only the top 7 principal components
 Another with the original data and the top 7 principal components combined
>axes<- predict(fit, newdata = datap)
>fix(axes)

> data1=axes[,1:7]
>fix(data1)
>write.csv(data1,file="pcadata.csv") #data file with the top 7 PCs
>data2=data.frame(data,data1)
>write.csv(data2,file="comb_data.csv")#data file with the original data and the 7 PCs combined
These two data files were used for further modelling.
Model Formation with principal components
Logistic Regression
Logistic Regression was performed on the extracted principal components for individual responses.
Logistic regression model for the first type of defect: Pastry (Using the PCs)
>#######################logistic regression
> #00000000000000000000_pastry
>train_pastry=train[,-c(9,10,11,12,13,14,15)]
>fix(train_pastry)
>test_pastry= test[,-c(9,10,11,12,13,14,15)]
>log_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")
>summary(log_pastry)
Call:
glm(formula = Pastry ~ ., family = "binomial", data = train_pastry)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5477 -0.4039 -0.1326 -0.0243 3.5952
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.51776 0.38758 -11.656 < 2e-16 ***
Comp.1 -0.63814 0.11686 -5.461 4.75e-08 ***
Comp.2 -1.17226 0.14334 -8.178 2.88e-16 ***
Comp.3 -0.37291 0.08243 -4.524 6.07e-06 ***
Comp.4 -0.25626 0.08068 -3.176 0.00149 **
Comp.5 0.42288 0.07491 5.645 1.65e-08 ***
Comp.6 0.17290 0.09726 1.778 0.07546 .
Comp.7 0.40363 0.10328 3.908 9.31e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 541.13 on 1350 degrees of freedom
AIC: 557.13
Number of Fisher Scoring iterations: 8
>log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
>log_pastry_pred_y = rep(0, length(test_pastry[,8])) # default assignment
>log_pastry_pred_y[log_pastry_pred> 0.5]= 1
>table(log_pastry_pred_y, test_pastry[,8])
0 529 44
1 6 4

>mean(log_pastry_pred_y != test_pastry[,8])
[1] 0.08576329
> # cross validation
>cv_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")
>cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
[1] 0.06398761
Similarly, individual classification models were developed for rest of the defects and the results obtained are
tabulated below:
The combined accuracy of the hierarchical model would be = (1-0.0857)*(1- 0.0857)*(1-0.0634)*(1-0.0172)*(1-
0.0446)*(1-0.19)*(1-0.285) = 0.4258= 42.58%
multiplied.
Random Forest
The dataset consisting of the principal components was then used with a Random Forest model.
> #randomforest with important predictors
>set.seed (1)
> #randomforest with Pastry only
>set.seed (1)
>train_pastry$Pastry=factor(train_pastry$Pastry)
>test_pastry$Pastry=factor(test_pastry$Pastry)
Defect Confusion matrix Error rate CV error
0 529 44
1 6 4
log_zs_pred_y 0 1
0 506 30
1 20 27
log_ks_pred_y 0 1
0 452 29
1 8 94
log_stains_pred_y 0 1
0 553 7
1 3 20
log_dirt_pred_y 0 1
0 557 23
1 3 0
log_bumps_pred_y 0 1
0 447 86
1 25 25
log_of_pred_y 0 1
0 344 121
1 45 73
Other Faults 0.285
0.064
0.0525
0.0257
0.0157
0.0224
0.14
0.198
Stains 0.0172
Dirtiness 0.0446
Bumps 0.19
Pastry 0.0857
Z_Scratch 0.0857
K_Scratch 0.0634

>rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
>yhat.rf_pastry = predict (rf_pastry ,test_pastry)
>table(yhat.rf_pastry, test_pastry[,8])
yhat.rf_pastry 0 1
0 529 38
1 6 10
>mean(yhat.rf_pastry!=test_pastry[,8])
[1] 0.0754717
With the model implementation showing that random forest was giving better results (and rightly so) it was decided
to model all the individual responses with random forest using both the datasets (with only the 7 PCs and the one
with the original predictors + 7 PCs).
The results obtained are tabulated below.
S No. Typeofdefect using 1stsevenPC's using all27predictors
Using allpredictorsand7
PC's
1 Pastry(A) 0.075 0.065 0.065
2 Z-scratch(B) 0.046 0.022 0.024
3 K-Scratch(C) 0.036 0.026 0.027
4 Stains(D) 0.012 0.005 0.005
5 Dirtiness(E) 0.019 0.019 0.015
6 Bumps(F) 0.042 0.111 0.127
7 OtherDefects(G) 0.196 0.170 0.168
0.635 0.641 0.631
Misclassificationerrorratesfor individualRANDOMFORESTSwithdifferentpredictionterms
Accuracy forthecombinedmodel

RESULTS
ROC Analysis
ROC analysis was conducted on the different multiclass models to aid us in selecting the best model.
S No.
Modelling
Technique
Confusion Matrix
Missclassification
Error rate
ROC curve AUC*
1 LDA 0.324 0.065 0.790
2
Decision
Tree (After
Pruning)
0.360 0.784
3 Bagging 0.208 0.824
4
Random
Forest
0.228 0.797
5 SVM 0.276 0.804
6
Neural
Network
Analysis
0.539 0.605
7 C 5.0 0.194 0.831
Comparision of different Multiclass Models

The C5.0 Decision Tree had the best performance on the testing dataset.
ROC Analysis was also conducted for the logistic regression and random forest models that were developed for each
individual defects, the results are tabulated below:
ROC for Individual defects using Logistic Regression:
ROC Curves
AUC Values
Logistic reg
AUC
Other Faults
0.65 0.91 0.92 0.83 0.73 0.67 0.68
Pastry Z_Scratch K_Scratch Stains Dirtiness Bumps

ROC for individual defects using Random Forest
ROC Curves
AUC Values
Random Forest
AUC 0.67 0.68
K_Scratch Stains Dirtiness Bumps Other Faults
0.65 0.91 0.92 0.83 0.73
Pastry Z_Scratch

CONCLUSION
The major takeaways from this project were:
 Advanced decision trees such as C5.0 DT and Random Forest are the most efficient techniques in dealing with
multiclass anomaly detection using machine learning.
 Although modelling for individual defects gives a very high accuracy rate for almost every one of them, the
combined hierarchical model that will utilize them in practice will not be as efficient due to the fact that its
accuracy is a product of the individual accuracies of the models used in the hierarchical model.
 Logistic Regression although a very powerful tool doesn’t seem to be a good fit for multiclass anomaly detection
problems due to the fact that the logistic regression model does not predict the type of defect but rather the
probability of that defect occurring using the log likelihood function.
 SVM also turned out as a good tool for conducting multi classification as its accuracy rate was very high, but we
still prefer C5.0 over SVM as it was not the best because of very high number of support vectors.
 Artificial Neural Network results on this dataset were not satisfactory with very high misclassification rate. This
can be because of many reasons but the major reason is a small dataset. Also there is no method to fine tune
the number of hidden layers and a slight change in the number of hidden layer causes significant
misclassification. So it can be said that either the model was right value of parameters even after a lot of trial
and errors or the model was not trained properly because of the small dataset.
Future Scope:
The dataset considered in this project uses multiclass classification techniques as the defects are not co-related. Also, as
the defects are not co-related some techniques are performed in a multi univariate technique that is using different
models for each fault. Now, Multi label classification is where same predictor values are causing 2 or more defects at a
time which is not the case in this particular dataset thus not required and hence not used. Thus this project and the
results are limited to multi class classification when the faults are not co-related. So for further data if the defects
become co-related, multi-label classification has to be used which becomes the future scope of this project.

References
-Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system",
Measurement, vol. 43, no. 4, pp. 513-519, 2010.
Sanjay Jain1, Chandreshekhar Azad2, Vijay Kumar Jha3, “Steel faults diagnosis under predictive analysis”, International
Journal of Computer Engineering and Applications, Volume IV, Issue II/III, Oct.13.
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods
and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
Mahmoud Fakhr, Alaa M.Elsayad "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science,
vol. 8, no. 4, pp. 506-514, 2012.
S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International
Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning Algorithms. Proceedings Of
The 23Rd International Conference On Machine Learning - ICML '06. http://dx.doi.org/10.1145/1143844.1143865
Frauke Günther and Stefan Fritsch (2008) neuralnet: Training of Neural Networks. The R Journal Vol. 2/1, June 2010
Halawani, M. (2014). A study of decision tree ensembles and feature selection for steel plates faults
detection. International Journal of Technical Research and Applications, 2(4), 127-131.
Tsoi, A.C., Pearson, R.A. (1991) Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons.
Advances in Neural Information Processing Systems 3 R. Morgan Kaufmann Publishers, San Mateo, CA. 963-969.
Pohar, M., Blas, M., & Turk, S. (2004). Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation
Study.Metodoloski zvezki, 1(1), 143-161.
Neural Network Primer: Part I" by Maureen Caudill, AI Expert, Feb. 1989
http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
http://saiconference.com/Downloads/SpecialIssueNo10/Paper_3A_comparative_study_of_decision_tree_ID3_and_C4.5
.pdf
www.cs.princeton.edu
APPENDIX
1. On Original Dataset
library(ISLR)
library(boot)

library(MASS)
data=read.csv(file.choose(), header=T)
attach(data)
data$alldefects="A"
for(i in 1:1941) {
if ( Z_Scratch[i]==1) {data$alldefects[i]="B"}
if ( K_Scratch[i]==1) {data$alldefects[i]="C"}
if ( Stains[i]==1) {data$alldefects[i]="D"}
if ( Dirtiness[i]==1) {data$alldefects[i]="E"}
if ( Bumps[i]==1) {data$alldefects[i]="F"}
if ( Other_Faults[i]==1) {data$alldefects[i]="G"} }
data$alldefects=factor(data$alldefects)
set.seed(1)
trainingsample=sample(1:nrow(data), size=0.70*nrow(data))
train=data[trainingsample,]
test=data[-trainingsample,]
write.csv(train,file="exportedtrainingdata.csv")
write.csv(test,file="exportedtestingdata.csv")
train2=train[,-(28:34)]
test2=test[,-(28:34)]
test.alldefects=test2[,28]
#LDA
lda.model= lda(alldefects~., data = train2)
lda_pred= predict(lda.model, test2)
table(lda_pred$class, test.alldefects)
mean(lda_pred$class!= test.alldefects)
mean(lda_pred$class== test.alldefects)
lda.cv=lda(alldefects~.,test2, CV=TRUE)
table(lda.cv$class,test.alldefects)
mean(lda.cv$class!= test.alldefects)
predictions <- as.numeric(lda_pred$class, type="response")
multiclass.roc(test.alldefects, predictions, plot=T)
y=rep(0,length(lda_pred$class))
y[lda_pred$class==test.alldefects]=1
x=rep(0,length(test.alldefects))
x[test.alldefects==test.alldefects]=1
roc(x,y,plot=TRUE,main="LDA")
predictions_lda <- as.numeric(lda_pred,type="vote")
multiclass.roc(test.alldefects, predictions_lda, plot=T)

#qda
qda.model= qda(alldefects~., data = train2)
qda_pred= predict(qda.model, test2)
table(qda_pred$class, test.alldefects)
mean(qda_pred$class!= test.alldefects)
##tree
library(tree)
tree1=tree(train2$alldefects~.,data=train2)
plot(tree1)
text(tree1 ,pretty =0)
tree.pred=predict(tree1,test2,type="class")
table(tree.pred ,test.alldefects)
mean(tree.pred!=test.alldefects)
predictions_tree <- as.numeric(tree.pred,type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)
##pruning
set.seed (1)
cv.data =cv.tree(tree1 ,FUN=prune.misclass )
names(cv.data)
cv.data
par(mfrow =c(1,1))
plot(cv.data$size ,cv.data$dev ,type="b")
plot(cv.data$k ,cv.data$dev ,type="b")
prune.data = prune.misclass(tree1 ,best =9)
plot(prune.data)
text(prune.data,pretty =0)
tree.pred2=predict(prune.data , test2 ,type="class")
table(tree.pred2 ,test.alldefects)
mean(tree.pred2!=test.alldefects)
predictions_tree <- as.numeric(tree.pred2,type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)
## bAGGING
set.seed (1)

bag.data
=randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness
+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity,data=train2 , mtry=10,importance =TRUE)
bag.data =randomForest(alldefects~.,data=train2 , mtry=27,importance =TRUE)
bag.data
yhat.bag = predict (bag.data ,test2)
plot(yhat.bag , test.alldefects)
abline (0,1)
table(yhat.bag, test.alldefects)
mean( yhat.bag!=test.alldefects)
predictions_bag <- as.numeric(yhat.bag,type="response")
multiclass.roc(test.alldefects, predictions_bag, plot=T)
#randomforest
set.seed (1)
library(randomForest)
rf =randomForest(alldefects~.,data=train2 , importance =TRUE)
yhat.rf = predict (rf ,test2)
table(yhat.rf, test.alldefects)
mean( yhat.rf !=test.alldefects)
predictions <- as.numeric(predict(rf, test2, type = 'response'))
multiclass.roc(test.alldefects, predictions, plot=T)
#randomforest with important predictors
set.seed (1)
library(randomForest)
rrf
=randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness
+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity,data=train2 , importance =TRUE)
yhat.rrf = predict (rrf ,test2)
table(yhat.rrf, test.alldefects)
mean( yhat.rrf !=test.alldefects)
#randomforest with Pastry only
set.seed (1)
train_pastry$Pastry=factor(train_pastry$Pastry)
test_pastry$Pastry=factor(test_pastry$Pastry)
rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
yhat.rf_pastry = predict (rf_pastry ,test_pastry)
table(yhat.rf_pastry, test_pastry[,28])
mean( yhat.rf_pastry!=test_pastry[,28])
#randomforest with z_scratch only

set.seed (1)
train_zs$Z_Scratch=factor(train_zs$Z_Scratch)
test_zs$Z_Scratch=factor(test_zs$Z_Scratch)
rf_zs =randomForest(Z_Scratch~.,data=train_zs, importance =TRUE)
yhat.rf_zs = predict (rf_zs ,test_zs)
table(yhat.rf_zs, test_zs[,28])
mean( yhat.rf_zs!=test_zs[,28])
#randomforest with K_scratch only
set.seed (1)
train_ks$K_Scratch=factor(train_ks$K_Scratch)
test_ks$K_Scratch=factor(test_ks$K_Scratch)
rf_ks =randomForest(K_Scratch~.,data=train_ks, importance =TRUE)
yhat.rf_ks = predict (rf_ks ,test_ks)
table(yhat.rf_ks, test_ks[,28])
mean( yhat.rf_ks!=test_ks[,28])
#randomforest with stains only
set.seed (1)
train_stains$Stains=factor(train_stains$Stains)
test_stains$Stains=factor(test_stains$Stains)
rf_stains =randomForest(Stains~.,data=train_stains, importance =TRUE)
yhat.rf_stains = predict (rf_stains ,test_stains)
table(yhat.rf_stains, test_stains[,28])
mean( yhat.rf_stains!=test_stains[,28])
#randomforest with dirt only
set.seed (1)
train_dirt$Dirtiness=factor(train_dirt$Dirtiness)
test_dirt$Dirtiness=factor(test_dirt$Dirtiness)
rf_dirt =randomForest(Dirtiness~.,data=train_dirt, importance =TRUE)
yhat.rf_dirt = predict (rf_dirt ,test_dirt)
table(yhat.rf_dirt, test_dirt[,28])
mean( yhat.rf_dirt!=test_dirt[,28])
#randomforest with bumps only
set.seed (1)
train_bumps$Bumps=factor(train_bumps$Bumps)
test_bumps$Bumps=factor(test_bumps$Bumps)
rf_bumps =randomForest(Bumps~.,data=train_bumps, importance =TRUE)
yhat.rf_bumps = predict (rf_bumps ,test_bumps)
table(yhat.rf_bumps, test_bumps[,28])
mean( yhat.rf_bumps!=test_bumps[,28])
#randomforest with other faults only

set.seed (1)
train_of$Other_Faults=factor(train_of$Other_Faults)
test_of$Other_Faults=factor(test_of$Other_Faults)
rf_of =randomForest(Other_Faults~.,data=train_of, importance =TRUE)
yhat.rf_of = predict (rf_of ,test_of)
table(yhat.rf_of, test_of[,28])
mean( yhat.rf_of!=test_of[,28])
rf.cv=randomForest(train_of$Other_Faults~.,data=train_of, CV=TRUE)
table(rf.cv$class,train_of[,8])
r = randomForest(alldefects~., data = cadets, importance =TRUE, do.trace = 100)
varImpPlot(r)
r###################################################################logistic regression
#####################################################
#000000000000000000000000000000000000000000000000_pastry
train_pastry=train[,-c(29,30,31,32,33,34,35)]
fix(train_pastry)
test_pastry= test[,-c(29,30,31,32,33,34,35)]
attach(train_pastry)
attach(test_pastry)
log_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y
_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity,data=train
_pastry,family = "binomial")
summary(log_pastry)
log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment
log_pastry_pred_y[log_pastry_pred> 0.5]= 1
table(log_pastry_pred_y, test_pastry[,28])
mean(log_pastry_pred_y != test_pastry[,28])
#ROC
y=rep(0,length(log_pastry_pred_y))
y[log_pastry_pred_y==1]=1
x=rep(0,length(test_pastry[,28]))
x[test_pastry[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ PASTRY")
# cross validation
cv_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y
_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity,data=train
_pastry,family = "binomial")

cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_zs
train_zs=train[,-c(28,30,31,32,33,34,35)]
fix(train_zs)
test_zs= test[,-c(28,30,31,32,33,34,35)]
attach(train_zs)
attach(test_zs)
log_zs =
glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum
+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Lumino
sity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300,data=train_zs,family = "binomial")
summary(log_zs)
log_zs_pred = predict(log_zs, test_zs, type ="response")
log_zs_pred_y = rep(0, length(test_zs[,28])) # default assignment
log_zs_pred_y[log_zs_pred> 0.5]= 1
table(log_zs_pred_y, test_zs[,28])
mean(log_zs_pred_y != test_zs[,28])
#ROC
y=rep(0,length(log_zs_pred_y))
y[log_zs_pred_y==1]=1
x=rep(0,length(test_zs[,28]))
x[test_zs[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Z_skratch")
#CV
log_zs=step(glm(Z_Scratch~.,data=train_zs,family="binomial"),direction="backward")
cv_zs =
glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum
+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Lumino
sity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300,data=train_zs,family = "binomial")
cv.glm(train_zs,cv_zs,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_ks
train_ks=train[,-c(28,29,31,32,33,34,35)]
test_ks= test[,-c(28,29,31,32,33,34,35)]
attach(train_ks)
attach(test_ks)
log_ks =
glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+
X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index
+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness,data=train_ks,family =
"binomial")
summary(log_ks)
log_ks_pred = predict(log_ks, test_ks, type ="response")

log_ks_pred_y = rep(0, length(test_ks[,28])) # default assignment
log_ks_pred_y[log_ks_pred> 0.5]= 1
table(log_ks_pred_y, test_ks[,28])
mean(log_ks_pred_y != test_ks[,28])
log_ks=step(glm(K_Scratch~.,data=train_ks,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_ks_pred_y))
y[log_ks_pred_y==1]=1
x=rep(0,length(test_ks[,28]))
x[test_ks[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ k_skratch")
#CV
cv_ks =
glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+
X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index
+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness,data=train_ks,family =
"binomial")
cv.glm(train_ks,cv_ks,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_Stains
train_stains=train[,-c(28,29,30,32,33,34,35)]
test_stains= test[,-c(28,29,30,32,33,34,35)]
log_stains=glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Ed
ges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Max
imum_of_Luminosity+Outside_X_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum,
data=train_stains,family="binomial")
log_stains=glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Ed
ges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Max
imum_of_Luminosity,Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum,data=
train_stains,family = "binomial")
summary(log_stains)
log_stains_pred = predict(log_stains, test_stains, type ="response")
log_stains_pred_y = rep(0, length(test_of[,28])) # default assignment
log_stains_pred_y[log_stains_pred> 0.5]= 1
table(log_stains_pred_y, test_stains[,28])
mean(log_stains_pred_y != test_stains[,28])
log_stains=step(glm(Stains~.,data=train_stains,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_stains_pred_y))
y[log_stains_pred_y==1]=1
x=rep(0,length(test_stains[,28]))

x[test_stains[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Stains")
# cross validation
cv_stains =
glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+
LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_L
uminosity,Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum,data=train_stain
s,family = "binomial")
cv.glm(train_stains,cv_stains,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_Dirtiness
train_dirt=train[,-c(28,29,30,31,33,34,35)]
test_dirt= test[,-c(28,29,30,31,33,34,35)]
log_dirt =
glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_M
aximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index,data=train_dirt,family = "binomial")
summary(log_dirt)
log_dirt_pred = predict(log_dirt, test_dirt, type ="response")
log_dirt_pred_y = rep(0, length(test_dirt[,28])) # default assignment
log_dirt_pred_y[log_dirt_pred> 0.5]= 1
table(log_dirt_pred_y, test_dirt[,28])
mean(log_dirt_pred_y != test_dirt[,28])
log_dirt=step(glm(Dirtiness~.,data=train_dirt,family="binomial"),direction="backward")
# cross validation
cv_dirt =
glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_M
aximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index,data=train_dirt,family = "binomial")
cv.glm(train_dirt,cv_dirt,K=10)$delta[1]
y=rep(0,length(log_dirt_pred_y))
y[log_dirt_pred_y==1]=1
x=rep(0,length(test_dirt[,28]))
x[test_dirt[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Dirtiness")
#000000000000000000000000000000000000000000000000_Bumps
train_bumps=train[,-c(28,29,30,31,32,34,35)]
test_bumps= test[,-c(28,29,30,31,32,34,35)]
log_bumps =
glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate
_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+
Y_Minimum+TypeOfSteel_A300,data=train_bumps,family = "binomial")
summary(log_bumps)
log_bumps_pred = predict(log_bumps, test_bumps, type ="response")

log_bumps_pred_y = rep(0, length(test_of[,28])) # default assignment
log_bumps_pred_y[log_bumps_pred> 0.5]= 1
table(log_bumps_pred_y, test_bumps[,28])
mean(log_bumps_pred_y != test_bumps[,28])
log_bumps=step(glm(Bumps~.,data=train_bumps,family="binomial"),direction="backward")
# cross validation
cv_bumps =
glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate
_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+
Y_Minimum+TypeOfSteel_A300,data=train_bumps,family = "binomial")
cv.glm(train_bumps,cv_bumps,K=10)$delta[1]
y=rep(0,length(log_bumps_pred_y))
y[log_bumps_pred_y==1]=1
x=rep(0,length(test_bumps[,28]))
x[test_bumps[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ bumps")
#000000000000000000000000000000000000000000000000_otherfaults
train_of=train[,-c(28,29,30,31,32,33,35)]
test_of= test[,-c(28,29,30,31,32,33,35)]
log_of =
glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Convey
or+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel
_Plate_Thickness,data=train_of,family = "binomial")
summary(log_of)
log_of_pred = predict(log_of, test_of, type ="response")
log_of_pred_y = rep(0, length(test_of[,28])) # default assignment
log_of_pred_y[log_of_pred> 0.5]= 1
table(log_of_pred_y, test_of[,28])
mean(log_of_pred_y != test_of[,28])
log_of=step(glm(Other_Faults~.,data=train_of,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_of_pred_y))
y[log_of_pred_y==1]=1
x=rep(0,length(test_of[,28]))
x[test_of[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ other faults")
# cross validation
cv_of =
glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Convey

or+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel
_Plate_Thickness,data=train_of,family = "binomial")
cv.glm(train_of,cv_of,K=10)$delta[1]
##########PCA
#PCA on complete data set
datap=data[,-(28:35)]
fit <- princomp(datap, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
axes <- predict(fit, newdata = datap)
head(axes, 4)
fix(axes)
data1=axes[,1:7]
write.csv(data1,file="pcadata.csv")
data2=data.frame(data,data1)
write.csv(data2,file="comb_data.csv")
#SVM
install.packages("e1071")
library(e1071)
svm.fit=svm(alldefects~.,data=train2,type="C",kernel="polynomial",degree=3, cost=15)
summary(svm.fit)
predicted=predict(svm.fit,test2)
table(predicted,test.alldefects)
mean(predicted!=test.alldefects)
plot(svm.fit,train2,Length_of_Conveyor~X_Maximum,slice=list(X_Perimeter=3,Y_Perimeter=4),svSymbol=1,dataSymbol
=2,color.palette=terrain.colors )
# ROC
predictions=as.numeric(predicted,type="response")
multiclass.roc(test.alldefects,predictions,plot=T,main="ROC for SVM")
#ANN
library(nnet)
train.nnet<-nnet(alldefects~.,data=train2,size=20,rang=0.1,Hess=FALSE,decay=0.001,maxit=10000)
test.nnet<-predict(train.nnet,test2,type=("class"))
table(test2$alldefects,test.nnet)
mean(test.nnet!=test2$alldefects)
library(pROC)
predictions=as.numeric(test.nnet,type="response")

multiclass.roc(test2$alldefects,predictions,plot=T,main="ROC for ANN")
#read data stored in CSV file.
data=read.csv("Steel_faults.csv",header=TRUE)
attach(data)
x=data[,1:27] # input variables
y=data[,28:34] # response variables
n=1941 # total number of observations
n1=round(n*0.7) # number of observations for training
samp=sample(1:n,n1,replace=FALSE) # to select random observation
## following is the userdefined function to obtain confusion matrix
test.cl = function(true, pred) {
true = max.col(true)
cres = max.col(pred)
table(true, cres)
}
###another package for NNA
install.packages("RSNNS")
library(RSNNS)
model=mlp(x[samp,], y[samp,], size=c(10,10,5),linOut=F)
model=mlp(train2[,-28], train2[,28], size=2,linOut=F)
#library(devtools)
#plot.nnet(model)
test.cl(y[-samp,], predict(model, x[-samp,])) #confusion matrix for training data
test.cl(y[samp,],fitted.values(model)) #confusion matrix for testing data
#C50
crx <- data[ sample( nrow( data ) ), ]
X <- crx[,1:27]
y <- crx[,35]
trainx<- X[1:1358,]
trainy <- y[1:1358]
testx <- X[1358:1941,]
testy <- y[1358:1941]
model <- C5.0( trainx, trainy, trials=75 )
p <- predict( model, testx, type="class" )
sum( p == testy ) / length( p )
table(p, testy)
mean(p != testy)
predictions_c5 <- as.numeric(p,type="response")
multiclass.roc(testy, predictions_c5, plot=T)

ISEN 613_Team3_Final Project Report

ISEN 613_Team3_Final Project Report

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (8)

Similar to ISEN 613_Team3_Final Project Report

Similar to ISEN 613_Team3_Final Project Report (20)

ISEN 613_Team3_Final Project Report