ISEN 613- Engineering Data Analysis
Naman Kapoor
Vinayak Nair
Rahul Garg
Omkar Deshpande
Adriana De La Cruz
Multi-Attribute Classification of
Steel Plate Defects
Team 3
Executive Summary
Anomaly detection is vital in the industry and can be the difference between success and bankruptcy. Manufacturing
processes need to be continuously monitored so that any change in the process can be quickly identified and controlled
so that there is no production loss.
This project deals with the prediction of faults that can occur in the manufacturing of the steel plates by taking into
consideration the available historical data. The main objective of this project is to compare the working of different
classification models and decide one final model that will have the least misclassification rate (high prediction accuracy).
Many studies have been done by researchers on the comparative performances of multiclass classification techniques.
This project adds a new dimension by drawing comparisons between error rates of multiclass techniques and individual
classification techniques for each class. Although modelling each defect individually gives a very high accuracy rate, the
combined practical hierarchical model will not be as efficient, because its accuracy is the product of the individual
accuracies of the models it chains together.
In this project, performances of techniques such as Linear Discriminant Analysis, Logistic Regression (individual and
multivariate), Random Forests (individual and multivariate), Single Decision Trees, Bagging, Support Vector Machines,
and Artificial Neural Networks have been compared and analyzed. Principal Component Analysis was also used to reduce
the dimensions of the given data.
The challenge was to decide whether to consider different models for all seven defects or to build a model for all defects
combined. The dataset also had many attributes and hence, it was difficult to select the most significant predictors and
avoid over-fitting. As there were more than two responses the coding that was used for binary classification wasn’t
applicable and had to deal with multi classification and new coding techniques had to be explored. Methods like
Artificial Neural Networks and C5.0 were used. These methods were completely new and efforts were required in terms
of literature review and coding to implement them.
The following table gives the misclassification error rate and area under curve for different modeling techniques used to
model the problem.
Modeling Technique        Misclassification Error Rate (%)   Area Under Curve
LDA                       32.4                               0.790
Decision Tree             36.0                               0.784
Bagging                   20.8                               0.824
Random Forest             22.8                               0.797
SVM                       27.6                               0.804
Neural Network Analysis   53.9                               0.605
C5.0                      19.4                               0.831
From the table above it is clear that the C5.0 modeling technique gives the lowest misclassification error (19.4%) and
the highest area under the ROC curve.
The main objective of this project was to compare the working of different classification models built using different
modeling techniques and propose one final model that will have the least misclassification rate (high prediction
accuracy). Thus, the results show the successful completion of the objectives. The final model proposed for predicting
the faults is C5.0, with a prediction accuracy of 80.6% (the complement of its 19.4% misclassification rate).
INTRODUCTION
Importance of the problem:
The present era is one of quality: in today's world of cut-throat competition and large-scale production, only those
manufacturers survive who can provide good-quality products and services that meet or exceed the expectations of
their customers. Ongoing manufacturing processes must be continuously monitored so that any change in the process
can be identified quickly and rectified to prevent production loss. In manufacturing, operations managers can use
advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete
process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global
manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and the
capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets,
aggregating them, and analyzing them to reveal important insights. In the steel industry, specifically alloy steel,
defective products impose a high cost on the manufacturer. One common fault in producing low-carbon steel grades is
the pits-and-blister defect. Removing it requires grinding the surface of the steel product, which wastes time and
increases the cost of production. The incidence of defects is related to numerous factors, including material
composition and the production processes used. If we can correctly predict these defects from the important
parameters, we know which parameters to control, and how tightly, in order to minimize the defects. Thus the problem
at hand deals with data from the steel industry, and the results obtained from this project can be used to predict the
faults and implement the necessary changes.
Objective:
This project deals with the prediction of faults that can occur in the manufacturing of the steel plates by taking into
consideration the available historical data. The main objective of this project is to compare the working of different
classification models built using different classification techniques and propose one final model that will have the least
misclassification rate (high prediction accuracy). Various data mining techniques can be used to predict the steel
plate faults from the given data. In this project, the results of classification techniques such as Linear Discriminant
Analysis, Logistic Regression (individual and multivariate), Random Forests (individual and multivariate), Single Decision
Trees, Bagging, Support Vector Machines, and Artificial Neural Networks have been compared and the best model is
proposed. The model building also uses Principal Component Analysis to reduce the dimensions of the given data.
Scope of Work:
Gantt chart (13 November to 15 December 2015) covering the following activities:
1. Retrieving data and understanding its details
2. Literature review & selecting a suitable supervised neural network method
3. Model building using classification techniques learned in class
4. Model building using the selected neural network method
5. Predicting results and concluding the best modeling method
6. Report making & documentation
LITERATURE REVIEWS
Following were the papers selected:
1. Steel Plates Faults Diagnosis with Data Mining Models. Fakhr, M., & Elsayad, A. M. (2012). (Reviewed by Naman Kapoor)
2. Machine Learning Techniques for Anomaly Detection: An Overview. Omar, S., Ngadi, A., & Jebur, H. H. (2013).
(Reviewed by Naman Kapoor)
3. Neuralnet: Training of neural networks. Günther, F., & Fritsch, S. (2010). (Reviewed by Omkar Deshpande)
4. An Empirical Comparison of Supervised Learning Algorithms. Caruana, R., & Niculescu-Mizil, A. (2006). (Reviewed
by Omkar Deshpande)
5. A SVM-based pipeline leakage detection and pre-warning system. Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin.
(2010). (Reviewed by Rahul Garg)
6. Steel faults diagnosis under predictive analysis. Jain, S., Azad, C., & Jha, V. K. (2013). (Reviewed by Rahul Garg)
7. Classification of EEG signals using neural network and logistic regression. A. Subasi and E. Erçelebi. (2005).
(Reviewed by Rahul Garg)
8. A study of decision tree ensembles and feature selection for steel plates faults detection. Halawani, M. (2014).
(Reviewed by Vinayak Nair)
9. Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Tsoi, A.C., Pearson, R.A.
(1991). (Reviewed by Vinayak Nair)
10. Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Pohar, M., Blas, M., &
Turk, S. (2004). (Reviewed by Vinayak Nair)
Combined Takeaways
 Advanced decision trees are extremely efficient modeling techniques for multiclass classification problems.
 Artificial Neural Networks are powerful but complex algorithms; they have convergence and variable-selection
issues that need to be addressed.
 Supervised machine learning techniques significantly outperform unsupervised ones on multiclass
classification problems.
 LDA is advisable in comparison to logistic regression, when the variables are normally distributed.
Reviewed by: Naman Kapoor
Steel Faults Diagnosis with Data Mining Models
-Mahmoud Fakhr and Alaa M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer
Science, vol. 8, no. 4, pp. 506-514, 2012.
Objective:
The key problem this paper addresses is the formation of an appropriate intelligent data mining model for anomaly
detection in the manufacturing industry on a particular dataset. Addressing this problem is important due to the need to
create intelligent fault diagnostic models with the help of data mining to enhance the quality of manufacturing and to
lessen the cost of product testing. It not only helps avoid product quality problems but also facilitates preventive
maintenance. The key objective of this paper is to use predictive analytics to select the best classification
model for the selected steel plate faults detection dataset by comparing different models using certain statistical
measures. The authors have addressed this problem by evaluating the performances of three of the popular and
effective data mining models (using supervised learning techniques) on the selected dataset and have presented their
views and outcomes on these. From their approach the authors found that the C5.0 decision tree with boosting achieved
the best results on the dataset which implies that decision trees have a greater impact on fault diagnosis than fellow
supervised learning techniques.
Approach:
The authors approached the problem by applying three multiclass classification techniques, namely the C5.0 decision
tree (C5.0 DT) with boosting, the Multilayer Perceptron Neural Network (MLPNN) with pruning, and Logistic Regression
(LR) with forward stepwise selection, on the steel plates fault dataset obtained from the University of California at Irvine
(UCI) machine learning
repository. These models were formulated to diagnose seven commonly occurring faults of steel plate namely: Pastry,
Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps and other faults. A brief description of the techniques used is presented
below:
I. C5.0 decision tree
The C5.0 DT algorithm is an improved version of the C4.5 and ID3 algorithms. C5.0 uses information gain, based on the
notion of entropy, as its measure of purity. This method proved to be a major takeaway for our project.
The three methods used in C5.0 tree construction are boosting, pruning and winnowing. While boosting and pruning
were known to us, winnowing was new: it preselects the subset of attributes that will be used to construct the tree,
ensuring that irrelevant attributes are excluded from the tree-building process. The authors used only 13 of the 27
attributes in the dataset to build the C5.0 tree.
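As an illustration of how winnowing can be switched on in R's C50 package (a sketch under assumed names, not the
authors' code; train and fault are hypothetical):

# C5.0 with winnowing: irrelevant attributes are screened out before tree building
library(C50)
ctrl <- C5.0Control(winnow = TRUE)
model <- C5.0(fault ~ ., data = train, trials = 10, control = ctrl)
summary(model)  # the output lists the attributes retained after winnowing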
II. Multilayer Perceptron Neural Network (MLPNN)
Artificial Neural Networks (ANNs) are biologically motivated and highly sophisticated analytical techniques capable of
modelling extremely complex nonlinear functions. The MLPNN is considered a powerful function approximator for
prediction and classification problems; its structure is organized into layers of neurons: an input layer, one or more
hidden layers, and an output layer. The MLPNN was trained using the Back Propagation (BP) training technique.
In this study the network was trained using the pruning approach, which starts with a large network and removes
(prunes) the weakest neurons in the hidden and input layers as the training proceeds.
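In gradient-descent form (our gloss of standard BP, not the paper's notation), each weight w is repeatedly updated as
w_new = w_old - eta * dE/dw,
where eta is the learning rate and E the network's output error; pruning then removes the neurons whose weights
remain weakest.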
III. Logistic Regression
Logistic regression is a nonlinear regression technique for predicting a dichotomous (binary) class attribute in terms of
the predictive attributes. The algorithm does not predict the class attribute directly but predicts the odds of its
occurrence through the logit (log-odds) function.
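In the usual notation (our addition), the model is linear in the log-odds,
logit(p) = ln(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk,
so the predicted probability is p = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)).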
Results
The performance of each model was evaluated using three statistical measures: classification accuracy (the
complement of the misclassification error rate), sensitivity and specificity. These measures are defined using the counts
of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN).
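In standard form (our addition, using the usual definitions):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)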
The reported results show that the C5.0 learning algorithm is the best model on both the training and test subsets, the
neural network model is second best, and logistic regression is the worst.
Summary
The major takeaways from this study were as follows:
 Advanced Decision Trees (C5.0 DT) are a very powerful data mining tool to use in predictive analytics of
multiclass anomaly detection with very high accuracy.
 Multilayer Perceptron Neural Networks with back propagation are standard and simple to implement, but the
algorithm is complex: it suffers from convergence issues and requires initialization and adjustment of many
individual parameters to optimize its performance.
 Logistic Regression, although a very powerful modeling tool, assumes that the log odds of the class attribute (not
the event itself) are linear in the predictive attributes. The right inputs must be chosen along with their
functional relationship to the class attribute.
 Amount, quality and the measuring process of data are key components of diagnostic accuracy.
Reviewed by: Naman Kapoor
Machine Learning Techniques for Anomaly Detection: An Overview
-S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International
Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
Objective:
The key problem this paper addresses is anomaly detection in the industry. The authors aim to aid anomaly detection
with machine learning techniques. Addressing this problem is important because, even after many years of research,
the anomaly detection community is still confronting difficult problems. The key objective of this paper is to present an
overview of research directions for applying supervised and unsupervised methods to the problem of anomaly
detection. The authors address this problem by providing a general architecture of anomaly intrusion detection systems
and by discussing in detail the various machine learning techniques that fall under supervised and unsupervised
learning, along with their strengths and weaknesses in handling anomaly detection.
Approach:
The authors approached the problem by comparing different techniques under supervised and unsupervised machine
learning techniques and bringing out their strengths and weaknesses on anomaly detection. An overview of the two
approaches is given below:
I. Supervised Anomaly Detection
Supervised methods (also known as classification methods) require a labelled training set containing both normal and
anomalous samples to construct the predictive model. Theoretically, supervised methods provide a better detection
rate than semi-supervised and unsupervised methods, since they have access to more information. However, some
technical issues make these methods less accurate than they are supposed to be.
II. Unsupervised Anomaly Detection
These techniques do not need training data. Instead, they rely on two basic assumptions. First, they presume that most
of the network connections are normal traffic and only a very small percentage is abnormal. Second, they expect
malicious traffic to be statistically different from normal traffic. Under these assumptions, groups of similar instances
that appear frequently are assumed to be normal traffic, while infrequent instances that differ considerably from the
majority are regarded as malicious.
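The clustering intuition can be pictured with a toy sketch in R (ours, not the paper's code): frequent, similar instances
form large clusters that are treated as normal, while points in very small clusters are flagged as anomalous.

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),           # dense "normal" traffic
           matrix(rnorm(10, mean = 6), ncol = 2))  # a few scattered anomalies
km <- kmeans(X, centers = 3, nstart = 20)
sizes <- table(km$cluster)
anomalous <- km$cluster %in% names(sizes)[sizes < 5]  # tiny clusters = suspect
which(anomalous)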
The different techniques compared are shown in the table below:
Supervised Machine Learning    Unsupervised Machine Learning
K-Nearest Neighbours           Self-Organising Maps
Neural Networks                K-means Clustering
Decision Trees                 Fuzzy C-means Clustering
Support Vector Machines        Expectation-Maximization Meta-Algorithm
Table: Machine Learning Techniques for Anomaly Detection
Results
The results of the comparison are summarized in the paper's table.
Summary
The major takeaways from this review were:
 Machine learning techniques have received considerable attention among the anomaly detection researchers.
 Anomaly detection comprises supervised techniques and unsupervised techniques.
 The experiments demonstrated that supervised learning methods significantly outperform unsupervised
ones if the test data contains no unknown attacks.
 Among the supervised methods, the best performance is achieved by the non-linear methods, such as SVM,
multi-layer perceptron and the rule-based methods.
 Among unsupervised techniques, K-Means, SOM and one-class SVM achieved better performance than the
others, although they differ in their ability to detect all attack classes efficiently.
An Empirical Comparison of Supervised Learning Algorithms
Reviewed by: Omkar Deshpande
Reference: Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms.
Proceedings of the 23rd International Conference on Machine Learning (ICML '06).
http://dx.doi.org/10.1145/1143844.1143865
Objective:
The objective of this paper is to give an empirical comparison between supervised learning algorithms such as SVMs, neural nets,
logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted
stumps.
The main motivation behind the paper is that the last comprehensive empirical evaluation of supervised learning was
the Statlog Project in the early 90's, before many of these algorithms were developed.
The key objective of this paper is to provide the comparison between the algorithms based on variety of performance criteria such
as Precision/Recall, ROC, Lift, Accuracy, F-score, squared error, etc.
The empirical comparison found that boosted trees were the best learning algorithm overall. Random forests are a
close second, followed by un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that
performed poorest were naive bayes, logistic regression, decision trees, and boosted stumps. This implies that a model
trained using boosted trees will generally give the best predictive performance compared to methods like random
forests or SVMs.
Approach:
The datasets used by the authors are ADULT, COV TYPE and LETTER from the UCI Repository (Blake & Merz, 1998).
COV TYPE was converted to a binary problem by treating the largest class as positive and the rest as negative. A random
5000 cases were taken as the training set and the rest as the test set; of those 5000 cases, 4000 were used for training
and 1000 for calibrating the model. Metrics such as ROC area, accuracy and lift are then calculated for the different
algorithms, and a column is obtained giving the mean normalized score over the eight metrics when model selection is
done by cheating and looking at the final test sets. The means in this column represent the best performance that
could be achieved with each learning method if model selection were done optimally.
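The calibration step can be pictured with Platt scaling, one common approach (our illustration, not the paper's code;
scores and y are hypothetical vectors of raw outputs and true 0/1 labels for the 1000 calibration cases, and new.scores
holds scores for fresh cases):

# fit a sigmoid on held-out raw scores to turn them into probabilities
cal <- glm(y ~ scores, family = binomial)
prob <- predict(cal, newdata = data.frame(scores = new.scores), type = "response")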
Results:
The comparison shows that models which perform best overall can do worse than average models on particular
problems. For example, the best models on ADULT are calibrated boosted stumps, random forests and bagged trees,
while boosted trees perform much worse there. Bagged trees and random forests also perform very well on MG and
SLAC. On MEDIS, the best models are random forests, neural nets and logistic regression. The only models that never
exhibit excellent performance on any problem are naive bayes and memory-based learning. The paper's results table
gives the values used for this comparison of the techniques.
Summary:
It can be seen that boosted trees were the best learning algorithm overall. Random forests are close second, followed by un-
calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that performed poorest were naive bayes,
logistic regression, decision trees, and boosted stumps. This implies that if a model is trained using boosted trees it will give the best
performance in predicting values as compared to other methods like random forest, SVM etc. But this is not always the
case: the performance metric must be chosen carefully, and the technique that works best on that metric selected. For
example, Precision/Recall measures are used in information retrieval, medicine prefers ROC area, and Lift is
appropriate for some marketing tasks. For a medical application, then, the model with the best ROC performance
would be the preferred model.
Reviewed By: Omkar Deshpande
Training of Neural Networks
Reference: Günther, F., & Fritsch, S. (2010). neuralnet: Training of Neural Networks. The R Journal, Vol. 2/1, June 2010.
Objective:
The objective of this paper is to discuss the algorithm used in the neuralnet package and demonstrate its application in
R, and to discuss the advantages of the neuralnet package over generalized linear models.
The main reason behind publishing this paper is to document the neuralnet package developed by the authors and to
give a working example using the infert dataset in R.
Artificial neural networks can be applied to approximate any complex functional relationship between input and output
variables. Unlike generalized linear models, it is not necessary to pre-specify the type of relationship between
covariates and response variables (for instance, as a linear combination). This makes artificial neural networks a
valuable statistical tool. They are in particular direct extensions of GLMs and can be applied in a similar manner.
Approach:
In this paper the authors first discuss the algorithm used in building the neuralnet package, then the training of a
neuralnet model in R, using the infert dataset. The number of hidden neurons is determined in relation to the needed
complexity; a neural network with, for example, two hidden neurons is trained. The results of backprop, nnet and
neuralnet are then compared. The paper also discusses additional features, such as the compute and
confidence.interval functions, that ship with the neuralnet package.
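A minimal sketch of that workflow, following the paper's infert example (the exact settings here are our assumptions):

library(neuralnet)
data(infert)
# two hidden neurons, cross-entropy error, logistic output
nn <- neuralnet(case ~ age + parity + induced + spontaneous, data = infert,
                hidden = 2, err.fct = "ce", linear.output = FALSE)
# the same network trained with traditional backpropagation, for comparison
nn.bp <- neuralnet(case ~ age + parity + induced + spontaneous, data = infert,
                   hidden = 2, algorithm = "backprop", learningrate = 0.01,
                   err.fct = "ce", linear.output = FALSE)
# predictions on new covariates use the package's compute() function
pred <- compute(nn, infert[, c("age", "parity", "induced", "spontaneous")])
head(pred$net.result)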
Results:
Being an informative paper, it discusses the various functions available in the neuralnet package and how to use them
in R. A few comparative results are provided, including a comparison with the nnet package: neural networks are
trained with the same parameter settings as above using neuralnet with algorithm="backprop" and using nnet. nn.bp
and nn.nnet show equal results; both training processes last only a very few iteration steps and the error is
approximately 158. In this small comparison, then, the model fit is less satisfying than that achieved by resilient
backpropagation.
Summary:
This paper introduced multilayer perceptrons and supervised learning. It also covered the use of the neuralnet package
available in R for modeling functional relationships between covariates and response variables.
neuralnet contains a very flexible function that trains multilayer perceptrons to a given data set in the context of
regression analyses. It is a very flexible package since most parameters can be easily adapted. For example, the
activation function and the error function can be arbitrarily chosen and can be defined by the usual definition of
functions in R.
Reviewed By: Rahul Garg
A SVM-BASED PIPELINE LEAKAGE DETECTION AND PRE-WARNING SYSTEM
Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system",
Measurement, vol. 43, no. 4, pp. 513-519, 2010.
Objective:
This paper addresses the detection of pipeline leakages, which may occur for various reasons such as manual digging
and illegal construction. It indicates the effectiveness of SVM over traditional machine learning techniques, which rest
on the assumption that unlimited training data is available. Gas leakages are a concern for industries as they lead not
only to huge monetary losses but may also have very tragic outcomes such as outbreaks of disease and even deaths.
Timely detection of suspected leakages can therefore be very beneficial to industry as well as to the general public. The
objective of this paper is to monitor and locate possible abnormal events (e.g. manual digging above a pipeline, illegal
construction) along the pipeline before a leakage takes place, by means of a new pipeline leakage detection and
pre-warning system. The authors employ SVM as the classifier to recognize these abnormal events. Three cases (gas
leakage, manual digging and human walking above the pipeline) were created, and a series of experimental trials was
used to train the model. The model was then used to classify abnormal events, and it provided quite accurate results.
The authors found that SVM can be a far better and more accurate technique for predicting gas leakages along
pipelines than the empirical risk minimization (ERM) method. This implies that although SVM is a comparatively new
technique, it is quite accurate for predictive analytics in multiclass classification problems.
Approach:
The authors followed a multiclass predictive-analytics approach. Since no historical data was available, they collected
training data by conducting trials of two types: abnormal-event identification trials and abnormal-event location trials.
Three cases (gas leakage, manual digging and human walking above the pipeline) were created, with eight predictor
columns. Twenty samples were collected at random from each case for training, and ten samples from each case were
used to test the trained SVM model. The misclassification rate on the test data indicates how accurately the model
performs and whether it can be deployed in practice. For the training process the "one-against-one" method is
employed. The trained multiclass SVM classifier is shown in the figure below: the two axes are the first two of the eight
predictors, and the circled data points are the support vectors.
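In R, the same one-against-one scheme is what e1071's svm() uses internally for multiclass problems; a hedged sketch
(events and class are hypothetical names, not the authors' data):

library(e1071)
# events: data frame with 8 numeric features and a 3-level factor 'class'
# (leakage / digging / walking); for k classes svm() builds k(k-1)/2 binary
# classifiers and combines them by majority voting
fit <- svm(class ~ ., data = events, kernel = "radial")
pred <- predict(fit, newdata = events)
table(pred, events$class)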
Results:
The SVM recognized the correct cause of leakage more than 95% of the time and located abnormal events quite
accurately. In the reported prediction results, where 1, 2 and 3 denote the three categories of abnormality, only sample
12 was recognized incorrectly.
Summary:
This paper presented the problem of pipeline leakage detection, a multiclass classification problem in predictive
analysis. The major takeaway from this review is that SVM can work quite accurately for multiclass classification,
especially when the training data is not very large, as in this leakage-detection case. The technique is far better than
traditional machine learning methods such as ERM, and among the methods available for multi-class classification the
"one-against-one" SVM method is more suitable for practical use than the others.
Reviewed By: Rahul Garg
STEEL FAULTS DIAGNOSIS USING PREDICTIVE ANALYSIS
S. Jain, C. Azad and V. K. Jha, "Steel faults diagnosis under predictive analysis", International Journal of Computer
Engineering and Applications, Volume IV, Issue II/III, Oct. 2013.
Objective:
The key problem this paper discusses is the generation of various types of defects in manufactured steel plates,
especially those made of alloyed steel. Addressing this problem is imperative because rectifying these defects by
grinding or milling wastes time and increases the cost of production, which could be prevented. The paper aims at
performing steel fault diagnosis using predictive analytics, so that the defect generation rate can be minimized by
finely tuning the factors responsible for it. To address this problem the authors used classification modeling
techniques, namely decision trees, multilayer perceptron neural networks and logistic regression, to develop a model
that diagnoses the faults as accurately as possible. After developing the different models it was found that the decision
tree provides the best results, having the lowest misclassification rate. This implies that a decision tree model could be
a good option for steel fault diagnosis using data mining techniques.
Approach:
The data set used in this review has been taken from UCI repository and it classifies steel plate faults into seven different
types which makes this a case of multi classification predictive analysis. The authors of this paper have tried various
methods of classification and then selected the best method based on the misclassification rate. The methods used are
decision trees, multilayer perceptron neural networks and logistic regression. The C4.5 boosting algorithm with 10 trials
was used for the decision trees. After all the models were built and one of the three selected, a genetic algorithm was
used to find the optimal solution. It works as follows:
1. Initialize random population of n chromosomes
2. Evaluate the fitness value f(x) of each chromosome x in population
3. Create a new population by repeating following steps
 Select two parent chromosomes from the given population according to fitness (chromosomes with better
fitness values have a bigger chance of being chosen).
 Cross over the parents to form a new offspring. If no crossover then offspring is an exact copy of the parents.
 Mutate new offspring at each locus.
 Place new offspring in the new population.
4. Use new generated population for a further execution.
5. If the end condition is fulfilled, stop, and return best solution in the current population.
6. Go to step 2
The optimal solution chosen in this case was solution number seven, based on the output of this genetic algorithm.
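The loop above can be sketched with the CRAN GA package (our toy illustration, not the authors' implementation; the
fitness function is a stand-in):

library(GA)
f <- function(x) -sum((x - 3)^2)  # example fitness, maximized at x = (3, 3)
res <- ga(type = "real-valued", fitness = f,
          lower = rep(-10, 2), upper = rep(10, 2),
          popSize = 50,      # n chromosomes (step 1)
          pcrossover = 0.8,  # crossover probability (step 3)
          pmutation = 0.1,   # mutation probability (step 3)
          maxiter = 100)     # end condition (step 5)
summary(res)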
Results:
The results from this review have been shown in the below table:
S No.   Method                  Classification Accuracy   Classification Error
1.      Decision Tree           94.38 %                   5.62 %
2.      Multilayer Perceptron   83.87 %                   16.13 %
3.      Logistic Regression     72.64 %                   27.36 %
The above table shows that out of the three classification techniques used, decision trees gave the best results as the
misclassification rate with decision trees is the least.
Summary:
This review provided insights into the methods we can try for a multiclass predictive-analytics problem and various
ways to improve those models: the C4.5 algorithm can be used to improve decision trees, and a pruning algorithm can
be used to improve the multilayer perceptron model. Another important takeaway was that boosted decision trees
built with the C4.5 package performed best of the three models in classifying the various steel defects, especially when
the results have to be interpreted by humans.
Reviewed By: Rahul Garg
CLASSIFICATION OF EEG SIGNALS USING NEURAL NETWORK AND LOGISTIC REGRESSION
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods
and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
Objective:
This paper is about the detection of epileptiform discharges in the EEG using logistic regression and artificial neural
network models. Epileptic seizures can occur in many different ways; EEG signals carry a lot of information, and
accurate classification and evaluation of these signals may turn out to be a breakthrough in medical science. The paper
compares the traditional method of logistic regression with the more advanced neural network techniques as
mathematical tools for developing classifiers to detect epileptic seizure in multi-channel EEG. The authors developed
two models, one using logistic regression and one using artificial neural networks: a multilayer perceptron neural
network (MLPNN) trained with back propagation and the Levenberg-Marquardt training algorithm. The two methods
were then compared. After comparing the results, the authors concluded that the neural network proved to be a
better model than logistic regression. This implies that the MLPNN is more accurate and easier to build, since when
developing logistic regression equations we start with no knowledge of the best combination of parameters or the
shape and degree of nonlinearity required to produce an optimal model.
Approach:
The EEG data used in this study was taken from 24-hour EEG recordings of both epileptic patients and normal subjects.
To assess the performance of the classifier, 500 EEG segments were selected containing spike-and-wave complexes,
artifacts and normal background EEG. Twenty absence seizures (petit mal) from five epileptic patients admitted for
video-EEG monitoring were analyzed, and each signal was inspected by experienced neurologists to score epileptic and
normal signals. Wavelet transform analysis was then performed, as it captures transient features and localizes them
accurately in both time and frequency. Next, logistic regression and neural network classifiers were developed by
randomly selecting 300 of the 500 available examples as the training set; the remaining 200 were kept for testing and
validating the developed models. The selection of the optimal network was based on monitoring the variation of error
and accuracy measures as the hidden layer was expanded and across training cycles. The sum of squared errors was
used for choosing the optimal model, and the optimum number of nodes in the hidden layer was found to be 21.
Finally, after testing both models, the best one was chosen based on the misclassification error rate and
sensitivity-specificity analysis.
Results:
The paper's results compare the two models on classification accuracy and on sensitivity-specificity analysis of the test
data; the MLPNN shows higher accuracy and a larger area under the ROC curve.
Summary:
This paper gave a better understanding of neural network analysis, a technique beyond those learnt in class. It provided
insights into choosing the optimal number of hidden nodes in the model and into the limitations of the logistic
regression model. Another major takeaway is the evaluation and comparison of the traditional logistic regression
model used for classification against the much newer multilayer perceptron neural network. Last but not least, the
paper introduced wavelet transform analysis, which is very effective at capturing transient features and localizing them
in both the time and frequency domains.
Reviewed by: Vinayak Nair
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAULTS
DETECTION
Halawani, M. (2014). A study of decision tree ensembles and feature selection for steel plates faults detection.
International Journal of Technical Research and Applications, 2(4), 127-131.
1. Objective:
Detection of steel plate defects is a serious problem in industry; it is often performed by human operators, which is
expensive and slow. This can be tackled by automating the process. The paper shows the application of decision tree
ensembles to fault detection: several ensembles (random subspace, bagging, AdaBoost.M1 and random forests) are
used to perform steel plate fault detection, and the best method for the problem is identified. The effect of removing
insignificant features is also studied. The results suggest that AdaBoost.M1 and random subspace are the best
ensemble methods, with prediction accuracies greater than 80%.
2. Approach:
Random subspace, bagging, AdaBoost.M1 and random forest classifier ensembles were applied to the UCI dataset and
their prediction accuracies calculated. Different selections of predictors were also tried.
3. Results:
Classification errors were tabulated for three predictor sets: all 27 predictors, the 20 most important predictors, and
the 15 most important predictors.
Random subspace performed best for the first and third predictor sets; AdaBoost.M1 came first for the second. When
the best 20 predictors were selected, the results of all methods except random subspace improved. With only 15
predictors, model performance dropped, indicating that some important predictors had been left out.
4. Summary:
The single decision tree model consistently gave worse results than random subspace, AdaBoost.M1, bagging and
random forests, which means that we will have to use decision tree ensembles in our project as well. Feature selection
is also very important, as selecting the most important predictors reduces the error rate.
Reviewed by: Vinayak Nair
COMPARISON OF THREE CLASSIFICATION TECHNIQUES, CART, C4.5 AND MULTI-LAYER
PERCEPTRONS
Tsoi, A.C., Pearson, R.A. (1991) Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons.
Advances in Neural Information Processing Systems 3 R. Morgan Kaufmann Publishers, San Mateo, CA. 963-969.
1. Objective:
There are many popular algorithms, such as CART (classification and regression trees), the MLP (multilayer perceptron)
and C4.5, and there is a need to know how these methods compare against each other. By comparing different
methods on constrained data, we can make qualitative statements about them; addressing this problem can therefore
help practitioners make fewer mistakes when applying a particular method to practical problems.
The key objective is to compare the three algorithms, CART, MLP and C4.5, on their classification and generalization
capabilities. The algorithms are run on a version of the Penzias example and the results summarized.
It was found that, in general, the MLP has better classification and generalization accuracies than the other two
algorithms.
2. Approach:
For comparing classification performance, data known as the clump example (8th-order Penzias) was used. All 256
examples were used as both the training and the test set.
For comparing generalization performance, the same data was used, with the first 200 examples as the training set and
the rest as the test set.
Parameters used: in the MLP, both the learning rate and the momentum are set at 0.1, and the architecture is 8 input
neurons, 5 hidden-layer neurons and 4 output neurons. In CART, the prior probabilities are set to be equi-probable, and
pruning is performed when the probability of the leaf node equals 0.5. In C4.5, all default values are used.
3. Results:
In the classification results, mlp1 and mlp2 denote the MLP after 10,000 and 100,000 training iterations, respectively.
The MLP accuracies improve with the number of iterations (up to about 20,000 iterations).
In the generalization results on the same data, the generalization accuracy of the MLP is observed to be better than
CART's and comparable to C4.5's.
4. Summary:
It is found that the MLP, once converged, generally has better classification and generalization accuracies than CART or
C4.5. On the other hand, the prediction errors made by each algorithm are different, which indicates that it may be
possible to combine these algorithms in such a way that their prediction accuracies are improved. This is presented as a
challenge for future research.
Reviewed by: Vinayak Nair
COMPARISON OF LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS: A
SIMULATION STUDY
Pohar, M., Blas, M., & Turk, S. (2004). Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation
Study. Metodoloski zvezki, 1(1), 143-161.
1. Objective
Linear Discriminant Analysis (LDA) and Logistic Regression (LR) are two widely used statistical methods. Though both of
them can be used to develop linear classification models, we need to have a set of guidelines for proper selection. While
LR makes no assumptions on the distribution of the explanatory data, LDA has been developed for normally distributed
explanatory variables. The method appropriate to a problem will give better results. The objective of the paper is to
understand when to choose LDA and when logistic regression. The two methods are compared and their performance is
studied using simulations. The results of LDA and LR were found to be close whenever the normality assumptions are
not too badly violated, and some guidelines were set for recognizing these situations. The inappropriateness of LDA in all
other cases is discussed.
2. Approach:
The simplest and the most frequently used criterion for comparison between the two methods is classification error
(percent of incorrectly classified objects; CE). However, classification error is a very insensitive and statistically inefficient
measure (Harrell, 1997). Harrell and Lee (1985) proposed four different measures of comparing predictive accuracy of
the two methods. These measures are indexes A, B, C and Q. They are better and more efficient criteria for comparisons
and they tell us how well the models discriminate between the groups and/or how good the prediction is.
In these definitions, Pk denotes an estimate of P(Yk = 1 | Xk), I is an indicator function, Pi is the probability of
classification into group i, Yi is the actual group membership (1 or 0), and n is the sample size of both populations.
Random samples of size n and m are drawn from two multivariate normal populations with different mean vectors but
equal covariance matrix Σ. The mean vector of one group is always set at (0,0); the distance to the other is measured
using the Mahalanobis distance, while the direction is set as the angle (denoted by υ) to the direction of the eigenvector
of the covariance matrix.
Each sample is then randomly divided into two parts, a training and a test sample. The coefficients of LDA and LR are
computed using the first sample and then predictions are made in the second one. The sampling experiment is
replicated 50 times. Each time the indexes for both methods are computed. Finally, the average value of indexes and the
proportion of simulations in which LR performs better are recorded.
After sampling, the normally distributed variables can be categorized, either only one or both of them. The minimum
and maximum value are computed, then the whole interval is divided into a certain number of categories of equal size.
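One replicate of this experiment might look as follows in R (our sketch under assumed sample sizes, not the authors'
code):

library(MASS)
set.seed(1)
n <- 100
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2)             # common covariance matrix
g1 <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
g2 <- mvrnorm(n, mu = c(1.5, 0), Sigma = Sigma)   # shifted mean vector
d <- data.frame(rbind(g1, g2), y = factor(rep(0:1, each = n)))
idx <- sample(2 * n, n)                           # random train/test split
ld <- lda(y ~ ., data = d[idx, ])
lr <- glm(y ~ ., data = d[idx, ], family = binomial)
ce.lda <- mean(predict(ld, d[-idx, ])$class != d$y[-idx])
ce.lr  <- mean((predict(lr, d[-idx, ], type = "response") > 0.5) != (d$y[-idx] == 1))
c(LDA = ce.lda, LR = ce.lr)                       # classification errors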
3. Results:
The sample size has the most obvious impact on the difference between methods. LDA assumes normality and the
errors it makes in prediction are only due to the errors in estimation of the mean and variance on the sample. On the
contrary, LR adapts itself to distribution and assumes nothing about it. Therefore, in the case of small samples, the
difference between the distribution of the training sample and that of the test sample can be substantial. But, as the
sample size increases, the sampling distributions become more stable which leads to better results for the LR.
Consequently, the results of the two methods are getting closer because the populations are normally distributed.
The results from Table 1 confirm this consideration. As the sample size increases, the LDA coefficient estimations
become more accurate and therefore all four indexes are improving. The LR indexes are increasing even faster, thus
approaching those of LDA. Decreasing difference between the two methods is best presented with the Q index, which is
the most sensitive one. As the differences between index means are negligible, it is also interesting to look at the
proportion of simulations in which LR performs better. The proportions we pay special attention to, those of the B and
Q indexes, increase steadily.
In the case of other changes, the results of the two methods were found to remain very close; in fact LDA is only a little
bit better than LR.
Simulations are carried out to study the effects of categorization and non-linearity, but are not presented in this
literature review due to a lack of space. However, the major takeaways from the results have been summarized in the
next section.
4. Summary:
LDA is a more appropriate method when the explanatory variables are normally distributed. In the case of categorized
variables, LDA remains preferable and fails only when the number of categories is really small (2 or 3). The results of LR,
however, are in all these cases constantly close and a little worse than those of LDA. But whenever the assumptions of
LDA are not met, the usage of LDA is not justified, while LR gives good results regardless of the distribution. As the
estimates for LR are obtained by the maximum likelihood method, they have a number of nice asymptotic properties as
well.
Project Approach
Analysis Flow Chart:
Problem Description
The goal is to propose the model with the highest prediction accuracy that can be implemented in the steel plate
manufacturing process to detect faults during production and thus help reduce them through proper preventive
measures. The assumptions are:
 The data available is the exact data that is taken from the production line and has no manipulations.
 The data is not biased and is randomly selected data from different production lines (if present) and collected over a
period of time.
Given Data
The data used for this project is taken from the UCI repository. The dataset covers 7 different steel plate faults and 27
attributes describing features of the manufactured steel plates and of the manufacturing process.
Data Set Information:
Type of dependent variables (7 Types of Steel Plates Faults):
1. Pastry
2. Z_Scratch
3. K_Scatch
4. Stains
5. Dirtiness
6. Bumps
7. Other_Faults
Attribute Information:
27 independent variables:
X_Minimum
X_Maximum
Y_Minimum
Y_Maximum
Pixels_Areas
X_Perimeter
Y_Perimeter
Sum_of_Luminosity
Minimum_of_Luminosity
Maximum_of_Luminosity
Length_of_Conveyer
TypeOfSteel_A300
TypeOfSteel_A400
Steel_Plate_Thickness
Edges_Index
Empty_Index
Square_Index
Outside_X_Index
Edges_X_Index
Edges_Y_Index
Outside_Global_Index
LogOfAreas
Log_X_Index
Log_Y_Index
Orientation_Index
Luminosity_Index
SigmoidOfAreas
Preliminary analysis of the data:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas
1 42 50 270900 270944 267
2 645 651 2538079 2538108 108
3 829 835 1553913 1553931 71
4 853 860 369370 369415 176
5 1289 1306 498078 498335 2409
6 430 441 100250 100337 630
X_Perimeter Y_Perimeter Sum_of_Luminosity
1 17 44 24220
2 10 30 11397
3 8 19 7972
4 13 45 18996
5 60 260 246930
6 20 87 62357
Minimum_of_Luminosity Maximum_of_Luminosity
1 76 108
2 84 123
3 99 125
4 99 126
5 37 126
6 64 127
Length_of_Conveyor TypeOfSteel_A300 TypeOfSteel_A400
1 1687 1 0
2 1687 1 0
3 1623 1 0
4 1353 0 1
5 1353 0 1
6 1387 0 1
Steel_Plate_Thickness Edges_Index Empty_Index
1 80 0.0498 0.2415
2 80 0.7647 0.3793
3 100 0.9710 0.3426
4 290 0.7287 0.4413
5 185 0.0695 0.4486
6 40 0.6200 0.3417
Square_Index Outside_X_Index Edges_X_Index Edges_Y_Index
1 0.1818 0.0047 0.4706 1.0000
2 0.2069 0.0036 0.6000 0.9667
3 0.3333 0.0037 0.7500 0.9474
4 0.1556 0.0052 0.5385 1.0000
5 0.0662 0.0126 0.2833 0.9885
6 0.1264 0.0079 0.5500 1.0000
Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index
1 1 2.4265 0.9031 1.6435
2 1 2.0334 0.7782 1.4624
3 1 1.8513 0.7782 1.2553
4 1 2.2455 0.8451 1.6532
5 1 3.3818 1.2305 2.4099
6 1 2.7993 1.0414 1.9395
Orientation_Index Luminosity_Index SigmoidOfAreas Pastry
1 0.8182 -0.2913 0.5822 1
2 0.7931 -0.1756 0.2984 1
3 0.6667 -0.1228 0.2150 1
4 0.8444 -0.1568 0.5212 1
5 0.9338 -0.1992 1.0000 1
6 0.8736 -0.2267 0.9874 1
Z_Scratch K_Scratch Stains Dirtiness Bumps Other_Faults
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
The preliminary data analysis shows that
 There is at least one defect associated with every row of attributes.
 There are 1941 defect entries in the whole dataset, equal to the total number of rows in the dataset.
 Other_Faults account for the majority of the defects; almost 35% of the recorded defects are Other_Faults. It can
thus be expected that the misclassification error will be higher for this class.
 No two defects are associated with a single row of input; exactly one defect occurs for each row of attributes in
the data.
The counts of each defect type are:
Type of Defect     Number of Occurrences
Pastry (1)         158
Z_Scratch (2)      190
K_Scratch (3)      391
Stains (4)         72
Dirtiness (5)      55
Bumps (6)          402
Other_Faults (7)   673
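A quick sketch (assuming the seven defect indicators sit in columns 28-34 of the data frame) of how these counts can
be checked:

defects <- data[, 28:34]    # the seven 0/1 defect indicator columns
colSums(defects)            # occurrences of each defect type
sum(rowSums(defects) == 1)  # 1941, i.e. exactly one defect per row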
Description of new techniques used:
The new techniques chosen for this project are neural network analysis and C5.0 Decision trees.
1) Artificial Neural Networks
What is a Neural Network?
An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous
systems, such as the brain, process information.
Figure: components of a neuron; the synapse.
The figure above shows the structure of the human neural system. In the human brain, a neuron receives signals from
all parts of the body through a huge number of dendrites, and sends signals out as electrical activity through the axon.
Learning occurs through changes in the energy levels of the neurons.
What is an Artificial Neural Network?
Artificial neural network is a computing system made up of a number of simple, highly interconnected processing
elements, which process information by their dynamic state response to external inputs.
The structure of a neural-network algorithm has three layers:
 The input layer feeds past data values into the next (hidden) layer. The black circles represent nodes of the
neural network.
 The hidden layer encapsulates several complex functions that create predictors; often those functions are
hidden from the user. A set of nodes (black circles) at the hidden layer represents mathematical functions that
modify the input data; these functions are called neurons.
 The output layer collects the predictions made in the hidden layer and produces the final result: the model’s
prediction.
 Neurons in a neural network can use sigmoid functions to map inputs to outputs. When used that way, a
sigmoid function is called a logistic function, and its formula is:
f(input) = 1 / (1 + e^(-input))
Figure: artificial neural network representation.
2) C5.0 Decision Trees
A decision tree can be considered a system for organizing a large amount of information graphically.
A decision tree consists of internal nodes that represent the decisions corresponding to the hyper-planes or split points
(i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which
are labeled with the majority class. A region is characterized by the subset of data points that lie in that region.
One of the advantages of decision trees is that they produce models that are relatively easy to interpret. In particular, a
tree can be read as set of decision rules, with each rule’s antecedent comprising the decisions on the internal nodes
along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint
and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.
An example of the decision tree is seen in the following figure.
The C5.0 algorithm works like ID3 but improves on several of ID3's behaviors. The new features (versus ID3) are:
1) Accepts both continuous and discrete features
2) Handles incomplete data points
3) Pruning is already included in the package and thus the results are after pruning.
4) Ability to use attributes with different weights.
5) Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or
cores
IMPLEMENTATION
Linear Discriminant Analysis
The modified multiclass dataset was modeled using Linear Discriminant Analysis to obtain the confusion matrix
and misclassification error rate for the test dataset. Cross-validation was also performed on the test data to
confirm our results.
> ## LDA (the lda() function is in the MASS package)
> library(MASS)
> lda.model = lda(alldefects ~ ., data = train2)
> lda_pred = predict(lda.model, test2)
> table(lda_pred$class, test.alldefects)
test.alldefects
A B C D E F G
A 25 0 0 0 0 1 15
B 5 50 0 0 0 4 10
C 2 0 91 0 0 0 4
D 0 0 1 26 0 0 2
E 3 0 0 0 18 0 6
F 4 2 4 0 1 67 40
G 9 5 27 1 4 39 117
> mean(lda_pred$class != test.alldefects)
[1] 0.3241852
> ## Cross-validation (leave-one-out, via CV = TRUE)
> lda.cv = lda(alldefects ~ ., test2, CV = TRUE)
> table(lda.cv$class, test.alldefects)
test.alldefects
A B C D E F G
A 24 0 1 0 1 1 15
B 5 48 0 0 0 5 10
C 0 0 104 0 0 0 2
D 1 0 1 24 0 0 1
E 3 0 0 0 16 0 9
F 4 3 1 0 1 61 42
G 11 6 16 3 5 44 115
> mean(lda.cv$class != test.alldefects)
[1] 0.3276158
The misclassification and cross validation error were 32.42% and 32.76% respectively.
Decision Tree
A single decision tree was then modelled on the modified dataset. The tree was also pruned to reduce the number of
branches and simplify the tree.
> ## Single decision tree
> library(tree)
> tree1 = tree(alldefects ~ ., data = train2)
> plot(tree1)
> text(tree1, pretty = 0)
> tree.pred = predict(tree1, test2, type = "class")
> table(tree.pred, test.alldefects)
test.alldefects
tree.pred A B C D E F G
A 0 0 0 0 0 0 0
B 3 51 0 0 0 2 5
C 0 0 98 0 0 0 2
D 0 0 0 23 0 0 1
E 0 0 0 0 0 0 0
F 6 0 0 1 7 80 56
G 39 6 25 3 16 29 130
> mean(tree.pred != test.alldefects)
[1] 0.3447684
> ## Pruning
> set.seed(1)
> cv.data = cv.tree(tree1, FUN = prune.misclass)
> plot(cv.data$size, cv.data$dev, type = "b")
> plot(cv.data$k, cv.data$dev, type = "b")
> prune.data = prune.misclass(tree1, best = 9)
> plot(prune.data)
> text(prune.data, pretty = 0)
> tree.pred2 = predict(prune.data, test2, type = "class")
> table(tree.pred2, test.alldefects)
test.alldefects
tree.pred2 A B C D E F G
A 0 0 0 0 0 0 0
B 3 51 0 0 0 2 5
C 0 0 88 0 0 1 8
D 0 0 0 23 0 0 1
E 0 0 0 0 0 0 0
F 2 0 0 0 1 59 28
G 43 6 35 4 22 49 152
> mean(tree.pred2 != test.alldefects)
[1] 0.3602058
The misclassification error rates obtained from the original and the pruned tree were 34.5% and 36.0% respectively.
The error rate did not increase much, justifying pruning to make the decision tree more readable.
Bagging
Bagging was used on the dataset to reduce the variance of a single decision tree by averaging the predictions of many
trees, each fit to a bootstrap resample of the training data.
> ## Bagging (a random forest with mtry = 27, i.e. all predictors considered at each split)
> set.seed(1)
> library(randomForest)
> bag.data = randomForest(alldefects ~ ., data = train2, mtry = 27, importance = TRUE)
> yhat.bag = predict(bag.data, test2)
> plot(yhat.bag, test.alldefects)
> abline(0, 1)
> table(yhat.bag, test.alldefects)
test.alldefects
yhat.bag A B C D E F G
A 30 0 0 0 0 5 7
B 0 50 0 0 0 1 0
C 0 0 112 0 0 0 1
D 0 0 0 24 0 0 1
E 1 0 0 0 19 1 2
F 4 0 0 1 2 76 32
G 13 7 11 2 2 28 151
> mean(yhat.bag != test.alldefects)
[1] 0.2075472
The misclassification error rate obtained was 20.75%.
Random Forest
The random forest method was applied to the modified dataset to de-correlate the bagged trees, further reducing the
variance.
> #randomforest
>set.seed (1)
>rf =randomForest(alldefects~.,data=train2 , importance =TRUE)
>yhat.rf = predict (rf ,test2)
>table(yhat.rf, test.alldefects)
test.alldefects
yhat.rf A B C D E F G
A 25 0 0 0 0 6 5
B 1 50 0 0 0 0 4
C 0 0 112 0 0 0 1
D 0 0 0 24 0 0 1
E 0 0 0 0 19 0 3
F 3 1 0 1 2 73 33
G 19 6 11 2 2 32 147
>mean(yhat.rf !=test.alldefects)
[1] 0.2281304
The misclassification error rate obtained was 22.81%, slightly higher than bagging, but the de-correlated trees give a
lower-variance model, which generally performs better on future data points.
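The forest above used the default mtry of floor(sqrt(27)) = 5 predictors per split; a possible way to check this choice against the out-of-bag error is randomForest's tuneRF (illustrative sketch, not part of the reported analysis):
set.seed(1)
tuneRF(train2[, -28], train2$alldefects,   # predictors / multiclass response
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01)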
C5.0
An advanced decision tree technique known as C5.0 was also used to model the modified dataset.
> #C50
>crx<- data[ sample( nrow( data ) ), ]
> X <- crx[,1:27]
> y <- crx[,35]
>trainx<- X[1:1358,]
>trainy<- y[1:1358]
>testx<- X[1359:1941,]
>testy<- y[1359:1941]
>model<- C5.0( trainx, trainy, trials=75 )
> p <- predict( model, testx, type="class" )
>table(p, testy)
testy
p A B C D E F G
A 32 0 0 0 1 2 6
B 1 39 0 0 0 1 4
C 0 0 111 0 0 1 3
D 0 0 0 26 0 2 0
E 1 0 0 0 14 2 2
F 11 0 0 1 0 96 19
G 15 1 7 1 1 31 153
>mean(p != testy)
[1] 0.1934932
The misclassification error rate obtained was 19.35%.
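For reference, the contribution of boosting can be gauged by refitting with a single trial (a quick check, not reported in the original analysis):
m1 <- C5.0(trainx, trainy, trials = 1)    # single unboosted tree
mean(predict(m1, testx) != testy)         # compare with the 19.35% above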
Support Vector Machines
SVM was tried on the training dataset for different values of the cost parameter C; the best results were obtained with C=15.
> svm.fit=svm(alldefects~.,data=train2,type="C",kernel="polynomial",degree=3, cost=15)
>summary(svm.fit)
Call:
svm(formula = alldefects ~ ., data = train2, type = "C", kernel = "polynomial",degree =
3, cost = 15)
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 15
degree: 3
gamma: 0.03703703704
coef.0: 0
Number of Support Vectors: 819
( 43 225 347 87 74 18 25 )
Number of Classes: 7
Levels:
A B C D E F G
The plot for SVM on the training data is shown in the figure below. It is a 2-D plot with Edges_X_Index
and Edges_Y_Index as its axes; the circular symbols show the data points and the triangles show the support vectors.
> predicted=predict(svm.fit,test2)
>table(predicted,test.alldefects)
test.alldefects
predicted A B C D E F G
A 28 1 0 0 0 6 7
B 0 49 0 0 0 3 6
C 0 1 113 0 0 0 4
D 0 0 0 25 0 0 0
E 0 0 0 0 17 0 3
F 6 0 3 1 5 69 53
G 14 6 7 1 1 33 121
>mean(predicted!=test.alldefects)
[1] 0.2761578045
The misclassification rate for SVM on the testing data is about 27.6%.
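The cost value above was found by manual search; a sketch of automating it with e1071's tune function (the parameter grid shown is illustrative) would be:
library(e1071)
set.seed(1)
tuned <- tune(svm, alldefects ~ ., data = train2,
              kernel = "polynomial", degree = 3,
              ranges = list(cost = c(1, 5, 10, 15, 20)))
summary(tuned)          # cross-validated error for each cost
tuned$best.parameters   # cost with the lowest CV error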
Artificial Neural Networks
For this project, artificial neural network models were developed using two different methods. The first model uses the
nnet function from the nnet library in R. The second model was created as a multilayer perceptron, using the mlp
function provided in the RSNNS library in R.
1st Method
In this method the model was built with the nnet function from the nnet library in R. Several configurations were tried
by changing the number of hidden units (expressed as size in the code); rang, decay and maxit were also varied to get
a lower misclassification rate. The best model had 20 units in the single hidden layer, with the other settings as shown
in the code.
train.nnet<-
nnet(alldefects~.,data=train2,size=20,rang=0.1,Hess=FALSE,decay=0.001,maxit=10000)
# weights: 707
initial value 2736.596055
iter 10 value 2082.470121
iter 20 value 2006.626658
iter 30 value 1963.775429
iter 40 value 1907.670254
iter 50 value 1901.104216
iter 60 value 1841.389091
iter 70 value 1815.725249
iter 80 value 1804.856698
iter 90 value 1801.382263
iter 100 value 1801.021638
iter 110 value 1797.549455
iter 120 value 1797.305184
iter 130 value 1797.183004
iter 140 value 1796.918336
iter 150 value 1795.256115
iter 160 value 1793.025804
final value 1792.714314
converged
test.nnet<-predict(train.nnet,test2,type=("class"))
table(test2$alldefects,test.nnet)
test.nnet
1 3 7
1 0 5 43
2 0 7 50
3 0 98 25
4 0 0 27
5 0 1 22
6 0 6 105
7 1 22 171
mean(test.nnet!=test2$alldefects)
[1] 0.5385935
The misclassification rate for ANN on the testing data is about 53.9%.
2nd Method
In this method the artificial neural network was built using the mlp function available in the RSNNS library in R.
This multilayer perceptron takes the predictors, the responses, and the sizes of the hidden layers as input.
> model=mlp(x[samp,], y[samp,], size=c(10,10,5),linOut=F)
> test.cl(y[-samp,], predict(model, x[-samp,]))
cres
true 3 7
1 3 42
2 2 52
3 50 63
4 0 25
5 0 16
6 3 110
7 16 200
> test.cl(y[samp,],fitted.values(model))
cres
true 3 7
1 5 108
2 6 130
3 119 159
4 0 47
5 4 35
6 4 285
7 33 424
The misclassification rate for this ANN on the testing data is about 60.01%.
The misclassification rates of the models built using artificial neural networks are thus very high. The least error
rate achieved was with the nnet package, at about 54%.
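Although there is no closed-form way to choose the network size, the trial-and-error tuning described above can at least be systematized; a minimal grid-search sketch over size and decay (illustrative, assuming the same train2/test2 split) is:
library(nnet)
grid <- expand.grid(size = c(5, 10, 20), decay = c(0.1, 0.01, 0.001))
err <- apply(grid, 1, function(g) {
  fit <- nnet(alldefects ~ ., data = train2, size = g["size"],
              decay = g["decay"], maxit = 500, trace = FALSE)
  mean(predict(fit, test2, type = "class") != test2$alldefects)
})
grid[which.min(err), ]   # (size, decay) pair with the lowest test error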
Logistic regression
We developed 7 different logistic models, one for the classification of each type of defect, as we noticed that each
defect depended on a different set of predictors. The aim is a hierarchical model: test whether one defect is present;
if it is, stop (the dataset implies that each steel plate has only one kind of defect); if it is not, test for the
presence of the next type of defect, and so on. A minimal sketch of this loop follows.
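A possible implementation (hypothetical helper, not code from the report; 'models' is assumed to be a named list of the fitted glm objects in the order they are tested):
classify_plate <- function(newdata, models, threshold = 0.5) {
  for (defect in names(models)) {
    p <- predict(models[[defect]], newdata, type = "response")
    if (p > threshold) return(defect)  # stop at the first detected defect
  }
  NA_character_  # no model fired: plate left unclassified
}
# e.g. classify_plate(test[1, ], list(Pastry = log_pastry, Z_Scratch = log_zs))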
Logistic regression model for the first type of defect: Pastry
>train_pastry=train[,-c(29,30,31,32,33,34,35)]
>fix(train_pastry)
>test_pastry= test[,-c(29,30,31,32,33,34,35)]
>log_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orienta
tion_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minim
um_of_Luminosity,data=train_pastry,family = "binomial")
>summary(log_pastry)
Call:
glm(formula = Pastry ~ LogOfAreas + TypeOfSteel_A300 + Sum_of_Luminosity +
Log_X_Index + Square_Index + Orientation_Index + Log_Y_Index +
Maximum_of_Luminosity + X_Maximum + X_Minimum + Length_of_Conveyor +
Minimum_of_Luminosity, family = "binomial", data = train_pastry)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.01159 -0.26886 -0.05515 0.00000 3.08897
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.178e+01 3.366e+00 -3.500 0.000465 ***
LogOfAreas 4.551e+00 2.015e+00 2.258 0.023940 *
TypeOfSteel_A300 -5.827e-01 3.077e-01 -1.894 0.058260 .
Sum_of_Luminosity 7.952e-06 2.324e-06 3.422 0.000621 ***
Log_X_Index 1.377e+01 6.238e+00 2.207 0.027308 *
Square_Index -4.378e+00 1.186e+00 -3.692 0.000223 ***
Orientation_Index 4.450e+00 1.267e+00 3.512 0.000445 ***
Log_Y_Index -9.260e+00 2.309e+00 -4.010 6.07e-05 ***
Maximum_of_Luminosity 3.396e-02 9.677e-03 3.510 0.000449 ***
X_Maximum -7.516e-01 2.283e-01 -3.292 0.000995 ***
X_Minimum 7.521e-01 2.283e-01 3.294 0.000987 ***
Length_of_Conveyor 3.830e-03 8.421e-04 4.548 5.41e-06 ***
Minimum_of_Luminosity -4.196e-02 8.145e-03 -5.151 2.59e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 427.59 on 1345 degrees of freedom
AIC: 453.59
Number of Fisher Scoring iterations: 14
>log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
>log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment
>log_pastry_pred_y[log_pastry_pred> 0.5]= 1
>table(log_pastry_pred_y, test_pastry[,28])
log_pastry_pred_y 0 1
0 528 33
1 7 15
>mean(log_pastry_pred_y != test_pastry[,28])
[1] 0.06861063
We see that the misclassification error rate is below 7%, which is acceptable for this
individual model. We also cross-validated these results using the K-fold cross-validation
technique.
># cross validation
>cv_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orienta
tion_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minim
um_of_Luminosity,data=train_pastry,family = "binomial")
>cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
[1] 0.06969064
The misclassification error rate obtained was 6.97%.
Similarly, individual classification models were developed for the rest of the defects; the results are
tabulated below (confusion matrices: rows = predicted 0/1, columns = actual 0/1).

Defect         Confusion matrix      Error rate   CV error
Pastry         528  33 /  7  15      0.068        0.069
Z_Scratch      511   9 / 15  48      0.041        0.030
K_Scratch      453  19 /  7 104      0.045        0.018
Stains         554   9 /  2  18      0.0189       0.020
Dirtiness      554  12 /  6  11      0.031        0.015
Bumps          447  68 / 25  43      0.16         0.124
Other_Faults   346 104 / 43  90      0.252        0.176

The combined accuracy of the hierarchical model would be (1-0.068)*(1-0.041)*(1-0.045)*(1-0.0189)*(1-0.031)*(1-0.16)*(1-0.252) = 0.51 = 51%.
* Since each plate carries only one defect, the hierarchical model is right only when every individual stage classifies
correctly, so the individual model accuracies are multiplied.
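The same calculation can be written directly in R (a one-line sketch over the tabulated error rates):
err <- c(Pastry = 0.068, Z_Scratch = 0.041, K_Scratch = 0.045,
         Stains = 0.0189, Dirtiness = 0.031, Bumps = 0.16, Other_Faults = 0.252)
prod(1 - err)   # 0.5099, i.e. ~51%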
Random Forest
Individual responses were then modeled with random forest to get the respective error rates.
> ##randomforest with Pastry only
>train_pastry$Pastry=factor(train_pastry$Pastry)
>test_pastry$Pastry=factor(test_pastry$Pastry)
>rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
>yhat.rf_pastry = predict (rf_pastry ,test_pastry)
>table(yhat.rf_pastry, test_pastry[,28])
yhat.rf_pastry 0 1
0 529 32
1 6 16
>mean(yhat.rf_pastry!=test_pastry[,28])
[1] 0.0651801
The misclassification error rate obtained was 6.52%.
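Since importance = TRUE was set, the predictors driving the Pastry forest can also be inspected (a quick supplementary look, mirroring the varImpPlot call in the appendix):
importance(rf_pastry)[1:5, ]                        # first rows of the importance matrix
varImpPlot(rf_pastry, main = "Pastry: variable importance")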
Similarly, random forest models were developed for the other individual defects, with the results tabulated
below (confusion matrices: rows = predicted 0/1, columns = actual 0/1).

Defect         Confusion matrix      Error rate
Pastry         529  32 /  6  16      0.065
Z_Scratch      522   9 /  4  48      0.022
K_Scratch      459  14 /  1 109      0.0257
Stains         556   3 /  0  24      0.0051
Dirtiness      558   9 /  2  14      0.0189
Bumps          460  53 / 12  58      0.111
Other_Faults   363  73 / 26 121      0.17

The combined accuracy of the hierarchical model would be (1-0.065)*(1-0.022)*(1-0.0257)*(1-0.005)*(1-0.0189)*(1-0.111)*(1-0.17) = 0.6417 = 64.17%.
* Since each plate carries only one defect, the hierarchical model is right only when every individual stage classifies
correctly, so the individual model accuracies are multiplied.
Principal Component Analysis
A dimension-reduction technique was applied to the dataset, following the 80/20 rule to extract the "vital few"
prediction terms from the "trivial many".
>#PCA on complete data set
>datap=data[,-(28:35)]
>fit <- princomp(datap, cor=TRUE)
>summary(fit) # print variance accounted for
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.7 Comp.8 Comp.9
Standard deviation 2.8815442 1.8493487 1.6443472 1.49596665 1.40409954 1.27423486
1.17387303 0.99985628 0.96006830
Proportion of Variance 0.3075295 0.1266700 0.1001436 0.08288579 0.07301835 0.06013609
0.05103622 0.03702639 0.03413819
Cumulative Proportion 0.3075295 0.4341995 0.5343431 0.61722894 0.69024729 0.75038338
0.80141960 0.83844599 0.87258418
Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Comp.16 Comp.17
Standard deviation 0.88369299 0.84586524 0.73974691 0.62701635 0.54299087 0.489154397
0.434472983 0.316363504
Proportion of Variance 0.02892271 0.02649956 0.02026761 0.01456109 0.01091997 0.008861927
0.006991362 0.003706884
Cumulative Proportion 0.90150689 0.92800645 0.94827406 0.96283515 0.97375512 0.982617047
0.989608409 0.993315293
Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
Comp.23 Comp.24 Comp.25
Standard deviation 0.243539500 0.235660669 0.211674618 0.1093589373 0.0837014355
3.693381e-02 2.217081e-02 3.543139e-03
Proportion of Variance 0.002196722 0.002056887 0.001659487 0.0004429399 0.0002594789
5.052246e-05 1.820536e-05 4.649569e-07
Cumulative Proportion 0.995512015 0.997568902 0.999228388 0.9996713283 0.9999308072
9.999813e-01 9.999995e-01 1.000000e+00
Comp.26 Comp.27
Standard deviation 5.417823e-06 1.218954e-08
Proportion of Variance 1.087141e-12 5.503141e-18
Cumulative Proportion 1.000000e+00 1.000000e+00
>plot(fit,type="lines") # scree plot
>biplot(fit)
From the analysis and the scree plot, the first 7 principal components were selected, as they explain just over
80% of the variability in the sample space (cumulative proportion 0.8014).
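The 80% cutoff can be read directly off the princomp fit as a check:
pve <- fit$sdev^2 / sum(fit$sdev^2)   # proportion of variance explained
which(cumsum(pve) >= 0.80)[1]         # first component crossing 80% -> Comp.7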
The principal components were then extracted and stored in two different files:
• One with only the top 7 principal components
• Another with the original data and the top 7 principal components combined
>axes<- predict(fit, newdata = datap)
>fix(axes)
> data1=axes[,1:7]
>fix(data1)
>write.csv(data1,file="pcadata.csv") #data file with the top 7 PCs
>data2=data.frame(data,data1)
>write.csv(data2,file="comb_data.csv")#data file with the original data and the 7 PCs combined
These two data files were used for further modelling.
Model Formation with principal components
Logistic Regression
Logistic Regression was performed on the extracted principal components for individual responses.
Logistic regression model for the first type of defect: Pastry (Using the PCs)
>#######################logistic regression
> #00000000000000000000_pastry
>train_pastry=train[,-c(9,10,11,12,13,14,15)]
>fix(train_pastry)
>test_pastry= test[,-c(9,10,11,12,13,14,15)]
>log_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")
>summary(log_pastry)
Call:
glm(formula = Pastry ~ ., family = "binomial", data = train_pastry)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5477 -0.4039 -0.1326 -0.0243 3.5952
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.51776 0.38758 -11.656 < 2e-16 ***
Comp.1 -0.63814 0.11686 -5.461 4.75e-08 ***
Comp.2 -1.17226 0.14334 -8.178 2.88e-16 ***
Comp.3 -0.37291 0.08243 -4.524 6.07e-06 ***
Comp.4 -0.25626 0.08068 -3.176 0.00149 **
Comp.5 0.42288 0.07491 5.645 1.65e-08 ***
Comp.6 0.17290 0.09726 1.778 0.07546 .
Comp.7 0.40363 0.10328 3.908 9.31e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 541.13 on 1350 degrees of freedom
AIC: 557.13
Number of Fisher Scoring iterations: 8
>log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
>log_pastry_pred_y = rep(0, length(test_pastry[,8])) # default assignment
>log_pastry_pred_y[log_pastry_pred> 0.5]= 1
>table(log_pastry_pred_y, test_pastry[,8])
log_pastry_pred_y 0 1
0 529 44
1 6 4
>mean(log_pastry_pred_y != test_pastry[,8])
[1] 0.08576329
> # cross validation
>cv_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")
>cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
[1] 0.06398761
Similarly, individual classification models were developed for the rest of the defects; the results are
tabulated below (confusion matrices: rows = predicted 0/1, columns = actual 0/1).

Defect         Confusion matrix      Error rate   CV error
Pastry         529  44 /  6   4      0.0857       0.064
Z_Scratch      506  30 / 20  27      0.0857       0.0525
K_Scratch      452  29 /  8  94      0.0634       0.0257
Stains         553   7 /  3  20      0.0172       0.0157
Dirtiness      557  23 /  3   0      0.0446       0.0224
Bumps          447  86 / 25  25      0.19         0.14
Other_Faults   344 121 / 45  73      0.285        0.198

The combined accuracy of the hierarchical model would be (1-0.0857)*(1-0.0857)*(1-0.0634)*(1-0.0172)*(1-0.0446)*(1-0.19)*(1-0.285) = 0.4258 = 42.58%.
* Since each plate carries only one defect, the hierarchical model is right only when every individual stage classifies
correctly, so the individual model accuracies are multiplied.
Random Forest
The dataset consisting of the principal components was then used with a Random Forest model.
> #randomforest with Pastry only (on the principal-component dataset)
>set.seed (1)
>train_pastry$Pastry=factor(train_pastry$Pastry)
>test_pastry$Pastry=factor(test_pastry$Pastry)
>rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
>yhat.rf_pastry = predict (rf_pastry ,test_pastry)
>table(yhat.rf_pastry, test_pastry[,8])
yhat.rf_pastry 0 1
0 529 38
1 6 10
>mean(yhat.rf_pastry!=test_pastry[,8])
[1] 0.0754717
The misclassification error rate obtained was 7.54%.
With the model implementation showing that random forest gave better results (as expected), it was decided
to model all the individual responses with random forests using both datasets (the one with only the 7 PCs and the one
with the original predictors plus the 7 PCs).
The results obtained are tabulated below.
Misclassification error rates for individual random forests with different prediction terms:

S No.  Type of defect      Using first 7 PCs   Using all 27 predictors   Using all predictors + 7 PCs
1      Pastry (A)          0.075               0.065                     0.065
2      Z-Scratch (B)       0.046               0.022                     0.024
3      K-Scratch (C)       0.036               0.026                     0.027
4      Stains (D)          0.012               0.005                     0.005
5      Dirtiness (E)       0.019               0.019                     0.015
6      Bumps (F)           0.042               0.111                     0.127
7      Other Defects (G)   0.196               0.170                     0.168

Accuracy for the combined model:  0.635        0.641                     0.631
RESULTS
ROC Analysis
ROC analysis was conducted on the different multiclass models to aid us in selecting the best model.
Comparison of the different multiclass models (confusion matrices and ROC curve figures omitted):

S No.  Modelling Technique            Misclassification Error Rate   AUC
1      LDA                            0.324                          0.790
2      Decision Tree (After Pruning)  0.360                          0.784
3      Bagging                        0.208                          0.824
4      Random Forest                  0.228                          0.797
5      SVM                            0.276                          0.804
6      Neural Network Analysis        0.539                          0.605
7      C5.0                           0.194                          0.831
The C5.0 Decision Tree had the best performance on the testing dataset.
ROC analysis was also conducted for the logistic regression and random forest models that were developed for each
individual defect; the AUC values are tabulated below (the ROC curve figures are omitted).

AUC for individual defects using Logistic Regression:
Defect:  Pastry  Z_Scratch  K_Scratch  Stains  Dirtiness  Bumps  Other Faults
AUC:     0.65    0.91       0.92       0.83    0.73       0.67   0.68

AUC for individual defects using Random Forest:
Defect:  Pastry  Z_Scratch  K_Scratch  Stains  Dirtiness  Bumps  Other Faults
AUC:     0.65    0.91       0.92       0.83    0.73       0.67   0.68
CONCLUSION
The major takeaways from this project were:
• Advanced decision trees such as C5.0 and Random Forest are the most efficient techniques for dealing with
multiclass anomaly detection using machine learning.
• Although modelling each defect individually gives a very high accuracy rate for almost every defect, the
combined hierarchical model that would utilize them in practice is not as efficient, because its accuracy is the
product of the individual accuracies of the models in the hierarchy.
• Logistic regression, although a very powerful tool, does not seem to be a good fit for multiclass anomaly detection
problems, because a logistic model does not predict the type of defect directly but rather the probability of a
given defect occurring, via the log-likelihood function.
• SVM also turned out to be a good tool for multiclass classification, as its accuracy rate was high, but we
still prefer C5.0 over SVM because the SVM model required a very large number of support vectors (819).
• The artificial neural network results on this dataset were not satisfactory, with a very high misclassification
rate. This can have many causes, but the major one is the small dataset. There is also no systematic method to
fine-tune the number of hidden units, and a slight change in that number causes a significant change in
misclassification. So either the right parameter values were not found even after a lot of trial and error, or the
model could not be trained properly because of the small dataset.
Future Scope:
The dataset considered in this project calls for multiclass classification techniques, because the defects are mutually
exclusive; for the same reason, some techniques were applied in a multi-univariate fashion, that is, using a different
model for each fault. Multi-label classification applies when the same predictor values cause two or more defects at a
time, which is not the case in this dataset, so it was not required and hence not used. This project and its results are
therefore limited to multiclass classification with non-co-occurring faults. If, in future data, the defects do co-occur,
multi-label classification would have to be used, which forms the future scope of this project.
References
Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system",
Measurement, vol. 43, no. 4, pp. 513-519, 2010.
S. Jain, C. Azad and V. K. Jha, "Steel faults diagnosis under predictive analysis", International Journal of Computer
Engineering and Applications, vol. IV, issue II/III, Oct. 2013.
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods
and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
M. Fakhr and A. M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science,
vol. 8, no. 4, pp. 506-514, 2012.
S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International
Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms", Proceedings of the
23rd International Conference on Machine Learning (ICML '06), 2006. http://dx.doi.org/10.1145/1143844.1143865
F. Günther and S. Fritsch, "neuralnet: Training of Neural Networks", The R Journal, vol. 2/1, June 2010.
M. Halawani, "A study of decision tree ensembles and feature selection for steel plates faults detection",
International Journal of Technical Research and Applications, vol. 2, no. 4, pp. 127-131, 2014.
A. C. Tsoi and R. A. Pearson, "Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons",
Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, CA, pp. 963-969, 1991.
M. Pohar, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation
Study", Metodoloski zvezki, vol. 1, no. 1, pp. 143-161, 2004.
M. Caudill, "Neural Network Primer: Part I", AI Expert, Feb. 1989.
http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
http://saiconference.com/Downloads/SpecialIssueNo10/Paper_3A_comparative_study_of_decision_tree_ID3_and_C4.5.pdf
APPENDIX
1. On Original Dataset
library(ISLR)
library(boot)   # cv.glm
library(MASS)   # lda, qda
library(pROC)   # roc and multiclass.roc, used throughout this appendix
data=read.csv(file.choose(), header=T)
attach(data)
data$alldefects="A"
for(i in 1:1941) {
if ( Z_Scratch[i]==1) {data$alldefects[i]="B"}
if ( K_Scratch[i]==1) {data$alldefects[i]="C"}
if ( Stains[i]==1) {data$alldefects[i]="D"}
if ( Dirtiness[i]==1) {data$alldefects[i]="E"}
if ( Bumps[i]==1) {data$alldefects[i]="F"}
if ( Other_Faults[i]==1) {data$alldefects[i]="G"} }
data$alldefects=factor(data$alldefects)
set.seed(1)
trainingsample=sample(1:nrow(data), size=0.70*nrow(data))
train=data[trainingsample,]
test=data[-trainingsample,]
write.csv(train,file="exportedtrainingdata.csv")
write.csv(test,file="exportedtestingdata.csv")
train2=train[,-(28:34)]
test2=test[,-(28:34)]
test.alldefects=test2[,28]
#LDA
lda.model= lda(alldefects~., data = train2)
lda_pred= predict(lda.model, test2)
table(lda_pred$class, test.alldefects)
mean(lda_pred$class!= test.alldefects)
mean(lda_pred$class== test.alldefects)
lda.cv=lda(alldefects~.,test2, CV=TRUE)
table(lda.cv$class,test.alldefects)
mean(lda.cv$class!= test.alldefects)
predictions <- as.numeric(lda_pred$class, type="response")
multiclass.roc(test.alldefects, predictions, plot=T)
y=rep(0,length(lda_pred$class))
y[lda_pred$class==test.alldefects]=1
x=rep(0,length(test.alldefects))
x[test.alldefects==test.alldefects]=1
roc(x,y,plot=TRUE,main="LDA")
predictions_lda <- as.numeric(lda_pred,type="vote")
multiclass.roc(test.alldefects, predictions_lda, plot=T)
#qda
qda.model= qda(alldefects~., data = train2)
qda_pred= predict(qda.model, test2)
table(qda_pred$class, test.alldefects)
mean(qda_pred$class!= test.alldefects)
##tree
library(tree)
tree1=tree(train2$alldefects~.,data=train2)
plot(tree1)
text(tree1 ,pretty =0)
tree.pred=predict(tree1,test2,type="class")
table(tree.pred ,test.alldefects)
mean(tree.pred!=test.alldefects)
predictions_tree <- as.numeric(tree.pred,type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)
##pruning
set.seed (1)
cv.data =cv.tree(tree1 ,FUN=prune.misclass )
names(cv.data)
cv.data
par(mfrow =c(1,1))
plot(cv.data$size ,cv.data$dev ,type="b")
plot(cv.data$k ,cv.data$dev ,type="b")
prune.data = prune.misclass(tree1 ,best =9)
plot(prune.data)
text(prune.data,pretty =0)
tree.pred2=predict(prune.data , test2 ,type="class")
table(tree.pred2 ,test.alldefects)
mean(tree.pred2!=test.alldefects)
predictions_tree <- as.numeric(tree.pred2,type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)
## Bagging
set.seed (1)
library(randomForest)   # randomForest() is used below for bagging (mtry = all predictors)
bag.data
=randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness
+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity,data=train2 , mtry=10,importance =TRUE)
bag.data =randomForest(alldefects~.,data=train2 , mtry=27,importance =TRUE)
bag.data
yhat.bag = predict (bag.data ,test2)
plot(yhat.bag , test.alldefects)
abline (0,1)
table(yhat.bag, test.alldefects)
mean( yhat.bag!=test.alldefects)
predictions_bag <- as.numeric(yhat.bag,type="response")
multiclass.roc(test.alldefects, predictions_bag, plot=T)
#randomforest
set.seed (1)
library(randomForest)
rf =randomForest(alldefects~.,data=train2 , importance =TRUE)
yhat.rf = predict (rf ,test2)
table(yhat.rf, test.alldefects)
mean( yhat.rf !=test.alldefects)
predictions <- as.numeric(predict(rf, test2, type = 'response'))
multiclass.roc(test.alldefects, predictions, plot=T)
#randomforest with important predictors
set.seed (1)
library(randomForest)
rrf
=randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness
+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity,data=train2 , importance =TRUE)
yhat.rrf = predict (rrf ,test2)
table(yhat.rrf, test.alldefects)
mean( yhat.rrf !=test.alldefects)
#randomforest with Pastry only
set.seed (1)
train_pastry$Pastry=factor(train_pastry$Pastry)
test_pastry$Pastry=factor(test_pastry$Pastry)
rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)
yhat.rf_pastry = predict (rf_pastry ,test_pastry)
table(yhat.rf_pastry, test_pastry[,28])
mean( yhat.rf_pastry!=test_pastry[,28])
#randomforest with z_scratch only
set.seed (1)
train_zs$Z_Scratch=factor(train_zs$Z_Scratch)
test_zs$Z_Scratch=factor(test_zs$Z_Scratch)
rf_zs =randomForest(Z_Scratch~.,data=train_zs, importance =TRUE)
yhat.rf_zs = predict (rf_zs ,test_zs)
table(yhat.rf_zs, test_zs[,28])
mean( yhat.rf_zs!=test_zs[,28])
#randomforest with K_scratch only
set.seed (1)
train_ks$K_Scratch=factor(train_ks$K_Scratch)
test_ks$K_Scratch=factor(test_ks$K_Scratch)
rf_ks =randomForest(K_Scratch~.,data=train_ks, importance =TRUE)
yhat.rf_ks = predict (rf_ks ,test_ks)
table(yhat.rf_ks, test_ks[,28])
mean( yhat.rf_ks!=test_ks[,28])
#randomforest with stains only
set.seed (1)
train_stains$Stains=factor(train_stains$Stains)
test_stains$Stains=factor(test_stains$Stains)
rf_stains =randomForest(Stains~.,data=train_stains, importance =TRUE)
yhat.rf_stains = predict (rf_stains ,test_stains)
table(yhat.rf_stains, test_stains[,28])
mean( yhat.rf_stains!=test_stains[,28])
#randomforest with dirt only
set.seed (1)
train_dirt$Dirtiness=factor(train_dirt$Dirtiness)
test_dirt$Dirtiness=factor(test_dirt$Dirtiness)
rf_dirt =randomForest(Dirtiness~.,data=train_dirt, importance =TRUE)
yhat.rf_dirt = predict (rf_dirt ,test_dirt)
table(yhat.rf_dirt, test_dirt[,28])
mean( yhat.rf_dirt!=test_dirt[,28])
#randomforest with bumps only
set.seed (1)
train_bumps$Bumps=factor(train_bumps$Bumps)
test_bumps$Bumps=factor(test_bumps$Bumps)
rf_bumps =randomForest(Bumps~.,data=train_bumps, importance =TRUE)
yhat.rf_bumps = predict (rf_bumps ,test_bumps)
table(yhat.rf_bumps, test_bumps[,28])
mean( yhat.rf_bumps!=test_bumps[,28])
#randomforest with other faults only
set.seed (1)
train_of$Other_Faults=factor(train_of$Other_Faults)
test_of$Other_Faults=factor(test_of$Other_Faults)
rf_of =randomForest(Other_Faults~.,data=train_of, importance =TRUE)
yhat.rf_of = predict (rf_of ,test_of)
table(yhat.rf_of, test_of[,28])
mean( yhat.rf_of!=test_of[,28])
rf.cv=randomForest(Other_Faults~.,data=train_of)   # OOB predictions act as a built-in cross-validation
table(rf.cv$predicted,train_of[,28])
r = randomForest(alldefects~., data = train2, importance =TRUE, do.trace = 100)
varImpPlot(r)
################################################################### logistic regression
#000000000000000000000000000000000000000000000000_pastry
train_pastry=train[,-c(29,30,31,32,33,34,35)]
fix(train_pastry)
test_pastry= test[,-c(29,30,31,32,33,34,35)]
attach(train_pastry)
attach(test_pastry)
log_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y
_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity,data=train
_pastry,family = "binomial")
summary(log_pastry)
log_pastry_pred = predict(log_pastry, test_pastry, type ="response")
log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment
log_pastry_pred_y[log_pastry_pred> 0.5]= 1
table(log_pastry_pred_y, test_pastry[,28])
mean(log_pastry_pred_y != test_pastry[,28])
#ROC
y=rep(0,length(log_pastry_pred_y))
y[log_pastry_pred_y==1]=1
x=rep(0,length(test_pastry[,28]))
x[test_pastry[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ PASTRY")
# cross validation
cv_pastry =
glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y
_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity,data=train
_pastry,family = "binomial")
cv.glm(train_pastry,cv_pastry,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_zs
train_zs=train[,-c(28,30,31,32,33,34,35)]
fix(train_zs)
test_zs= test[,-c(28,30,31,32,33,34,35)]
attach(train_zs)
attach(test_zs)
log_zs =
glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum
+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Lumino
sity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300,data=train_zs,family = "binomial")
summary(log_zs)
log_zs_pred = predict(log_zs, test_zs, type ="response")
log_zs_pred_y = rep(0, length(test_zs[,28])) # default assignment
log_zs_pred_y[log_zs_pred> 0.5]= 1
table(log_zs_pred_y, test_zs[,28])
mean(log_zs_pred_y != test_zs[,28])
#ROC
y=rep(0,length(log_zs_pred_y))
y[log_zs_pred_y==1]=1
x=rep(0,length(test_zs[,28]))
x[test_zs[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Z_skratch")
#CV
log_zs=step(glm(Z_Scratch~.,data=train_zs,family="binomial"),direction="backward")
cv_zs =
glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum
+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Lumino
sity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300,data=train_zs,family = "binomial")
cv.glm(train_zs,cv_zs,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_ks
train_ks=train[,-c(28,29,31,32,33,34,35)]
test_ks= test[,-c(28,29,31,32,33,34,35)]
attach(train_ks)
attach(test_ks)
log_ks =
glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+
X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index
+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness,data=train_ks,family =
"binomial")
summary(log_ks)
log_ks_pred = predict(log_ks, test_ks, type ="response")
log_ks_pred_y = rep(0, length(test_ks[,28])) # default assignment
log_ks_pred_y[log_ks_pred> 0.5]= 1
table(log_ks_pred_y, test_ks[,28])
mean(log_ks_pred_y != test_ks[,28])
log_ks=step(glm(K_Scratch~.,data=train_ks,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_ks_pred_y))
y[log_ks_pred_y==1]=1
x=rep(0,length(test_ks[,28]))
x[test_ks[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ k_skratch")
#CV
cv_ks =
glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+
X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index
+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness,data=train_ks,family =
"binomial")
cv.glm(train_ks,cv_ks,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_Stains
train_stains=train[,-c(28,29,30,32,33,34,35)]
test_stains= test[,-c(28,29,30,32,33,34,35)]
log_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_X_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")
summary(log_stains)
log_stains_pred = predict(log_stains, test_stains, type ="response")
log_stains_pred_y = rep(0, length(test_stains[,28])) # default assignment
log_stains_pred_y[log_stains_pred> 0.5]= 1
table(log_stains_pred_y, test_stains[,28])
mean(log_stains_pred_y != test_stains[,28])
log_stains=step(glm(Stains~.,data=train_stains,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_stains_pred_y))
y[log_stains_pred_y==1]=1
x=rep(0,length(test_stains[,28]))
x[test_stains[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Stains")
# cross validation
cv_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_X_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")
cv.glm(train_stains,cv_stains,K=10)$delta[1]
#000000000000000000000000000000000000000000000000_Dirtiness
train_dirt=train[,-c(28,29,30,31,33,34,35)]
test_dirt= test[,-c(28,29,30,31,33,34,35)]
log_dirt =
glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_M
aximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index,data=train_dirt,family = "binomial")
summary(log_dirt)
log_dirt_pred = predict(log_dirt, test_dirt, type ="response")
log_dirt_pred_y = rep(0, length(test_dirt[,28])) # default assignment
log_dirt_pred_y[log_dirt_pred> 0.5]= 1
table(log_dirt_pred_y, test_dirt[,28])
mean(log_dirt_pred_y != test_dirt[,28])
log_dirt=step(glm(Dirtiness~.,data=train_dirt,family="binomial"),direction="backward")
# cross validation
cv_dirt =
glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_M
aximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index,data=train_dirt,family = "binomial")
cv.glm(train_dirt,cv_dirt,K=10)$delta[1]
y=rep(0,length(log_dirt_pred_y))
y[log_dirt_pred_y==1]=1
x=rep(0,length(test_dirt[,28]))
x[test_dirt[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Dirtiness")
#000000000000000000000000000000000000000000000000_Bumps
train_bumps=train[,-c(28,29,30,31,32,34,35)]
test_bumps= test[,-c(28,29,30,31,32,34,35)]
log_bumps =
glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate
_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+
Y_Minimum+TypeOfSteel_A300,data=train_bumps,family = "binomial")
summary(log_bumps)
log_bumps_pred = predict(log_bumps, test_bumps, type ="response")
log_bumps_pred_y = rep(0, length(test_bumps[,28])) # default assignment
log_bumps_pred_y[log_bumps_pred> 0.5]= 1
table(log_bumps_pred_y, test_bumps[,28])
mean(log_bumps_pred_y != test_bumps[,28])
log_bumps=step(glm(Bumps~.,data=train_bumps,family="binomial"),direction="backward")
# cross validation
cv_bumps =
glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate
_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+
Y_Minimum+TypeOfSteel_A300,data=train_bumps,family = "binomial")
cv.glm(train_bumps,cv_bumps,K=10)$delta[1]
y=rep(0,length(log_bumps_pred_y))
y[log_bumps_pred_y==1]=1
x=rep(0,length(test_bumps[,28]))
x[test_bumps[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ bumps")
#000000000000000000000000000000000000000000000000_otherfaults
train_of=train[,-c(28,29,30,31,32,33,35)]
test_of= test[,-c(28,29,30,31,32,33,35)]
log_of =
glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Convey
or+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel
_Plate_Thickness,data=train_of,family = "binomial")
summary(log_of)
log_of_pred = predict(log_of, test_of, type ="response")
log_of_pred_y = rep(0, length(test_of[,28])) # default assignment
log_of_pred_y[log_of_pred> 0.5]= 1
table(log_of_pred_y, test_of[,28])
mean(log_of_pred_y != test_of[,28])
log_of=step(glm(Other_Faults~.,data=train_of,family="binomial"),direction="backward")
#ROC
y=rep(0,length(log_of_pred_y))
y[log_of_pred_y==1]=1
x=rep(0,length(test_of[,28]))
x[test_of[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ other faults")
# cross validation
cv_of =
glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Convey
or+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel
_Plate_Thickness,data=train_of,family = "binomial")
cv.glm(train_of,cv_of,K=10)$delta[1]
##########PCA
#PCA on complete data set
datap=data[,-(28:35)]
fit <- princomp(datap, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
axes <- predict(fit, newdata = datap)
head(axes, 4)
fix(axes)
data1=axes[,1:7]
write.csv(data1,file="pcadata.csv")
data2=data.frame(data,data1)
write.csv(data2,file="comb_data.csv")
#SVM
install.packages("e1071")
library(e1071)
svm.fit=svm(alldefects~.,data=train2,type="C",kernel="polynomial",degree=3, cost=15)
summary(svm.fit)
predicted=predict(svm.fit,test2)
table(predicted,test.alldefects)
mean(predicted!=test.alldefects)
plot(svm.fit,train2,Length_of_Conveyor~X_Maximum,slice=list(X_Perimeter=3,Y_Perimeter=4),svSymbol=1,dataSymbol
=2,color.palette=terrain.colors )
# ROC
predictions=as.numeric(predicted,type="response")
multiclass.roc(test.alldefects,predictions,plot=T,main="ROC for SVM")
#ANN
library(nnet)
train.nnet<-nnet(alldefects~.,data=train2,size=20,rang=0.1,Hess=FALSE,decay=0.001,maxit=10000)
test.nnet<-predict(train.nnet,test2,type=("class"))
table(test2$alldefects,test.nnet)
mean(test.nnet!=test2$alldefects)
library(pROC)
predictions=as.numeric(test.nnet,type="response")
multiclass.roc(test2$alldefects,predictions,plot=T,main="ROC for ANN")
#read data stored in CSV file.
data=read.csv("Steel_faults.csv",header=TRUE)
attach(data)
x=data[,1:27] # input variables
y=data[,28:34] # response variables
n=1941 # total number of observations
n1=round(n*0.7) # number of observations for training
samp=sample(1:n,n1,replace=FALSE) # to select random observation
## user-defined function to obtain a confusion matrix from one-hot (0/1) response columns
test.cl = function(true, pred) {
true = max.col(true)   # index of the 1 in each row = true class
cres = max.col(pred)   # column with the largest predicted value = predicted class
table(true, cres)
}
###another package for NNA
install.packages("RSNNS")
library(RSNNS)
model=mlp(x[samp,], y[samp,], size=c(10,10,5),linOut=F)
model=mlp(train2[,-28], train2[,28], size=2,linOut=F)
#library(devtools)
#plot.nnet(model)
test.cl(y[-samp,], predict(model, x[-samp,])) #confusion matrix for testing data
test.cl(y[samp,],fitted.values(model)) #confusion matrix for training data
#C50
library(C50)   # provides C5.0()
crx <- data[ sample( nrow( data ) ), ]   # shuffle the rows
X <- crx[,1:27]
y <- crx[,35]
trainx <- X[1:1358,]
trainy <- y[1:1358]
testx <- X[1359:1941,]
testy <- y[1359:1941]
model <- C5.0( trainx, trainy, trials=75 )
p <- predict( model, testx, type="class" )
sum( p == testy ) / length( p )
table(p, testy)
mean(p != testy)
predictions_c5 <- as.numeric(p,type="response")
multiclass.roc(testy, predictions_c5, plot=T)
 
Failure analysis of polymer and rubber materials
Failure analysis of polymer and rubber materialsFailure analysis of polymer and rubber materials
Failure analysis of polymer and rubber materials
 
Assembly Root Cause Analysis A Way To Reduce Dimensional Variation In Assemb...
Assembly Root Cause Analysis  A Way To Reduce Dimensional Variation In Assemb...Assembly Root Cause Analysis  A Way To Reduce Dimensional Variation In Assemb...
Assembly Root Cause Analysis A Way To Reduce Dimensional Variation In Assemb...
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 

ISEN 613_Team3_Final Project Report

  • 3. INTRODUCTION
Importance of the problem: The present era is the era of quality. In today's world of cut-throat competition and large-scale production, only those manufacturers survive who can provide good-quality products and services that meet or exceed the expectations of the customers. Manufacturing processes must be monitored continuously so that any change in the process can be identified quickly and rectified to prevent production loss. In manufacturing, operations managers can use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global manufacturers across industries and geographies now have an abundance of real-time shop-floor data and the capability to conduct such sophisticated statistical assessments; they are aggregating previously isolated data sets and analyzing them to reveal important insights.
In the steel industry, and specifically in alloy steel, producing defective products imposes a high cost on the manufacturer. One common fault in producing low-carbon steel grades is the pits-and-blister defect. Removing it requires grinding the surface of the steel product, which wastes time and increases the cost of production. The incidence of defects is related to numerous factors, including material composition and the production processes. If we can correctly predict these defects from the important parameters, then we know which parameters must be controlled, and how tightly, to minimize the defects and hence the associated cost. The problem at hand in this project deals with data from the steel industry, and the results obtained can be used to predict the faults and implement the necessary changes.
Objective: This project deals with the prediction of faults that can occur in the manufacturing of steel plates, taking into consideration the available historical data. The main objective is to compare classification models built using different classification techniques and propose one final model with the least misclassification rate (highest prediction accuracy). In this project, the results of classification techniques such as Linear Discriminant Analysis, Logistic Regression (individual and multivariate), Random Forests (individual and multivariate), single decision trees, bagging, Support Vector Machines, and Artificial Neural Networks are compared and the best model is proposed. The model building also uses Principal Component Analysis to reduce the dimensions of the given data.
Scope of Work (Gantt chart spanning 13th November to 15th December 2015, in roughly four-day windows):
1. Retrieving data and understanding its details
2. Literature review and selecting a suitable supervised neural network method
3. Model building using classification techniques learned in class
4. Model building using the selected neural network method
5. Predicting results and concluding the best modeling method
6. Report making and documentation
  • 4. LITERATURE REVIEWS
The following papers were selected:
1. Steel Plates Faults Diagnosis with Data Mining Models. Fakhr, M., Elsayad, A. M. (2012). (Reviewed by Naman Kapoor)
2. Machine Learning Techniques for Anomaly Detection: An Overview. Omar, S., Ngadi, A. and Jebur, H. H. (2013). (Reviewed by Naman Kapoor)
3. neuralnet: Training of Neural Networks. Günther, F. and Fritsch, S. (2008). (Reviewed by Omkar Deshpande)
4. An Empirical Comparison of Supervised Learning Algorithms. Caruana, R., & Niculescu-Mizil, A. (2006). (Reviewed by Omkar Deshpande)
5. A SVM-based pipeline leakage detection and pre-warning system. Qu, Z., Feng, H., Zeng, Z., Zhuge, J. and Jin, S. (2010). (Reviewed by Rahul Garg)
6. Steel faults diagnosis under predictive analysis. Jain, S., Azad, C., Jha, V. K. (2013). (Reviewed by Rahul Garg)
7. Classification of EEG signals using neural network and logistic regression. Subasi, A. and Erçelebi, E. (2005). (Reviewed by Rahul Garg)
8. A study of decision tree ensembles and feature selection for steel plates faults detection. Halawani, M. (2014). (Reviewed by Vinayak Nair)
9. Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Tsoi, A.C., Pearson, R.A. (1991). (Reviewed by Vinayak Nair)
10. Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Pohar, M., Blas, M., & Turk, S. (2004). (Reviewed by Vinayak Nair)
Combined takeaways:
- Advanced decision trees are extremely efficient modeling techniques for multiclass classification problems.
- Artificial Neural Networks are very powerful and complex algorithms, but they have issues of convergence and variable selection that need to be addressed.
- Supervised machine learning techniques significantly outperform unsupervised ones on multiclass classification problems.
- LDA is advisable over logistic regression when the variables are normally distributed.
  • 5. Reviewed by: Naman Kapoor
Steel Faults Diagnosis with Data Mining Models
Mahmoud Fakhr, Alaa M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science, vol. 8, no. 4, pp. 506-514, 2012.
Objective: The key problem this paper addresses is the construction of an appropriate intelligent data mining model for anomaly detection in the manufacturing industry on a particular dataset. Addressing this problem matters because intelligent fault-diagnostic models built with data mining can enhance the quality of manufacturing and lessen the cost of product testing; they not only help avoid product quality problems but also facilitate preventive maintenance. The key objective is to use predictive analytics to select the best classification model for the chosen steel plate fault detection dataset by comparing different models using statistical measures. The authors evaluate the performance of three popular and effective data mining models (supervised learning techniques) on the dataset and found that the C5.0 decision tree with boosting achieved the best results, implying that decision trees have a greater impact on fault diagnosis than the other supervised learning techniques considered.
Approach: The authors applied three multiclass classification techniques, namely the C5.0 decision tree (C5.0 DT) with boosting, the multilayer perceptron neural network (MLPNN) with pruning, and logistic regression (LR) with forward stepwise selection, to the steel plates fault dataset obtained from the University of California at Irvine (UCI) machine learning repository. These models were formulated to diagnose seven commonly occurring steel plate faults: Pastry, Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps and other faults. A brief description of the techniques follows.
I. C5.0 decision tree: The C5.0 DT algorithm is an improved version of the C4.5 and ID3 algorithms. C5.0 uses information gain as a measure of purity, which is based on the notion of entropy. This method proved to be a major takeaway for our project. The three methods used in C5.0 tree construction are boosting, pruning and winnowing. While boosting and pruning were known to us, we were introduced here to the concept of winnowing, which pre-selects the subset of attributes used to construct the tree; winnowing ensures that irrelevant attributes are excluded from the tree-building process. The authors used only 13 of the 27 attributes in the dataset to build the C5.0 tree.
II. Multilayer perceptron neural network (MLPNN): Artificial Neural Networks (ANNs) are biologically motivated and highly sophisticated analytical techniques capable of modeling extremely complex nonlinear functions. The MLPNN is considered a powerful function approximator for prediction and classification problems; its structure is organized into layers of neurons: an input layer, an output layer and one or more hidden layers. The MLPNN was trained using the back-propagation (BP) technique. In this study the network was trained using a pruning approach, which starts with a large network and removes (prunes) the weakest neurons in the hidden and input layers as training proceeds.
  • 6. III. Logistic regression: Logistic regression is a nonlinear regression technique for predicting a dichotomous (binary) class attribute in terms of the predictive attributes. The algorithm does not predict the class attribute directly; it predicts the odds of its occurrence through the log-likelihood (logit) function.
Results
The performance of each model was evaluated using three statistical measures: classification accuracy (the complement of the misclassification error rate), sensitivity and specificity. These measures are defined using the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). The performance charts in the paper show that the C5.0 learning algorithm is the best model on both the training and test subsets, the neural network model is second best, and logistic regression is the worst.
Summary
The major takeaways from this study were as follows:
- Advanced decision trees (C5.0 DT) are a very powerful data mining tool for predictive analytics of multiclass anomaly detection, with very high accuracy.
- The multilayer perceptron neural network with back-propagation, although standard and simple to implement, is a complex algorithm that suffers from convergence issues and requires initialization and adjustment of many individual parameters to optimize its performance.
- Logistic regression, while a very powerful modeling tool, assumes that the class attribute (the log odds, not the event itself) is linear in the coefficients of the predictive attributes. The right inputs must be chosen along with their functional relationship to the class attribute.
- The amount and quality of data, and the measuring process, are key components of diagnostic accuracy.
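To make the logit idea concrete, here is a minimal sketch (not the authors' code) of fitting a binary logistic model for one fault class in R; the file name is a placeholder, and the predictor names follow the dataset description given later in this report:

# Minimal sketch: binary logistic regression for a single fault class.
# The file name is hypothetical; column names follow the UCI steel-plates data.
steel <- read.csv("steel_plates_faults.csv")

fit <- glm(Bumps ~ Pixels_Areas + X_Perimeter + Steel_Plate_Thickness,
           data = steel, family = binomial)

# glm models the log odds: log(p / (1 - p)) = b0 + b1*x1 + ...
p_hat <- predict(fit, type = "response")   # predicted P(fault)
pred  <- ifelse(p_hat > 0.5, 1, 0)         # classify at the 0.5 cut-off
mean(pred != steel$Bumps)                  # training misclassification rate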
  • 7. Reviewed by: Naman Kapoor
Machine Learning Techniques for Anomaly Detection: An Overview
S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
Objective: The key problem this paper addresses is anomaly detection in the industry, which the authors try to aid with machine learning techniques. Addressing the problem is important because, even after many years of research, the anomaly detection community still confronts difficult problems. The key objective of the paper is to present an overview of research directions for applying supervised and unsupervised methods to the problem of anomaly detection. The authors provide a general architecture of anomaly intrusion detection systems and conduct detailed discussions of the various machine learning techniques under supervised and unsupervised learning, including their strengths and weaknesses for anomaly detection.
Approach: The authors compare different supervised and unsupervised machine learning techniques and bring out their strengths and weaknesses for anomaly detection. An overview of the two approaches:
I. Supervised anomaly detection: Supervised methods (also known as classification methods) require a labelled training set containing both normal and anomalous samples to construct the predictive model. Theoretically, supervised methods provide a better detection rate than semi-supervised and unsupervised methods, since they have access to more information. However, some technical issues make these methods less accurate than they are supposed to be.
II. Unsupervised anomaly detection: These techniques need no training data. Instead, they rest on two basic assumptions. First, they presume that most network connections are normal traffic and only a very small percentage is abnormal. Second, they anticipate that malicious traffic is statistically different from normal traffic. Under these assumptions, groups of similar instances that appear frequently are assumed to be normal traffic, while infrequent instances that differ considerably from the majority are regarded as malicious.
The machine learning techniques for anomaly detection compared in the paper are:
Supervised machine learning: K-Nearest Neighbours, Neural Networks, Decision Trees, Support Vector Machines, meta-learning techniques.
Unsupervised machine learning: Self-Organising Maps, K-means clustering, Fuzzy C-means clustering, Expectation-Maximization.
  • 8. Results
The comparison of the techniques' strengths and weaknesses is presented in a table in the paper (not reproduced here).
Summary
The major takeaways from this review were:
- Machine learning techniques have received considerable attention among anomaly detection researchers.
- Anomaly detection comprises both supervised and unsupervised techniques.
- The experiments demonstrated that supervised learning methods significantly outperform unsupervised ones if the test data contains no unknown attacks.
- Among the supervised methods, the best performance is achieved by non-linear methods such as SVM, the multi-layer perceptron, and rule-based methods.
- Among the unsupervised techniques, K-means, SOM and one-class SVM achieved better performance than the others, although they differ in their ability to detect all attack classes efficiently.
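To illustrate the unsupervised assumptions above (frequent, similar instances are normal; rare, distant ones are anomalous), the following small sketch on synthetic data scores points by their distance to their k-means cluster centre; the data, cluster count and 95% threshold are all illustrative:

# Sketch of the unsupervised assumption: points far from every cluster
# centre are flagged as anomalies. The data here is synthetic.
set.seed(1)
normal  <- matrix(rnorm(400, mean = 0), ncol = 2)   # bulk of "normal" traffic
outlier <- matrix(rnorm(10,  mean = 6), ncol = 2)   # a few rare points
x <- rbind(normal, outlier)

km <- kmeans(x, centers = 3, nstart = 20)

# distance from each point to its assigned cluster centre
d <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))
anomalies <- which(d > quantile(d, 0.95))           # top 5% most distant points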
  • 9. An Empirical Comparison of Supervised Learning Algorithms
Reviewed by: Omkar Deshpande
Reference: Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning (ICML '06). http://dx.doi.org/10.1145/1143844.1143865.
Objective: The objective of this paper is to give an empirical comparison of supervised learning algorithms such as SVMs, neural nets, logistic regression, naive Bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees and boosted stumps. The motivation is that the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90s. The comparison is based on a variety of performance criteria such as precision/recall, ROC area, lift, accuracy, F-score and squared error. The study found that boosted trees were the best learning algorithm overall, with random forests a close second, followed by un-calibrated bagged trees, calibrated SVMs and un-calibrated neural nets; the poorest models were naive Bayes, logistic regression, decision trees and boosted stumps.
Approach: The datasets used (among them ADULT, COV TYPE and LETTER) are from the UCI Repository (Blake & Merz, 1998). COV TYPE was converted to a binary problem by treating the largest class as positive and the rest as negative. A random sample of 5000 cases was taken as the training set and the rest as the test set; of those 5000 cases, 4000 were used to train the model and 1000 to calibrate it. Various criteria such as ROC area, accuracy and lift were then calculated for the different algorithms, yielding a column giving the mean normalized score over the eight metrics when model selection is done by "cheating" and looking at the final test sets. The means in this column represent the best performance that could be achieved with each learning method if model selection were done optimally.
Results: The comparison shows that models which perform best on one problem can perform worse than average on another. For example, the best models on ADULT are calibrated boosted stumps, random forests and bagged trees, while boosted trees perform much worse there. Bagged trees and random forests also perform very well on MG and SLAC. On MEDIS, the best models are random forests, neural nets and logistic regression. The only models that never exhibit excellent performance on any problem are naive Bayes and memory-based learning. Overall, boosted trees were the best learning algorithm, random forests a close second, followed by un-calibrated bagged trees, calibrated SVMs and un-calibrated neural nets; naive Bayes, logistic regression, decision trees and boosted stumps performed poorest. The table with the results used for this comparison is given in the paper.
  • 10. Summary: Boosted trees were the best learning algorithm overall. Random forests are a close second, followed by un-calibrated bagged trees, calibrated SVMs and un-calibrated neural nets. The models that performed poorest were naive Bayes, logistic regression, decision trees and boosted stumps. This implies that a model trained using boosted trees will usually give the best predictive performance compared with methods like random forests or SVMs. But this is not always the case: the evaluation metric must be chosen carefully, and the technique selected that works best for that metric. For example, precision/recall measures are used in information retrieval, medicine prefers ROC area, and lift is appropriate for some marketing tasks. For a medical application, the model with the best ROC performance would therefore be the best choice.
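Since the choice of metric drives model selection, the sketch below computes accuracy, sensitivity, specificity and ROC area for one toy classifier in R. It assumes the pROC package is available; the labels and scores are simulated, not from the paper:

# Sketch: evaluating one classifier under several criteria, since the "best"
# model depends on the chosen metric. Labels and scores are toy data.
library(pROC)                                      # assumes pROC is installed

set.seed(2)
labels <- rbinom(200, 1, 0.4)                      # true classes (0/1)
scores <- labels * 0.3 + runif(200)                # toy predicted scores
pred   <- factor(as.integer(scores > 0.5), levels = c(0, 1))
truth  <- factor(labels, levels = c(0, 1))

tab <- table(Predicted = pred, Actual = truth)
accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["1", "1"] / sum(tab[, "1"])     # TP / (TP + FN)
specificity <- tab["0", "0"] / sum(tab[, "0"])     # TN / (TN + FP)
auc_value   <- auc(roc(labels, scores))            # area under the ROC curve
c(accuracy, sensitivity, specificity, auc_value)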
  • 11. Reviewed by: Omkar Deshpande
Training of Neural Networks
Reference: Frauke Günther and Stefan Fritsch, neuralnet: Training of Neural Networks. The R Journal, Vol. 2/1, June 2010.
Objective: The objective of this paper is to describe the algorithm used in the neuralnet package, demonstrate its application in R, and discuss the advantages of the neuralnet package over generalized linear models. The paper presents the details of the neuralnet package developed by the authors and gives a working example using the infert dataset in R. Artificial neural networks can approximate any complex functional relationship between input and output variables. Unlike generalized linear models, it is not necessary to pre-specify the type of relationship between covariates and response variables as, for instance, a linear combination. This makes artificial neural networks a valuable statistical tool; they are direct extensions of GLMs and can be applied in a similar manner.
Approach: The authors first discuss the algorithm underlying the neuralnet package, and then the training of a neuralnet model in R, using the infert dataset. The number of hidden neurons is chosen in relation to the needed complexity; a neural network with, for example, two hidden neurons is trained. The results of backprop, nnet and neuralnet are then compared. The paper also discusses additional features, such as the compute and confidence.interval functions, that ship with the neuralnet package.
Results: Being an expository paper, it discusses the functions available in the neuralnet package and how to use them in R. A comparative result provided in the paper is the comparison with the nnet package: neural networks are trained with the same parameter settings using neuralnet with algorithm = "backprop" and using nnet. nn.bp and nn.nnet show equal results; both training processes last only a few iteration steps and the error is approximately 158. In this small comparison, the model fit is thus less satisfying than that achieved by resilient backpropagation.
  • 12. Summary: The paper introduced the multilayer perceptron and supervised learning, and presented the neuralnet package available in R for modeling functional relationships between covariates and response variables. neuralnet contains a very flexible function that trains multilayer perceptrons on a given data set in the context of regression analyses. Most parameters can be easily adapted; for example, the activation function and the error function can be chosen arbitrarily and defined through the usual definition of functions in R.
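A minimal sketch of the workflow the paper describes, using the neuralnet interface on the infert data (two hidden neurons, cross-entropy error, and a backprop variant for comparison); the seed and predictor subset are our choices:

# Sketch of the paper's example: a two-hidden-neuron perceptron on infert.
library(neuralnet)

set.seed(3)
nn <- neuralnet(case ~ age + parity + induced + spontaneous,
                data = infert, hidden = 2, err.fct = "ce",
                linear.output = FALSE)       # classification, not regression

head(nn$net.result[[1]])                     # fitted probabilities (training rows)

# predictions for new rows via compute()
pred <- compute(nn, infert[1:5, c("age", "parity", "induced", "spontaneous")])
pred$net.result

# traditional backpropagation needs an explicit learning rate
nn.bp <- neuralnet(case ~ age + parity + induced + spontaneous,
                   data = infert, hidden = 2, algorithm = "backprop",
                   learningrate = 0.01, err.fct = "ce", linear.output = FALSE)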
  • 13. Reviewed by: Rahul Garg
A SVM-BASED PIPELINE LEAKAGE DETECTION AND PRE-WARNING SYSTEM
Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system", Measurement, vol. 43, no. 4, pp. 513-519, 2010.
Objective: This paper addresses the detection of pipeline leakages, which may occur for various reasons such as manual digging and illegal construction. It indicates the effectiveness of SVM over traditional machine learning techniques, which are based on the assumption that unlimited training data is available. Gas leakages are a concern for industry, as they lead not only to huge monetary losses but can also have tragic outcomes such as outbreaks of disease and even deaths; the timely detection of suspected leakages therefore benefits both industry and the general public. The objective is a new pipeline leakage detection and pre-warning system that monitors and locates possible abnormal events (e.g., manual digging above a pipeline or illegal construction, which might cause a leakage) along the pipeline before a leakage takes place. The authors employ SVM as the classifier to recognize these abnormal events. Three cases (gas leakage, manual digging, and human walking above the pipeline) were created, and a series of experimental trials was used to train the model; the model was then used to classify abnormal events and provided quite accurate results. The authors found that SVM can be a much better and more accurate technique for predicting gas leakages along pipelines than the empirical risk minimization (ERM) method. This implies that although SVM is a comparatively new technique, it is quite accurate for predictive analytics in multiclass classification problems.
Approach: The authors followed a multiclass classification approach to predictive analytics. Since no historical data was available, they collected training data by conducting trials of two types: abnormal-event identification trials and abnormal-event location trials. Three cases, namely gas leakage, manual digging, and human walking above the pipeline, were created, with eight predictors. Twenty samples were collected from each case at random for training, and ten samples from each case were used to test the trained SVM model. The misclassification rate on the test data indicates how accurately the model performs and whether it can be deployed in actual practice. For the training process, the "one-against-one" method was employed. The trained multi-class SVM classifier is shown in a figure in the paper: the two axes are the first two of the eight predictors, and the circled data points mark the support vectors.
Results:
  • 14. The SVM detection results recognized the correct cause of abnormality more than 95% of the time and located abnormal events quite accurately. The paper's figure shows the prediction results, where 1, 2 and 3 are the three categories of abnormality; only sample 12 was recognized incorrectly.
Summary: This paper presented pipeline leakage detection as a problem of multiclass classification and predictive analysis. The major takeaway from this review is that SVM can work quite accurately for multiclass classification, especially where the training data is not very large, as in this leakage-detection case. The technique is far better than traditional machine learning methods such as ERM, and among the methods available for multi-class classification, the "one-against-one" SVM method is more suitable for practical use than the others.
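The following sketch shows one-against-one multi-class SVM classification in R; e1071's svm() builds the pairwise classifiers internally, which is the scheme the paper recommends. The iris data stands in for the pipeline event data, which is not available here:

# Sketch of "one-against-one" multi-class SVM with the e1071 package.
library(e1071)

set.seed(4)
idx   <- sample(nrow(iris), 100)        # iris stands in for the event data
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- svm(Species ~ ., data = train, kernel = "radial")
pred <- predict(fit, test)

table(Predicted = pred, Actual = test$Species)
mean(pred != test$Species)              # test misclassification rate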
  • 15. Reviewed by: Rahul Garg
STEEL FAULTS DIAGNOSIS USING PREDICTIVE ANALYSIS
Jain, S., Azad, C., Jha, V. K., "Steel faults diagnosis under predictive analysis", International Journal of Computer Engineering and Applications, Volume IV, Issue II/III, Oct. 2013.
Objective: The key problem this paper discusses is the generation of various types of defects in manufactured steel plates, especially those made of alloyed steel. Addressing this problem is imperative because rectifying these defects by grinding or milling wastes time and raises the cost of production, which could be prevented. The paper performs steel fault diagnosis using predictive analytics so that the defect generation rate can be minimized by finely tweaking the factors responsible for it. The authors use three classification techniques, namely decision trees, multilayer perceptron neural networks and logistic regression, to develop a model that diagnoses the faults as accurately as possible. Decision trees provided the best results, with the lowest misclassification rate, implying that a decision tree model is a good option for steel fault diagnosis with data mining techniques.
Approach: The data set, taken from the UCI repository, classifies steel plate faults into seven different types, making this a multiclass classification problem. The authors tried various classification methods (decision trees, multilayer perceptron neural networks and logistic regression) and selected the best based on the misclassification rate; the C4.5 boosting algorithm with 10 trials was used for the decision trees. After all the models were built and one of the three was selected, a genetic algorithm was used to find the best optimal solution. It works as follows (a minimal sketch of this loop appears after the results below):
1. Initialize a random population of n chromosomes.
2. Evaluate the fitness value f(x) of each chromosome x in the population.
3. Create a new population by repeating the following steps:
- Select two parent chromosomes from the population according to fitness (chromosomes with better fitness have a bigger chance of being chosen).
- Cross over the parents to form a new offspring; if no crossover occurs, the offspring is an exact copy of the parents.
- Mutate the new offspring at each locus.
- Place the new offspring in the new population.
4. Use the newly generated population for a further execution.
5. If the end condition is fulfilled, stop and return the best solution in the current population.
6. Otherwise go to step 2.
The best optimal solution chosen in this case was solution number seven, based on the output results of the genetic algorithm.
  • 16. Results: The results of this review are shown in the table below:

S No.  Method                  Classification Accuracy   Classification Error
1      Decision Tree           94.38 %                   5.62 %
2      Multilayer Perceptron   83.87 %                   16.13 %
3      Logistic Regression     72.64 %                   27.36 %

The table shows that, of the three classification techniques used, decision trees gave the best results, with the lowest misclassification rate.
Summary: This review provided insights into the methods that can be tried for a multiclass predictive analytics problem and the ways those models can be improved: the C4.5 algorithm can be used to improve decision trees, and a pruning algorithm can improve the multilayer perceptron model. Another important takeaway was that boosted decision trees with the C4.5 package performed best of the three models in classifying the various steel defects, especially when the results have to be interpreted by humans.
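A toy sketch of the genetic-algorithm loop listed in the review (fitness-proportional selection, one-point crossover, per-locus mutation); the fitness function, population size and mutation rate are stand-ins, not the authors' settings:

# Toy genetic algorithm over binary chromosomes, mirroring the listed steps.
set.seed(5)
n_pop <- 20; n_gene <- 10; p_mut <- 0.05; n_gen <- 50

fitness <- function(ch) sum(ch)                    # placeholder fitness f(x)
pop <- matrix(sample(0:1, n_pop * n_gene, TRUE), nrow = n_pop)

for (g in 1:n_gen) {
  f <- apply(pop, 1, fitness)
  new_pop <- matrix(0L, n_pop, n_gene)
  for (i in 1:n_pop) {
    # fitter chromosomes have a bigger chance of being chosen as parents
    parents <- pop[sample(n_pop, 2, prob = f + 1e-9), ]
    cut     <- sample(1:(n_gene - 1), 1)           # one-point crossover
    child   <- c(parents[1, 1:cut], parents[2, (cut + 1):n_gene])
    flip    <- runif(n_gene) < p_mut               # mutate at each locus
    child[flip] <- 1L - child[flip]
    new_pop[i, ] <- child
  }
  pop <- new_pop
}
max(apply(pop, 1, fitness))                        # best solution found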
  • 17. Reviewed by: Rahul Garg
CLASSIFICATION OF EEG SIGNALS USING NEURAL NETWORK AND LOGISTIC REGRESSION
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
Objective: This paper concerns the detection of epileptiform discharges in the EEG using logistic regression and artificial neural network models. Epileptic seizures can occur in many different ways; EEG signals carry a lot of information, and accurate classification and evaluation of these signals could be a breakthrough in the medical domain. The paper compares the traditional logistic regression method with the more advanced neural network techniques as mathematical tools for developing classifiers for the detection of epileptic seizure in multi-channel EEG. The authors developed two models: logistic regression, and a multilayer perceptron neural network (MLPNN) trained with back-propagation and the Levenberg-Marquardt training algorithm, and then compared the two. They concluded that the neural network proved to be the better model. This implies that the MLPNN is more accurate and easier to build, since for developing logistic regression equations one starts with no knowledge of the best combination of the parameters or of the shape and degree of non-linearity required to produce an optimal model.
Approach: The EEG data used in this study came from 24-h EEG recordings of both epileptic patients and normal subjects. To assess the performance of the classifier, 500 EEG segments were selected containing spike-and-wave complexes, artifacts and normal background EEG. Twenty absence seizures (petit mal) from five epileptic patients admitted for video-EEG monitoring were analyzed, and each signal was inspected by experienced neurologists to score epileptic and normal signals. Wavelet transform analysis was then applied, as it captures transient features and localizes them accurately in both time and frequency. Logistic regression and neural network classifiers were developed by randomly selecting 300 of the 500 available examples as the training set, keeping the remaining 200 for testing and validating the developed models. Selection of the optimal network was based on monitoring the variation of the error and accuracy measures as the hidden layer was expanded and across training cycles; the sum of squared errors was used to choose the optimal model, and the optimum number of hidden nodes was found to be 21. Finally, after testing both developed models, the better one was chosen based on the misclassification error rate and a sensitivity-specificity analysis. A table in the paper shows the division of the collected data into training and testing sets.
  • 18. Results: A table in the paper compares the two models on the test data in terms of classification accuracy and a sensitivity-specificity analysis; it shows clearly that the MLPNN has higher accuracy and a larger area under the ROC curve.
Summary: This paper gave a better understanding of neural network analysis, a technique beyond those learned in class. It provided insights into the procedure for choosing the optimal number of hidden nodes and into the limitations of the logistic regression model. Another major takeaway is the evaluation and comparison of the traditional logistic regression classifier with the much newer multilayer perceptron neural network analysis. Last but not least, the paper introduced wavelet transform analysis, which is very effective in capturing transient features and localizing them in both the time and frequency domains.
  • 19. Reviewed by: Vinayak Nair
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAULTS DETECTION
Halawani, M. (2014). A study of decision tree ensembles and feature selection for steel plates faults detection. International Journal of Technical Research and Applications, 2(4), 127-131.
1. Objective: Detection of steel plate defects is a serious problem in the industry; it is often performed by human operators, which is expensive and slow, and it can be tackled by automating the process. The paper shows the application of decision tree ensembles to fault detection. Several decision tree ensembles (random subspace, bagging, AdaBoost.M1 and random forests) are used to perform steel plate fault detection, and the best method for this problem is identified. The effect of removing insignificant features is also studied. The results suggest that AdaBoost.M1 and random subspace are the best ensemble methods, with prediction accuracy greater than 80%.
2. Approach: Random subspace, bagging, AdaBoost.M1 and random forest classifier ensembles were run on the UCI dataset and the prediction accuracies were calculated. Different selections of predictors were also tried.
3. Results: The paper tabulates the classification errors for the methods under three predictor selections: with all predictors included, with the 20 most important predictors, and with the 15 most important predictors.
  • 20. Random subspace performed best for the first and third predictor selections, and AdaBoost.M1 came first for the second. When the best 20 predictors were selected, the results of all methods except random subspace improved. With only 15 predictors, the performance of the models dropped, indicating that some important predictors had been left out.
4. Summary: The single decision tree model always gave worse results than random subspace, AdaBoost.M1, bagging and random forests, which means we will have to use decision tree ensembles in this project as well. Feature selection is also very important, as selecting the most important predictors reduces the error rate.
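A small sketch of importance-based feature selection of the kind studied in the paper: rank predictors with a random forest, then refit on the top few. The data and the cut-off of two predictors are illustrative, not the paper's setup:

# Sketch: rank predictors by random-forest importance, then refit on the best.
library(randomForest)

set.seed(6)
rf_all <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp   <- importance(rf_all, type = 1)          # mean decrease in accuracy
top_k <- names(sort(imp[, 1], decreasing = TRUE))[1:2]   # keep the best 2 here

form   <- reformulate(top_k, response = "Species")
rf_top <- randomForest(form, data = iris)
rf_top$confusion                               # out-of-bag confusion matrix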
  • 21. Reviewed by: Vinayak Nair
COMPARISON OF THREE CLASSIFICATION TECHNIQUES, CART, C4.5 AND MULTI-LAYER PERCEPTRONS
Tsoi, A.C., Pearson, R.A. (1991). Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Advances in Neural Information Processing Systems 3. Morgan Kaufmann Publishers, San Mateo, CA. 963-969.
1. Objective: There are several popular algorithms, such as CART (classification and regression trees), the MLP (multilayer perceptron) and C4.5, and there is a need to know how these methods compare against each other. By comparing the methods on constrained data, we can make qualitative statements about them; addressing this question can help practitioners make fewer mistakes when applying a particular method to practical problems. The key objective is to compare the three algorithms, CART, MLP and C4.5, on their classification and generalization capabilities. The algorithms are run on a version of the Penzias example and the results are summarized. It was found that, in general, the MLP has better classification and generalization accuracy than the other two algorithms.
2. Approach: For comparing classification performance, data known as the clump example (8th-order Penzias) was used, with all 256 examples used as both the training and the testing set. For comparing generalization performance, the same data is used with the first 200 examples as the training set and the rest as the test set. Parameters used: in the MLP, both the learning rate and the momentum are set to 0.1, and the architecture is 8 input neurons, 5 hidden-layer neurons and 4 output neurons. In CART, the prior probabilities are set to be equi-probable, and pruning is performed when the probability of the leaf node equals 0.5. In C4.5, all default values are used.
3. Results: In the classification results reported in the paper, mlp1 and mlp2 denote the MLP after 10,000 and 100,000 iterations respectively; the MLP accuracies improve with the number of iterations (up to about 20,000 iterations). The paper then reports the generalization results on the same data.
  • 22. The generalization accuracy of the MLP is observed to be better than CART's and comparable to C4.5's.
4. Summary: The MLP, once it has converged, generally has better classification and generalization accuracy than CART or C4.5. On the other hand, the prediction errors made by each algorithm are different, which indicates that it may be possible to combine these algorithms in such a way that their prediction accuracies improve. This is presented as a challenge for future research.
  • 23. Reviewed by: Vinayak Nair
COMPARISON OF LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS: A SIMULATION STUDY
Pohar, M., Blas, M., & Turk, S. (2004). Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Metodoloski zvezki, 1(1), 143-161.
1. Objective: Linear Discriminant Analysis (LDA) and logistic regression (LR) are two widely used statistical methods. Though both can be used to develop linear classification models, a set of guidelines is needed for proper selection: LR makes no assumptions about the distribution of the explanatory data, while LDA was developed for normally distributed explanatory variables, and the method appropriate to a problem will always give better results. The objective of the paper is to understand when to choose LDA and when LR. The two methods are compared and their performance is studied using simulations. The results of LDA and LR were found to be close whenever the normality assumptions are not too badly violated, and guidelines are set out for recognizing these situations; the inappropriateness of LDA in all other cases is discussed.
2. Approach: The simplest and most frequently used criterion for comparing the two methods is the classification error (percentage of incorrectly classified objects; CE). However, classification error is a very insensitive and statistically inefficient measure (Harrell, 1997). Harrell and Lee (1985) proposed four measures of predictive accuracy, the indexes A, B, C and Q, which are better and more efficient criteria for comparison and which indicate how well the models discriminate between the groups and how good the prediction is. (The defining formulas are given in the paper; in them, P_k denotes an estimate of P(Y_k = 1 | X_k), I is an indicator function, P_i is the probability of classification into group i, Y_i is the actual group membership (1 or 0), and n is the sample size of both populations.)
Random samples of sizes n and m are drawn from two multivariate normal populations with different mean vectors but an equal covariance matrix Σ. The mean vector of one group is always set at (0,0); the distance to the other is measured by the Mahalanobis distance, and the direction is set as the angle (denoted by υ) to the direction of the eigenvector of the covariance matrix. Each sample is then randomly divided into two parts, a training and a test sample; the coefficients of LDA and LR are computed on the first sample and predictions are made on the second. The sampling experiment is replicated 50 times, the indexes for both methods are computed each time, and finally the average index values and the proportion of simulations in which LR performs better are recorded. After sampling, the normally distributed variables can be categorized, either one or both of them: the minimum and maximum values are computed, and the whole interval is divided into a certain number of categories of equal size.
3. Results: The sample size has the most obvious impact on the difference between the methods. LDA assumes normality, and the errors it makes in prediction are only due to errors in estimating the mean and variance from the sample. LR, on the contrary, adapts itself to the distribution and assumes nothing about it; therefore, in the case of small samples, the difference between the distribution of the training sample and that of the test sample can be substantial.
As the sample size increases, the sampling distributions become more stable, which leads to better results for LR.
  • 24. Consequently, the results of the two methods get closer, because the populations are normally distributed; the results in Table 1 of the paper confirm this. As the sample size increases, the LDA coefficient estimates become more accurate and all four indexes improve. The LR indexes increase even faster, thus approaching those of LDA. The decreasing difference between the two methods is best seen in the Q index, the most sensitive one. As the differences between the index means are negligible, it is also interesting to look at the proportion of simulations in which LR performs better: the proportions for the indexes given special attention, those of the B and Q indexes, increase steadily. Under the other changes studied, the results of the two methods were found to remain very close; in fact LDA is only slightly better than LR. Simulations were also carried out to study the effects of categorization and non-linearity, but they are not presented in this literature review for lack of space; the major takeaways from the results are summarized in the next section.
4. Summary: LDA is the more appropriate method when the explanatory variables are normally distributed. In the case of categorized variables, LDA remains preferable and fails only when the number of categories is really small (2 or 3). The results of LR,
  • 25. however, are in all these cases consistently close to, and slightly worse than, those of LDA. But whenever the assumptions of LDA are not met, its use is not justified, while LR gives good results regardless of the distribution. As the estimates for LR are obtained by the maximum likelihood method, they also have a number of nice asymptotic properties.
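A compressed version of the paper's simulation design can be sketched in a few lines of R: draw two normal populations with a common covariance, split into training and test sets, and compare the LDA and LR test errors (classification error only, not the A, B, C and Q indexes). Sample sizes and mean shift are illustrative:

# Sketch: one replication of an LDA-vs-LR comparison on normal data.
library(MASS)

set.seed(7)
n  <- 200
x1 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))   # group 0
x2 <- mvrnorm(n, mu = c(1, 1), Sigma = diag(2))   # group 1, shifted mean
d  <- data.frame(rbind(x1, x2), y = factor(rep(0:1, each = n)))

tr <- sample(nrow(d), nrow(d) / 2)                # random train/test split

lda_fit <- lda(y ~ ., data = d[tr, ])
lda_err <- mean(predict(lda_fit, d[-tr, ])$class != d$y[-tr])

lr_fit <- glm(y ~ ., data = d[tr, ], family = binomial)
lr_err <- mean((predict(lr_fit, d[-tr, ], type = "response") > 0.5)
               != (d$y[-tr] == "1"))

c(LDA = lda_err, LR = lr_err)   # under normality the errors should be close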
  • 26. Project Approach
Analysis flow chart: (a flow chart of the analysis is given in the report.)
Problem Description
Propose the best model, with the highest prediction accuracy, that can be implemented in the steel plate manufacturing process to detect faults during the process and thus help reduce them by taking proper preventive measures. The assumptions are:
- The data available is the exact data taken from the production line and has no manipulations.
  • 27. - The data is not biased and is randomly selected from different production lines (if present), collected over a period of time.
Given Data
The data used for this project is taken from the UCI library. This dataset consists of 7 different steel plate faults and 27 attributes describing the features of the manufactured steel plate and of the manufacturing process.
Data set information:
Types of dependent variable (7 types of steel plate faults):
1. Pastry
2. Z_Scratch
3. K_Scatch
4. Stains
5. Dirtiness
6. Bumps
7. Other_Faults
Attribute information (27 independent variables):
X_Minimum, X_Maximum, Y_Minimum, Y_Maximum, Pixels_Areas, X_Perimeter, Y_Perimeter, Sum_of_Luminosity, Minimum_of_Luminosity, Maximum_of_Luminosity, Length_of_Conveyer, TypeOfSteel_A300, TypeOfSteel_A400, Steel_Plate_Thickness, Edges_Index, Empty_Index, Square_Index, Outside_X_Index, Edges_X_Index, Edges_Y_Index, Outside_Global_Index, LogOfAreas, Log_X_Index, Log_Y_Index, Orientation_Index, Luminosity_Index, SigmoidOfAreas
  • 28. Preliminary analysis of the data (first six rows of the dataset):
  X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas
1        42        50    270900    270944          267
2       645       651   2538079   2538108          108
3       829       835   1553913   1553931           71
4       853       860    369370    369415          176
5      1289      1306    498078    498335         2409
6       430       441    100250    100337          630
  X_Perimeter Y_Perimeter Sum_of_Luminosity
1          17          44             24220
2          10          30             11397
3           8          19              7972
4          13          45             18996
5          60         260            246930
6          20          87             62357
  Minimum_of_Luminosity Maximum_of_Luminosity
1                    76                   108
2                    84                   123
3                    99                   125
4                    99                   126
5                    37                   126
6                    64                   127
  Length_of_Conveyor TypeOfSteel_A300 TypeOfSteel_A400
1               1687                1                0
2               1687                1                0
3               1623                1                0
4               1353                0                1
5               1353                0                1
6               1387                0                1
  Steel_Plate_Thickness Edges_Index Empty_Index
1                    80      0.0498      0.2415
2                    80      0.7647      0.3793
3                   100      0.9710      0.3426
4                   290      0.7287      0.4413
5                   185      0.0695      0.4486
6                    40      0.6200      0.3417
  Square_Index Outside_X_Index Edges_X_Index Edges_Y_Index
1       0.1818          0.0047        0.4706        1.0000
2       0.2069          0.0036        0.6000        0.9667
3       0.3333          0.0037        0.7500        0.9474
4       0.1556          0.0052        0.5385        1.0000
5       0.0662          0.0126        0.2833        0.9885
6       0.1264          0.0079        0.5500        1.0000
  Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index
1                    1     2.4265      0.9031      1.6435
2                    1     2.0334      0.7782      1.4624
3                    1     1.8513      0.7782      1.2553
4                    1     2.2455      0.8451      1.6532
5                    1     3.3818      1.2305      2.4099
6                    1     2.7993      1.0414      1.9395
  Orientation_Index Luminosity_Index SigmoidOfAreas Pastry
1            0.8182          -0.2913         0.5822      1
2            0.7931          -0.1756         0.2984      1
3            0.6667          -0.1228         0.2150      1
4            0.8444          -0.1568         0.5212      1
5            0.9338          -0.1992         1.0000      1
6            0.8736          -0.2267         0.9874      1
  Z_Scratch K_Scratch Stains Dirtiness Bumps Other_Faults
1         0         0      0         0     0            0
2         0         0      0         0     0            0
3         0         0      0         0     0            0
4         0         0      0         0     0            0
5         0         0      0         0     0            0
6         0         0      0         0     0            0
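Before modeling, the seven 0/1 fault columns can be collapsed into the single class factor (the "alldefects" variable used in the implementation below). A minimal sketch; the file name is hypothetical, and the fault column names follow this report and may differ slightly in the raw UCI file:

# Sketch: collapse the seven fault indicators into one factor with levels A-G.
faults <- c("Pastry", "Z_Scratch", "K_Scratch", "Stains",
            "Dirtiness", "Bumps", "Other_Faults")

steel <- read.csv("steel_plates_faults.csv")       # hypothetical file name

stopifnot(all(rowSums(steel[, faults]) == 1))      # exactly one fault per row
colSums(steel[, faults])                           # per-class counts

# index of the column that equals 1 in each row -> one label per plate
steel$alldefects <- factor(LETTERS[1:7][max.col(as.matrix(steel[, faults]))],
                           levels = LETTERS[1:7])
table(steel$alldefects)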
  • 29. The preliminary data analysis shows that:
- There is at least one defect associated with every row of attributes.
- There are 1941 entries of defects in the whole dataset, which is equal to the total number of rows in the dataset.
- Other_Faults accounts for the majority of the defects: almost 35% of the recorded defects are Other_Faults. It can therefore be fairly predicted that the misclassification error may be high for this defect.
- No two defects are associated with a single row of input; only one defect occurs for a particular row of attributes in the data.

Type of Defect / Number of Occurrences:
Pastry (1) 158
Z_Scratch (2) 190
K_Scratch (3) 391
Stains (4) 72
Dirtiness (5) 55
Bumps (6) 402
Other_Faults (7) 673

Description of new techniques used: The new techniques chosen for this project are Artificial Neural Networks and C5.0 decision trees.
1) Artificial Neural Networks
What is a neural network? An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information.
  • 30. Components of a neuron and the synapse (figure): the figure shows the structure of the human neural system. In the human brain, a neuron receives signals from all parts of the body through a huge number of dendrites. The neuron then sends signals as electrical activity through the axon. Learning occurs through changes in the energy levels of the neurons.
What is an artificial neural network? An artificial neural network is a computing system made up of a number of simple, highly interconnected processing elements, which process information through their dynamic state response to external inputs. The structure of a neural network algorithm has three layers:
- The input layer feeds past data values into the next (hidden) layer.
- The hidden layer encapsulates several complex functions that create predictors; often these functions are hidden from the user. The set of nodes at the hidden layer represents mathematical functions that modify the input data; these functions are called neurons.
- The output layer collects the predictions made in the hidden layer and produces the final result: the model's prediction.
Neurons in a neural network can use sigmoid functions to map inputs to outputs. When used that way, a sigmoid function is called a logistic function, and its formula is f(x) = 1 / (1 + e^(-x)).
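A small sketch of the computation just described: inputs are combined with weights and passed through the sigmoid at each layer. The weights here are random placeholders, not trained values:

# Sketch: one forward pass through a tiny 3-2-1 network with sigmoid units.
sigmoid <- function(z) 1 / (1 + exp(-z))   # the logistic function above

set.seed(8)
x <- c(0.2, 0.7, 0.1)                      # one observation with 3 inputs
W_hidden <- matrix(rnorm(3 * 2), 3, 2)     # 3 inputs -> 2 hidden neurons
w_out    <- rnorm(2)                       # 2 hidden neurons -> 1 output

h   <- sigmoid(t(W_hidden) %*% x)          # hidden-layer activations
out <- sigmoid(sum(w_out * h))             # the network's prediction
out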
2) C5.0 Decision Trees
A decision tree can be considered a system for organizing a large amount of information graphically. A decision tree consists of internal nodes that represent the decisions corresponding to the hyperplanes or split points (i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which are labeled with the majority class. A region is characterized by the subset of data points that lie in that region.

One of the advantages of decision trees is that they produce models that are relatively easy to interpret. In particular, a tree can be read as a set of decision rules, with each rule's antecedent comprising the decisions on the internal nodes along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions (see the sketch after the feature list below). An example decision tree is shown in the following figure.

[Figure: example decision tree.]

The C5.0 algorithm acts similarly to ID3 but improves on several of ID3's behaviors. The new features (versus ID3) are:
1) It accepts both continuous and discrete features.
2) It handles incomplete data points.
3) Pruning is already included in the package, so the reported results are post-pruning.
4) It can use attributes with different weights.
5) Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.
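The rule-based reading of a tree described above can be produced directly by the C50 package; a minimal sketch, assuming the trainx/trainy split built in the C5.0 implementation section below:

library(C50)
rule_model <- C5.0(x = trainx, y = trainy, rules = TRUE)  # decompose the tree into rules
summary(rule_model)  # prints each rule: path conditions (antecedent) -> class label (consequent)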
IMPLEMENTATION

Linear Discriminant Analysis
The modified multiclass dataset was modeled using Linear Discriminant Analysis to obtain the confusion matrix and misclassification error rate for the test dataset. K-fold cross validation was also performed on the test data to confirm our results.

> ## LDA
> lda.model = lda(alldefects~., data = train2)
> lda_pred = predict(lda.model, test2)
> table(lda_pred$class, test.alldefects)
   test.alldefects
     A  B   C  D  E  F   G
  A 25  0   0  0  0  1  15
  B  5 50   0  0  0  4  10
  C  2  0  91  0  0  0   4
  D  0  0   1 26  0  0   2
  E  3  0   0  0 18  0   6
  F  4  2   4  0  1 67  40
  G  9  5  27  1  4 39 117
> mean(lda_pred$class != test.alldefects)
[1] 0.3241852
> ## Cross Validation
> lda.cv = lda(alldefects~., test2, CV=TRUE)
> table(lda.cv$class, test.alldefects)
   test.alldefects
     A  B   C  D  E  F   G
  A 24  0   1  0  1  1  15
  B  5 48   0  0  0  5  10
  C  0  0 104  0  0  0   2
  D  1  0   1 24  0  0   1
  E  3  0   0  0 16  0   9
  F  4  3   1  0  1 61  42
  G 11  6  16  3  5 44 115
> mean(lda.cv$class != test.alldefects)
[1] 0.3276158

The misclassification and cross-validation error rates were 32.42% and 32.76% respectively.

Decision Tree
A single decision tree was then modeled on the modified dataset. The tree was also pruned to reduce the number of branches and simplify the tree.

> ## tree
> library(tree)
> tree1 = tree(train2$alldefects~., data=train2)
> plot(tree1)
> text(tree1, pretty=0)
> tree.pred = predict(tree1, test2, type="class")
> table(tree.pred, test.alldefects)
         test.alldefects
tree.pred  A  B  C  D  E  F   G
        A  0  0  0  0  0  0   0
        B  3 51  0  0  0  2   5
        C  0  0 98  0  0  0   2
        D  0  0  0 23  0  0   1
        E  0  0  0  0  0  0   0
        F  6  0  0  1  7 80  56
        G 39  6 25  3 16 29 130
> mean(tree.pred != test.alldefects)
[1] 0.3447684
> ## pruning
> set.seed(1)
> cv.data = cv.tree(tree1, FUN=prune.misclass)
> plot(cv.data$size, cv.data$dev, type="b")
> plot(cv.data$k, cv.data$dev, type="b")
> prune.data = prune.misclass(tree1, best=9)
> plot(prune.data)
> text(prune.data, pretty=0)
> tree.pred2 = predict(prune.data, test2, type="class")
> table(tree.pred2, test.alldefects)
          test.alldefects
tree.pred2  A  B  C  D  E  F   G
         A  0  0  0  0  0  0   0
         B  3 51  0  0  0  2   5
         C  0  0 88  0  0  1   8
         D  0  0  0 23  0  0   1
         E  0  0  0  0  0  0   0
         F  2  0  0  0  1 59  28
         G 43  6 35  4 22 49 152
> mean(tree.pred2 != test.alldefects)
[1] 0.3602058
The misclassification error rates obtained from the original and the pruned tree were 34.5% and 36% respectively. The error rate increased only slightly, which justifies pruning as a way to make the decision tree more readable.

Bagging
Bagging was used on the dataset to reduce the variance of a single decision tree model by averaging predictions over many trees, each grown on a bootstrapped copy of the training data.

> ## Bagging
> set.seed(1)
> library(randomForest)
> bag.data = randomForest(alldefects~., data=train2, mtry=27, importance=TRUE)
> yhat.bag = predict(bag.data, test2)
> plot(yhat.bag, test.alldefects)
> abline(0,1)
> table(yhat.bag, test.alldefects)
        test.alldefects
yhat.bag  A  B   C  D  E  F   G
       A 30  0   0  0  0  5   7
       B  0 50   0  0  0  1   0
       C  0  0 112  0  0  0   1
       D  0  0   0 24  0  0   1
       E  1  0   0  0 19  1   2
       F  4  0   0  1  2 76  32
       G 13  7  11  2  2 28 151
> mean(yhat.bag != test.alldefects)
[1] 0.2075472

The misclassification error rate obtained was 20.75%.

Random Forest
The random forest bootstrapping method was then applied to the modified dataset to de-correlate the bagged trees, further reducing the variance.
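The only difference between the bagging call above and the random forest call below is mtry, the number of predictors sampled at each split; a minimal sketch of the contrast, assuming train2 as above:

bag.data <- randomForest(alldefects ~ ., data = train2, mtry = 27)  # bagging: consider all 27 predictors at every split
rf       <- randomForest(alldefects ~ ., data = train2)             # default mtry = floor(sqrt(27)) = 5, which de-correlates the trees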
> # random forest
> set.seed(1)
> rf = randomForest(alldefects~., data=train2, importance=TRUE)
> yhat.rf = predict(rf, test2)
> table(yhat.rf, test.alldefects)
       test.alldefects
yhat.rf  A  B   C  D  E  F   G
      A 25  0   0  0  0  6   5
      B  1 50   0  0  0  0   4
      C  0  0 112  0  0  0   1
      D  0  0   0 24  0  0   1
      E  0  0   0  0 19  0   3
      F  3  1   0  1  2 73  33
      G 19  6  11  2  2 32 147
> mean(yhat.rf != test.alldefects)
[1] 0.2281304

The misclassification error rate obtained was 22.81%, slightly higher than bagging, but the de-correlated trees give a lower variance, which is generally preferable when the model is applied to future data points.

C5.0
An advanced decision tree technique known as C5.0 was also used to model the modified dataset.

> # C5.0
> crx <- data[sample(nrow(data)), ]   # shuffle the rows
> X <- crx[,1:27]
> y <- crx[,35]
> trainx <- X[1:1358,]
> trainy <- y[1:1358]
> testx <- X[1358:1941,]   # note: row 1358 appears in both sets; starting at 1359 would give disjoint splits
> testy <- y[1358:1941]
> model <- C5.0(trainx, trainy, trials=75)   # 75 boosting iterations
> p <- predict(model, testx, type="class")
> table(p, testy)
   testy
p    A  B   C  D  E  F   G
  A 32  0   0  0  1  2   6
  B  1 39   0  0  0  1   4
  C  0  0 111  0  0  1   3
  D  0  0   0 26  0  2   0
  E  1  0   0  0 14  2   2
  F 11  0   0  1  0 96  19
  G 15  1   7  1  1 31 153
> mean(p != testy)
[1] 0.1934932

The misclassification error rate obtained was 19.35%.

Support Vector Machines
SVM was tried on the training dataset for different values of the cost parameter C, and the best results were obtained with C=15.

> svm.fit = svm(alldefects~., data=train2, type="C", kernel="polynomial", degree=3, cost=15)
> summary(svm.fit)
Call:
svm(formula = alldefects ~ ., data = train2, type = "C", kernel = "polynomial", degree = 3, cost = 15)
Parameters:
  SVM-Type: C-classification
  SVM-Kernel: polynomial
  cost: 15
  degree: 3
  gamma: 0.03703703704
  coef.0: 0
Number of Support Vectors: 819
 ( 43 225 347 87 74 18 25 )
Number of Classes: 7
Levels:
 A B C D E F G

The plot for SVM on the training data is shown in the figure below. It is a 2-D plot with Edges_X_Index and Edges_Y_Index as its axes. The circular symbols show the data points and the triangles show support vectors.

[Figure: SVM classification plot over Edges_X_Index and Edges_Y_Index.]

> predicted = predict(svm.fit, test2)
> table(predicted, test.alldefects)
         test.alldefects
predicted  A  B   C  D  E  F   G
        A 28  1   0  0  0  6   7
        B  0 49   0  0  0  3   6
        C  0  1 113  0  0  0   4
        D  0  0   0 25  0  0   0
        E  0  0   0  0 17  0   3
        F  6  0   3  1  5 69  53
        G 14  6   7  1  1 33 121
> mean(predicted != test.alldefects)
[1] 0.2761578045

The misclassification rate for SVM on the testing data is about 27.62%.

Artificial Neural Networks
For this project, artificial neural network models were developed using two different methods. The first model was built with the nnet function from the nnet library in R. The second model is a multilayer perceptron built with the mlp function from the RSNNS library in R.

1st Method
In this method the model was built using the nnet function from the nnet library in R. Many configurations were tried by changing the number of hidden units (expressed as size in the code). The rang, decay and maxit parameters were also varied to obtain a lower misclassification rate. The best model had 20 units in its hidden layer, with the other settings as shown in the code.

train.nnet <- nnet(alldefects~., data=train2, size=20, rang=0.1, Hess=FALSE, decay=0.001, maxit=10000)
# weights: 707
initial value 2736.596055
iter 10 value 2082.470121
iter 20 value 2006.626658
iter 30 value 1963.775429
iter 40 value 1907.670254
iter 50 value 1901.104216
iter 60 value 1841.389091
iter 70 value 1815.725249
iter 80 value 1804.856698
iter 90 value 1801.382263
iter 100 value 1801.021638
iter 110 value 1797.549455
iter 120 value 1797.305184
iter 130 value 1797.183004
iter 140 value 1796.918336
iter 150 value 1795.256115
iter 160 value 1793.025804
final value 1792.714314
converged
test.nnet <- predict(train.nnet, test2, type=("class"))
table(test2$alldefects, test.nnet)
   test.nnet
    1  3   7
  1 0  5  43
  2 0  7  50
  3 0 98  25
  4 0  0  27
  5 0  1  22
  6 0  6 105
  7 1 22 171
mean(test.nnet != test2$alldefects)
[1] 0.5385935
The misclassification rate for this ANN on the testing data is about 53.9%.

2nd Method
In this method the artificial neural network was built using the mlp function available in the RSNNS library in R. This multilayer perceptron takes the predictors, the responses, and the number and sizes of the hidden layers as input.

> model = mlp(x[samp,], y[samp,], size=c(10,10,5), linOut=F)
> test.cl(y[-samp,], predict(model, x[-samp,]))   # testing data
    cres
true   3   7
   1   3  42
   2   2  52
   3  50  63
   4   0  25
   5   0  16
   6   3 110
   7  16 200
> test.cl(y[samp,], fitted.values(model))   # training data
    cres
true   3   7
   1   5 108
   2   6 130
   3 119 159
   4   0  47
   5   4  35
   6   4 285
   7  33 424

The misclassification rate for this model is about 60% on the training data and about 57% on the testing data. The misclassification rates of the models built using artificial neural networks are thus very high; the lowest error rate achieved, using the nnet package, is about 54%.

Logistic Regression
We developed 7 different logistic models, one for the classification of each type of defect, as we noticed that each defect relied on a different set of predictors. We aim to develop a hierarchical model: first detect whether a given defect is present; if present, stop (the dataset implies that each steel plate has only one kind of defect); if absent, continue and check for the presence of the next type of defect, and so on.

Logistic regression model for the first type of defect: Pastry

> train_pastry = train[,-c(29,30,31,32,33,34,35)]
> fix(train_pastry)
> test_pastry = test[,-c(29,30,31,32,33,34,35)]
> log_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
> summary(log_pastry)
Call:
glm(formula = Pastry ~ LogOfAreas + TypeOfSteel_A300 + Sum_of_Luminosity + Log_X_Index + Square_Index + Orientation_Index + Log_Y_Index + Maximum_of_Luminosity + X_Maximum + X_Minimum + Length_of_Conveyor + Minimum_of_Luminosity, family = "binomial", data = train_pastry)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.01159  -0.26886  -0.05515   0.00000   3.08897

Coefficients:
                        Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)           -1.178e+01   3.366e+00   -3.500   0.000465 ***
LogOfAreas             4.551e+00   2.015e+00    2.258   0.023940 *
TypeOfSteel_A300      -5.827e-01   3.077e-01   -1.894   0.058260 .
Sum_of_Luminosity      7.952e-06   2.324e-06    3.422   0.000621 ***
Log_X_Index            1.377e+01   6.238e+00    2.207   0.027308 *
Square_Index          -4.378e+00   1.186e+00   -3.692   0.000223 ***
Orientation_Index      4.450e+00   1.267e+00    3.512   0.000445 ***
Log_Y_Index           -9.260e+00   2.309e+00   -4.010   6.07e-05 ***
Maximum_of_Luminosity  3.396e-02   9.677e-03    3.510   0.000449 ***
X_Maximum             -7.516e-01   2.283e-01   -3.292   0.000995 ***
X_Minimum              7.521e-01   2.283e-01    3.294   0.000987 ***
Length_of_Conveyor     3.830e-03   8.421e-04    4.548   5.41e-06 ***
Minimum_of_Luminosity -4.196e-02   8.145e-03   -5.151   2.59e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 427.59 on 1345 degrees of freedom
AIC: 453.59
Number of Fisher Scoring iterations: 14

> log_pastry_pred = predict(log_pastry, test_pastry, type="response")
> log_pastry_pred_y = rep(0, length(test_pastry[,28]))   # default assignment
> log_pastry_pred_y[log_pastry_pred > 0.5] = 1
> table(log_pastry_pred_y, test_pastry[,28])
log_pastry_pred_y   0   1
                0 528  33
                1   7  15
> mean(log_pastry_pred_y != test_pastry[,28])
[1] 0.06861063

We see that the misclassification error rate is less than 7%, which is acceptable for this individual model. We have also cross-validated these results using the K-fold cross-validation technique.

> # cross validation
> cv_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
> cv.glm(train_pastry, cv_pastry, K=10)$delta[1]
[1] 0.06969064

The cross-validation error rate obtained was 6.97%. Similarly, individual classification models were developed for the rest of the defects, and the results obtained are tabulated below:
Defect         Confusion matrix     Error rate   CV error
Pastry         528 33 / 7 15        0.068        0.069
Z_Scratch      511 9 / 15 48        0.041        0.030
K_Scratch      453 19 / 7 104       0.045        0.018
Stains         554 9 / 2 18         0.0189       0.020
Dirtiness      554 12 / 6 11        0.031        0.015
Bumps          447 68 / 25 43       0.16         0.124
Other_Faults   346 104 / 43 90      0.252        0.176

(In each confusion matrix, the two pairs are the predicted-0 row and the predicted-1 row, with the true class 0 first and 1 second.)

The combined accuracy of the hierarchical model would be
(1-0.068)*(1-0.041)*(1-0.045)*(1-0.0189)*(1-0.031)*(1-0.16)*(1-0.252) = 0.51 = 51%
* Since the defects are independent of each other, the probabilities of the individual models being right are multiplied.

Random Forest
Individual responses were then modeled with random forest to get the respective error rates.

> ## random forest with Pastry only
> train_pastry$Pastry = factor(train_pastry$Pastry)
> test_pastry$Pastry = factor(test_pastry$Pastry)
> rf_pastry = randomForest(Pastry~., data=train_pastry, importance=TRUE)
> yhat.rf_pastry = predict(rf_pastry, test_pastry)
> table(yhat.rf_pastry, test_pastry[,28])
yhat.rf_pastry   0   1
             0 529  32
             1   6  16
> mean(yhat.rf_pastry != test_pastry[,28])
[1] 0.0651801

The misclassification error rate obtained was 6.52%.
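For both the logistic and the random forest hierarchies, the combined accuracy quoted is simply the product of the per-defect accuracies; a one-line sketch of the arithmetic, using the logistic error rates from the table above:

errs <- c(0.068, 0.041, 0.045, 0.0189, 0.031, 0.16, 0.252)  # Pastry ... Other_Faults
prod(1 - errs)  # ~0.51, the combined hierarchical accuracy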
Similarly, random forest models were developed for the other individual defects, with the results tabulated below:

Defect         Confusion matrix     Error rate
Pastry         529 32 / 6 16        0.065
Z_Scratch      522 9 / 4 48         0.022
K_Scratch      459 14 / 1 109       0.0257
Stains         556 3 / 0 24         0.0051
Dirtiness      558 9 / 2 14         0.0189
Bumps          460 53 / 12 58       0.111
Other_Faults   363 73 / 26 121      0.17

The combined accuracy of the hierarchical model would be
(1-0.065)*(1-0.022)*(1-0.0257)*(1-0.005)*(1-0.0189)*(1-0.111)*(1-0.17) = 0.6417 = 64.17%
* Since the defects are independent of each other, the probabilities of the individual models being right are multiplied.

Principal Component Analysis
A dimensionality reduction technique was applied to the dataset, following the 80/20 rule, to extract the "vital few" prediction terms from the "trivial many".

> # PCA on complete data set
> datap = data[,-(28:35)]
> fit <- princomp(datap, cor=TRUE)
> summary(fit)   # print variance accounted for
Importance of components:

Component   Standard deviation   Proportion of Variance   Cumulative Proportion
Comp.1      2.8815               0.30753                  0.30753
Comp.2      1.8493               0.12667                  0.43420
Comp.3      1.6443               0.10014                  0.53434
Comp.4      1.4960               0.08289                  0.61723
Comp.5      1.4041               0.07302                  0.69025
Comp.6      1.2742               0.06014                  0.75038
Comp.7      1.1739               0.05104                  0.80142
Comp.8      0.9999               0.03703                  0.83845
Comp.9      0.9601               0.03414                  0.87258
Comp.10     0.8837               0.02892                  0.90151
Comp.11     0.8459               0.02650                  0.92801
Comp.12     0.7397               0.02027                  0.94827
Comp.13     0.6270               0.01456                  0.96284
Comp.14     0.5430               0.01092                  0.97376
Comp.15     0.4892               0.00886                  0.98262
Comp.16     0.4345               0.00699                  0.98961
Comp.17     0.3164               0.00371                  0.99332
Comp.18     0.2435               0.00220                  0.99551
Comp.19     0.2357               0.00206                  0.99757
Comp.20     0.2117               0.00166                  0.99923
Comp.21     0.1094               0.00044                  0.99967
Comp.22     0.0837               0.00026                  0.99993
Comp.23     0.0369               0.00005                  0.99998
Comp.24     0.0222               0.00002                  1.00000
Comp.25     0.0035               0.00000                  1.00000
Comp.26     0.0000               0.00000                  1.00000
Comp.27     0.0000               0.00000                  1.00000

> plot(fit, type="lines")   # scree plot
> biplot(fit)

[Figure: scree plot and biplot of the principal components.]

From the analysis and the scree plot, the first 7 principal components were selected, as they explain just over 80% of the variability in the sample space. The principal components were then extracted and stored in two different files:
- one with only the top 7 principal components;
- another with the original data and the top 7 principal components combined.

> axes <- predict(fit, newdata = datap)
> fix(axes)
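# A quick check of the 80% cut-off used above: the cumulative proportion of
# variance explained by the princomp fit first crosses 0.80 at the 7th component.
pve <- fit$sdev^2 / sum(fit$sdev^2)   # proportion of variance explained per component
which(cumsum(pve) >= 0.80)[1]         # 7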
> data1 = axes[,1:7]
> fix(data1)
> write.csv(data1, file="pcadata.csv")   # data file with the top 7 PCs
> data2 = data.frame(data, data1)
> write.csv(data2, file="comb_data.csv") # data file with the original data and the 7 PCs combined

These two data files were used for further modelling.

Model Formation with Principal Components
Logistic Regression
Logistic regression was performed on the extracted principal components for the individual responses.

Logistic regression model for the first type of defect: Pastry (using the PCs)

> ## logistic regression - Pastry
> train_pastry = train[,-c(9,10,11,12,13,14,15)]
> fix(train_pastry)
> test_pastry = test[,-c(9,10,11,12,13,14,15)]
> log_pastry = glm(Pastry~., data=train_pastry, family="binomial")
> summary(log_pastry)
Call:
glm(formula = Pastry ~ ., family = "binomial", data = train_pastry)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -1.5477  -0.4039  -0.1326  -0.0243   3.5952
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -4.51776  0.38758     -11.656  < 2e-16 ***
Comp.1       -0.63814  0.11686     -5.461   4.75e-08 ***
Comp.2       -1.17226  0.14334     -8.178   2.88e-16 ***
Comp.3       -0.37291  0.08243     -4.524   6.07e-06 ***
Comp.4       -0.25626  0.08068     -3.176   0.00149 **
Comp.5        0.42288  0.07491      5.645   1.65e-08 ***
Comp.6        0.17290  0.09726      1.778   0.07546 .
Comp.7        0.40363  0.10328      3.908   9.31e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 763.76 on 1357 degrees of freedom
Residual deviance: 541.13 on 1350 degrees of freedom
AIC: 557.13
Number of Fisher Scoring iterations: 8

> log_pastry_pred = predict(log_pastry, test_pastry, type="response")
> log_pastry_pred_y = rep(0, length(test_pastry[,8]))   # default assignment
> log_pastry_pred_y[log_pastry_pred > 0.5] = 1
> table(log_pastry_pred_y, test_pastry[,8])
log_pastry_pred_y   0   1
                0 529  44
                1   6   4
> mean(log_pastry_pred_y != test_pastry[,8])
[1] 0.08576329
> # cross validation
> cv_pastry = glm(Pastry~., data=train_pastry, family="binomial")
> cv.glm(train_pastry, cv_pastry, K=10)$delta[1]
[1] 0.06398761

Similarly, individual classification models were developed for the rest of the defects, with the results tabulated below:

Defect         Confusion matrix     Error rate   CV error
Pastry         529 44 / 6 4         0.0857       0.064
Z_Scratch      506 30 / 20 27       0.0857       0.0525
K_Scratch      452 29 / 8 94        0.0634       0.0257
Stains         553 7 / 3 20         0.0172       0.0157
Dirtiness      557 23 / 3 0         0.0446       0.0224
Bumps          447 86 / 25 25       0.19         0.14
Other_Faults   344 121 / 45 73      0.285        0.198

The combined accuracy of the hierarchical model would be
(1-0.0857)*(1-0.0857)*(1-0.0634)*(1-0.0172)*(1-0.0446)*(1-0.19)*(1-0.285) = 0.4258 = 42.58%
* Since the defects are independent of each other, the probabilities of the individual models being right are multiplied.

Random Forest
The dataset consisting of the principal components was then used with a random forest model.

> # random forest with Pastry only
> set.seed(1)
> train_pastry$Pastry = factor(train_pastry$Pastry)
> test_pastry$Pastry = factor(test_pastry$Pastry)
> rf_pastry = randomForest(Pastry~., data=train_pastry, importance=TRUE)
> yhat.rf_pastry = predict(rf_pastry, test_pastry)
> table(yhat.rf_pastry, test_pastry[,8])
yhat.rf_pastry   0   1
             0 529  38
             1   6  10
> mean(yhat.rf_pastry != test_pastry[,8])
[1] 0.0754717

The misclassification error rate obtained was 7.55%. With random forest again giving better results (as expected), it was decided to model all the individual responses with random forest using both datasets (the one with only the 7 PCs and the one with the original predictors plus the 7 PCs). The results obtained are tabulated below.

Misclassification error rates for individual random forests with different prediction terms:

S No.  Type of defect     Using first 7 PCs   Using all 27 predictors   Using all predictors + 7 PCs
1      Pastry (A)         0.075               0.065                     0.065
2      Z_Scratch (B)      0.046               0.022                     0.024
3      K_Scratch (C)      0.036               0.026                     0.027
4      Stains (D)         0.012               0.005                     0.005
5      Dirtiness (E)      0.019               0.019                     0.015
6      Bumps (F)          0.042               0.111                     0.127
7      Other_Faults (G)   0.196               0.170                     0.168

Accuracy for the combined model: 0.635 (first 7 PCs), 0.641 (all 27 predictors), 0.631 (all predictors + 7 PCs)
RESULTS

ROC Analysis
ROC analysis was conducted on the different multiclass models to aid in selecting the best model. (The confusion matrix and ROC curve for each model were shown in the original table; the error rates and areas under the curve are summarized here.)

Comparison of different multiclass models:

S No.  Modelling Technique             Misclassification Error Rate   AUC
1      LDA                             0.324                          0.790
2      Decision Tree (after pruning)   0.360                          0.784
3      Bagging                         0.208                          0.824
4      Random Forest                   0.228                          0.797
5      SVM                             0.276                          0.804
6      Neural Network Analysis         0.539                          0.605
7      C5.0                            0.194                          0.831
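Both columns of the table come straight from each model's predictions on the test set; a minimal sketch, assuming the fitted bagging model from the implementation section (the appendix uses the same pROC calls):

library(pROC)
misclass_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)  # 1 minus the share of diagonal entries
misclass_rate(table(yhat.bag, test.alldefects))               # ~0.208 for bagging
multiclass.roc(test.alldefects, as.numeric(yhat.bag))$auc     # multiclass area under the curve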
The C5.0 decision tree had the best performance on the testing dataset. ROC analysis was also conducted for the logistic regression and random forest models that were developed for each individual defect; the results are tabulated below.

ROC for individual defects using Logistic Regression:

[Figure: ROC curves for the seven per-defect logistic regression models.]

Defect         AUC (logistic regression)
Pastry         0.65
Z_Scratch      0.91
K_Scratch      0.92
Stains         0.83
Dirtiness      0.73
Bumps          0.67
Other_Faults   0.68
ROC for individual defects using Random Forest:

[Figure: ROC curves for the seven per-defect random forest models.]

Defect         AUC (random forest)
Pastry         0.65
Z_Scratch      0.91
K_Scratch      0.92
Stains         0.83
Dirtiness      0.73
Bumps          0.67
Other_Faults   0.68
CONCLUSION

The major takeaways from this project were:
- Advanced decision trees such as C5.0 and Random Forest are the most efficient techniques for multiclass anomaly detection using machine learning.
- Although modelling for individual defects gives a very high accuracy rate for almost every defect, the combined hierarchical model that would use them in practice is not as efficient, because its accuracy is the product of the individual accuracies of the models used in the hierarchical model.
- Logistic regression, although a very powerful tool, does not seem to be a good fit for multiclass anomaly detection problems, because a logistic regression model does not predict the type of defect directly but rather the probability of that defect occurring, via the log-likelihood function.
- SVM also turned out to be a good tool for multiclass classification, as its accuracy rate was high, but we still prefer C5.0 over SVM because the SVM required a very large number of support vectors (819).
- The artificial neural network results on this dataset were not satisfactory, with a very high misclassification rate. There can be many reasons for this, but the major one is the small dataset. There is also no systematic method to tune the number of hidden units, and a slight change in that number causes a significant change in misclassification. So either the right parameter values were not found even after a lot of trial and error, or the model could not be trained properly because of the small dataset.

Future Scope:
The dataset considered in this project uses multiclass classification techniques because the defects are not correlated; for the same reason, some techniques were applied in a multi-univariate fashion, using a different model for each fault. Multi-label classification, in which the same predictor values cause two or more defects at a time, is not the case in this particular dataset and hence was not used. This project and its results are therefore limited to multiclass classification with uncorrelated faults. If, for future data, the defects become correlated, multi-label classification would have to be used, and that is the future scope of this project.
References
Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system", Measurement, vol. 43, no. 4, pp. 513-519, 2010.
S. Jain, C. Azad and V. K. Jha, "Steel faults diagnosis under predictive analysis", International Journal of Computer Engineering and Applications, vol. IV, no. II/III, Oct. 2013.
A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.
M. Fakhr and A. M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science, vol. 8, no. 4, pp. 506-514, 2012.
S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.
R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms", Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006. http://dx.doi.org/10.1145/1143844.1143865
F. Günther and S. Fritsch, "neuralnet: Training of Neural Networks", The R Journal, vol. 2, no. 1, June 2010.
M. Halawani, "A study of decision tree ensembles and feature selection for steel plates faults detection", International Journal of Technical Research and Applications, vol. 2, no. 4, pp. 127-131, 2014.
A. C. Tsoi and R. A. Pearson, "Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons", Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, CA, pp. 963-969, 1991.
M. Pohar, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study", Metodoloski zvezki, vol. 1, no. 1, pp. 143-161, 2004.
M. Caudill, "Neural Network Primer: Part I", AI Expert, Feb. 1989.
http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
http://saiconference.com/Downloads/SpecialIssueNo10/Paper_3A_comparative_study_of_decision_tree_ID3_and_C4.5.pdf
www.cs.princeton.edu

APPENDIX
1. On Original Dataset

library(ISLR)
library(boot)
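# Packages used throughout this appendix and loaded where first needed below:
# MASS, tree, randomForest, e1071, nnet, RSNNS, pROC (for roc/multiclass.roc) and C50.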
library(MASS)
data=read.csv(file.choose(), header=T)
attach(data)

# recode the seven indicator columns into a single factor response
data$alldefects="A"
for(i in 1:1941) {
  if (Z_Scratch[i]==1)    {data$alldefects[i]="B"}
  if (K_Scratch[i]==1)    {data$alldefects[i]="C"}
  if (Stains[i]==1)       {data$alldefects[i]="D"}
  if (Dirtiness[i]==1)    {data$alldefects[i]="E"}
  if (Bumps[i]==1)        {data$alldefects[i]="F"}
  if (Other_Faults[i]==1) {data$alldefects[i]="G"}
}
data$alldefects=factor(data$alldefects)

# 70/30 train/test split
set.seed(1)
trainingsample=sample(1:nrow(data), size=0.70*nrow(data))
train=data[trainingsample,]
test=data[-trainingsample,]
write.csv(train,file="exportedtrainingdata.csv")
write.csv(test,file="exportedtestingdata.csv")
train2=train[,-(28:34)]
test2=test[,-(28:34)]
test.alldefects=test2[,28]

# LDA
lda.model= lda(alldefects~., data = train2)
lda_pred= predict(lda.model, test2)
table(lda_pred$class, test.alldefects)
mean(lda_pred$class!= test.alldefects)
mean(lda_pred$class== test.alldefects)
lda.cv=lda(alldefects~.,test2, CV=TRUE)
table(lda.cv$class,test.alldefects)
mean(lda.cv$class!= test.alldefects)

# ROC (multiclass.roc is from the pROC package)
predictions <- as.numeric(lda_pred$class, type="response")
multiclass.roc(test.alldefects, predictions, plot=T)
y=rep(0,length(lda_pred$class))
y[lda_pred$class==test.alldefects]=1
x=rep(0,length(test.alldefects))
x[test.alldefects==test.alldefects]=1   # note: this sets every element of x to 1
roc(x,y,plot=TRUE,main="LDA")
predictions_lda <- as.numeric(lda_pred$class)   # fixed: as.numeric() on the lda_pred list itself would fail
multiclass.roc(test.alldefects, predictions_lda, plot=T)
# QDA
qda.model= qda(alldefects~., data = train2)
qda_pred= predict(qda.model, test2)
table(qda_pred$class, test.alldefects)
mean(qda_pred$class!= test.alldefects)

## tree
library(tree)
tree1=tree(train2$alldefects~., data=train2)
plot(tree1)
text(tree1, pretty=0)
tree.pred=predict(tree1, test2, type="class")
table(tree.pred, test.alldefects)
mean(tree.pred!=test.alldefects)
predictions_tree <- as.numeric(tree.pred, type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)

## pruning
set.seed(1)
cv.data =cv.tree(tree1, FUN=prune.misclass)
names(cv.data)
cv.data
par(mfrow=c(1,1))
plot(cv.data$size, cv.data$dev, type="b")
plot(cv.data$k, cv.data$dev, type="b")
prune.data = prune.misclass(tree1, best=9)
plot(prune.data)
text(prune.data, pretty=0)
tree.pred2=predict(prune.data, test2, type="class")
table(tree.pred2, test.alldefects)
mean(tree.pred2!=test.alldefects)
predictions_tree <- as.numeric(tree.pred2, type="response")
multiclass.roc(test.alldefects, predictions_tree, plot=T)

## Bagging
set.seed(1)
# a reduced-predictor bagging fit was tried first, then overwritten by the full model that is reported
bag.data =randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, mtry=10, importance=TRUE)
bag.data =randomForest(alldefects~., data=train2, mtry=27, importance=TRUE)
bag.data
yhat.bag = predict(bag.data, test2)
plot(yhat.bag, test.alldefects)
abline(0,1)
table(yhat.bag, test.alldefects)
mean(yhat.bag!=test.alldefects)
predictions_bag <- as.numeric(yhat.bag, type="response")
multiclass.roc(test.alldefects, predictions_bag, plot=T)

# random forest
set.seed(1)
library(randomForest)
rf =randomForest(alldefects~., data=train2, importance=TRUE)
yhat.rf = predict(rf, test2)
table(yhat.rf, test.alldefects)
mean(yhat.rf!=test.alldefects)
predictions <- as.numeric(predict(rf, test2, type='response'))
multiclass.roc(test.alldefects, predictions, plot=T)

# random forest with important predictors
set.seed(1)
library(randomForest)
rrf =randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, importance=TRUE)
yhat.rrf = predict(rrf, test2)
table(yhat.rrf, test.alldefects)
mean(yhat.rrf!=test.alldefects)

# random forest with Pastry only
# (train_pastry, test_pastry and the other per-defect frames are created in the logistic regression section below)
set.seed(1)
train_pastry$Pastry=factor(train_pastry$Pastry)
test_pastry$Pastry=factor(test_pastry$Pastry)
rf_pastry =randomForest(Pastry~., data=train_pastry, importance=TRUE)
yhat.rf_pastry = predict(rf_pastry, test_pastry)
table(yhat.rf_pastry, test_pastry[,28])
mean(yhat.rf_pastry!=test_pastry[,28])

# random forest with Z_Scratch only
set.seed(1)
train_zs$Z_Scratch=factor(train_zs$Z_Scratch)
test_zs$Z_Scratch=factor(test_zs$Z_Scratch)
rf_zs =randomForest(Z_Scratch~., data=train_zs, importance=TRUE)
yhat.rf_zs = predict(rf_zs, test_zs)
table(yhat.rf_zs, test_zs[,28])
mean(yhat.rf_zs!=test_zs[,28])

# random forest with K_Scratch only
set.seed(1)
train_ks$K_Scratch=factor(train_ks$K_Scratch)
test_ks$K_Scratch=factor(test_ks$K_Scratch)
rf_ks =randomForest(K_Scratch~., data=train_ks, importance=TRUE)
yhat.rf_ks = predict(rf_ks, test_ks)
table(yhat.rf_ks, test_ks[,28])
mean(yhat.rf_ks!=test_ks[,28])

# random forest with Stains only
set.seed(1)
train_stains$Stains=factor(train_stains$Stains)
test_stains$Stains=factor(test_stains$Stains)
rf_stains =randomForest(Stains~., data=train_stains, importance=TRUE)
yhat.rf_stains = predict(rf_stains, test_stains)
table(yhat.rf_stains, test_stains[,28])
mean(yhat.rf_stains!=test_stains[,28])

# random forest with Dirtiness only
set.seed(1)
train_dirt$Dirtiness=factor(train_dirt$Dirtiness)
test_dirt$Dirtiness=factor(test_dirt$Dirtiness)
rf_dirt =randomForest(Dirtiness~., data=train_dirt, importance=TRUE)
yhat.rf_dirt = predict(rf_dirt, test_dirt)
table(yhat.rf_dirt, test_dirt[,28])
mean(yhat.rf_dirt!=test_dirt[,28])

# random forest with Bumps only
set.seed(1)
train_bumps$Bumps=factor(train_bumps$Bumps)
test_bumps$Bumps=factor(test_bumps$Bumps)
rf_bumps =randomForest(Bumps~., data=train_bumps, importance=TRUE)
yhat.rf_bumps = predict(rf_bumps, test_bumps)
table(yhat.rf_bumps, test_bumps[,28])
mean(yhat.rf_bumps!=test_bumps[,28])

# random forest with Other_Faults only
set.seed(1)
train_of$Other_Faults=factor(train_of$Other_Faults)
test_of$Other_Faults=factor(test_of$Other_Faults)
rf_of =randomForest(Other_Faults~., data=train_of, importance=TRUE)
yhat.rf_of = predict(rf_of, test_of)
table(yhat.rf_of, test_of[,28])
mean(yhat.rf_of!=test_of[,28])
rf.cv=randomForest(train_of$Other_Faults~., data=train_of, CV=TRUE)   # note: randomForest has no CV argument; out-of-bag predictions live in rf.cv$predicted
table(rf.cv$predicted, train_of[,8])   # fixed: a randomForest fit has $predicted, not $class
r = randomForest(alldefects~., data = train2, importance=TRUE, do.trace=100)   # fixed: the data argument originally referred to an undefined object (cadets)
varImpPlot(r)

##################### logistic regression #####################
# Pastry
train_pastry=train[,-c(29,30,31,32,33,34,35)]
fix(train_pastry)
test_pastry= test[,-c(29,30,31,32,33,34,35)]
attach(train_pastry)
attach(test_pastry)
log_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
summary(log_pastry)
log_pastry_pred = predict(log_pastry, test_pastry, type="response")
log_pastry_pred_y = rep(0, length(test_pastry[,28]))  # default assignment
log_pastry_pred_y[log_pastry_pred > 0.5] = 1
table(log_pastry_pred_y, test_pastry[,28])
mean(log_pastry_pred_y != test_pastry[,28])

# ROC
y=rep(0,length(log_pastry_pred_y))
y[log_pastry_pred_y==1]=1
x=rep(0,length(test_pastry[,28]))
x[test_pastry[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ PASTRY")

# cross validation
cv_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
cv.glm(train_pastry, cv_pastry, K=10)$delta[1]

# Z_Scratch
train_zs=train[,-c(28,30,31,32,33,34,35)]
fix(train_zs)
test_zs= test[,-c(28,30,31,32,33,34,35)]
attach(train_zs)
attach(test_zs)
log_zs = glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Luminosity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300, data=train_zs, family="binomial")
summary(log_zs)
log_zs_pred = predict(log_zs, test_zs, type="response")
log_zs_pred_y = rep(0, length(test_zs[,28]))  # default assignment
log_zs_pred_y[log_zs_pred > 0.5] = 1
table(log_zs_pred_y, test_zs[,28])
mean(log_zs_pred_y != test_zs[,28])

# ROC
y=rep(0,length(log_zs_pred_y))
y[log_zs_pred_y==1]=1
x=rep(0,length(test_zs[,28]))
x[test_zs[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Z_Scratch")

# CV
log_zs=step(glm(Z_Scratch~., data=train_zs, family="binomial"), direction="backward")
cv_zs = glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Luminosity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300, data=train_zs, family="binomial")
cv.glm(train_zs, cv_zs, K=10)$delta[1]

# K_Scratch
train_ks=train[,-c(28,29,31,32,33,34,35)]
test_ks= test[,-c(28,29,31,32,33,34,35)]
attach(train_ks)
attach(test_ks)
log_ks = glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness, data=train_ks, family="binomial")
summary(log_ks)
log_ks_pred = predict(log_ks, test_ks, type="response")
log_ks_pred_y = rep(0, length(test_ks[,28]))  # default assignment
log_ks_pred_y[log_ks_pred > 0.5] = 1
table(log_ks_pred_y, test_ks[,28])
mean(log_ks_pred_y != test_ks[,28])
log_ks=step(glm(K_Scratch~., data=train_ks, family="binomial"), direction="backward")

# ROC
y=rep(0,length(log_ks_pred_y))
y[log_ks_pred_y==1]=1
x=rep(0,length(test_ks[,28]))
x[test_ks[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ K_Scratch")

# CV
cv_ks = glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness, data=train_ks, family="binomial")
cv.glm(train_ks, cv_ks, K=10)$delta[1]

# Stains
train_stains=train[,-c(28,29,30,32,33,34,35)]
test_stains= test[,-c(28,29,30,32,33,34,35)]
# the model was fitted twice with slightly different predictor sets; a stray comma in the second original formula has been fixed to +
log_stains=glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_X_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")
log_stains=glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")
summary(log_stains)
log_stains_pred = predict(log_stains, test_stains, type="response")
log_stains_pred_y = rep(0, length(test_stains[,28]))  # default assignment (fixed: originally used test_of here)
log_stains_pred_y[log_stains_pred > 0.5] = 1
table(log_stains_pred_y, test_stains[,28])
mean(log_stains_pred_y != test_stains[,28])
log_stains=step(glm(Stains~., data=train_stains, family="binomial"), direction="backward")

# ROC
y=rep(0,length(log_stains_pred_y))
y[log_stains_pred_y==1]=1
x=rep(0,length(test_stains[,28]))
x[test_stains[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Stains")

# cross validation
cv_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")   # fixed: stray comma in the original formula replaced by +
cv.glm(train_stains, cv_stains, K=10)$delta[1]

# Dirtiness
train_dirt=train[,-c(28,29,30,31,33,34,35)]
test_dirt= test[,-c(28,29,30,31,33,34,35)]
log_dirt = glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_Maximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index, data=train_dirt, family="binomial")
summary(log_dirt)
log_dirt_pred = predict(log_dirt, test_dirt, type="response")
log_dirt_pred_y = rep(0, length(test_dirt[,28]))  # default assignment
log_dirt_pred_y[log_dirt_pred > 0.5] = 1
table(log_dirt_pred_y, test_dirt[,28])
mean(log_dirt_pred_y != test_dirt[,28])
log_dirt=step(glm(Dirtiness~., data=train_dirt, family="binomial"), direction="backward")

# cross validation
cv_dirt = glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_Maximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index, data=train_dirt, family="binomial")
cv.glm(train_dirt, cv_dirt, K=10)$delta[1]

# ROC
y=rep(0,length(log_dirt_pred_y))
y[log_dirt_pred_y==1]=1
x=rep(0,length(test_dirt[,28]))
x[test_dirt[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Dirtiness")

# Bumps
train_bumps=train[,-c(28,29,30,31,32,34,35)]
test_bumps= test[,-c(28,29,30,31,32,34,35)]
log_bumps = glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+Y_Minimum+TypeOfSteel_A300, data=train_bumps, family="binomial")
summary(log_bumps)
log_bumps_pred = predict(log_bumps, test_bumps, type="response")
log_bumps_pred_y = rep(0, length(test_bumps[,28]))  # default assignment (fixed: originally used test_of here)
log_bumps_pred_y[log_bumps_pred > 0.5] = 1
table(log_bumps_pred_y, test_bumps[,28])
mean(log_bumps_pred_y != test_bumps[,28])
log_bumps=step(glm(Bumps~., data=train_bumps, family="binomial"), direction="backward")

# cross validation
cv_bumps = glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+Y_Minimum+TypeOfSteel_A300, data=train_bumps, family="binomial")
cv.glm(train_bumps, cv_bumps, K=10)$delta[1]

# ROC
y=rep(0,length(log_bumps_pred_y))
y[log_bumps_pred_y==1]=1
x=rep(0,length(test_bumps[,28]))
x[test_bumps[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Bumps")

# Other_Faults
train_of=train[,-c(28,29,30,31,32,33,35)]
test_of= test[,-c(28,29,30,31,32,33,35)]
log_of = glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Conveyor+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel_Plate_Thickness, data=train_of, family="binomial")
summary(log_of)
log_of_pred = predict(log_of, test_of, type="response")
log_of_pred_y = rep(0, length(test_of[,28]))  # default assignment
log_of_pred_y[log_of_pred > 0.5] = 1
table(log_of_pred_y, test_of[,28])
mean(log_of_pred_y != test_of[,28])
log_of=step(glm(Other_Faults~., data=train_of, family="binomial"), direction="backward")

# ROC
y=rep(0,length(log_of_pred_y))
y[log_of_pred_y==1]=1
x=rep(0,length(test_of[,28]))
x[test_of[,28]==1]=1
roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Other Faults")

# cross validation
cv_of = glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Conveyor+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel_Plate_Thickness, data=train_of, family="binomial")
cv.glm(train_of, cv_of, K=10)$delta[1]

########## PCA ##########
# PCA on complete data set
datap=data[,-(28:35)]
fit <- princomp(datap, cor=TRUE)
summary(fit)   # print variance accounted for
loadings(fit)  # pc loadings
plot(fit, type="lines")  # scree plot
fit$scores     # the principal components
biplot(fit)
axes <- predict(fit, newdata = datap)
head(axes, 4)
fix(axes)
data1=axes[,1:7]
write.csv(data1, file="pcadata.csv")
data2=data.frame(data, data1)
write.csv(data2, file="comb_data.csv")

# SVM
install.packages("e1071")
library(e1071)
svm.fit=svm(alldefects~., data=train2, type="C", kernel="polynomial", degree=3, cost=15)
summary(svm.fit)
predicted=predict(svm.fit, test2)
table(predicted, test.alldefects)
mean(predicted!=test.alldefects)
plot(svm.fit, train2, Length_of_Conveyor~X_Maximum, slice=list(X_Perimeter=3, Y_Perimeter=4), svSymbol=1, dataSymbol=2, color.palette=terrain.colors)

# ROC
predictions=as.numeric(predicted, type="response")
multiclass.roc(test.alldefects, predictions, plot=T, main="ROC for SVM")

# ANN
library(nnet)
train.nnet<-nnet(alldefects~., data=train2, size=20, rang=0.1, Hess=FALSE, decay=0.001, maxit=10000)
test.nnet<-predict(train.nnet, test2, type=("class"))
table(test2$alldefects, test.nnet)
mean(test.nnet!=test2$alldefects)
library(pROC)
predictions=as.numeric(test.nnet, type="response")
multiclass.roc(test2$alldefects, predictions, plot=T, main="ROC for ANN")

# read data stored in CSV file
data=read.csv("Steel_faults.csv", header=TRUE)
attach(data)
x=data[,1:27]   # input variables
y=data[,28:34]  # response variables
n=1941          # total number of observations
n1=round(n*0.7) # number of observations for training
samp=sample(1:n, n1, replace=FALSE)  # to select random observations

## user-defined function to obtain a confusion matrix
test.cl = function(true, pred) {
  true = max.col(true)
  cres = max.col(pred)
  table(true, cres)
}

## another package for ANN
install.packages("RSNNS")
library(RSNNS)
model=mlp(x[samp,], y[samp,], size=c(10,10,5), linOut=F)
model=mlp(train2[,-28], train2[,28], size=2, linOut=F)
#library(devtools)
#plot.nnet(model)
test.cl(y[-samp,], predict(model, x[-samp,]))  # confusion matrix for testing data (comment fixed: these are the held-out rows)
test.cl(y[samp,], fitted.values(model))        # confusion matrix for training data (comment fixed: fitted values come from the training rows)

# C5.0
library(C50)   # added: the C50 package provides C5.0()
crx <- data[sample(nrow(data)), ]
X <- crx[,1:27]
y <- crx[,35]
trainx <- X[1:1358,]
trainy <- y[1:1358]
testx <- X[1358:1941,]
testy <- y[1358:1941]
model <- C5.0(trainx, trainy, trials=75)
p <- predict(model, testx, type="class")
sum(p == testy) / length(p)
table(p, testy)
mean(p != testy)
predictions_c5 <- as.numeric(p, type="response")
multiclass.roc(testy, predictions_c5, plot=T)