The document compares techniques for handling incomplete data when using decision trees. It investigates the robustness and accuracy of seven popular techniques when applied to different proportions, patterns and mechanisms of missing data in 21 datasets. The techniques include listwise deletion, decision tree single imputation, expectation maximization single imputation, mean/mode single imputation, and multiple imputation. The results suggest important differences between the techniques, with multiple imputation and decision tree single imputation generally performing better than the others. The choice of technique depends on factors like the amount and nature of the missing data.
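To make the comparison concrete, here is a minimal sketch (not the study's own code) of two of the techniques named above, listwise deletion and mean single imputation, applied before training a decision tree. scikit-learn and the iris data are illustrative stand-ins for the paper's 21 datasets, and the 20% missing-completely-at-random rate is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True, as_frame=True)

# Inject 20% missingness completely at random into the feature matrix.
mask = rng.random(X.shape) < 0.20
X_missing = X.mask(mask)

X_tr, X_te, y_tr, y_te = train_test_split(X_missing, y, random_state=0)

# Technique 1: listwise deletion -- drop every training row with a missing value.
keep = X_tr.dropna().index
tree_del = DecisionTreeClassifier(random_state=0).fit(X_tr.loc[keep], y_tr.loc[keep])

# Technique 2: mean single imputation -- fill each column with its training mean.
imp = SimpleImputer(strategy="mean").fit(X_tr)
tree_imp = DecisionTreeClassifier(random_state=0).fit(imp.transform(X_tr), y_tr)

# Evaluate both on a mean-imputed test set so predictions are always possible.
X_te_filled = imp.transform(X_te)
print("listwise deletion :", accuracy_score(y_te, tree_del.predict(X_te_filled)))
print("mean imputation   :", accuracy_score(y_te, tree_imp.predict(X_te_filled)))
```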
A Magnified Application of Deficient Data Using Bolzano Classifier (journal ijrtem)
Abstract: Missing and inconsistent values are a pervasive and persistent problem in large datasets. Single imputation is often used to fill in missing data, whereas multiple imputation generates several plausible replacement values for each missing entry. Consequently, a variety of machine learning (ML) techniques have been developed to work with incomplete information. This work evaluates multiple imputation of missing data in large datasets; the evaluation of MI covers several unsupervised ML statistics such as the mean, median and standard deviation, together with a supervised probabilistic technique, the NBI classifier. The research is carried out on a comprehensive range of databases: missing cases are first filled in with several sets of plausible values to create multiple completed datasets, standard complete-data operations are then applied to each completed dataset, and finally the multiple sets of results are combined to produce a single inference. The main aim of this report is to offer general guidelines for selecting suitable data imputation algorithms based on the characteristics of the data. The Bolzano-Weierstrass theorem is applied within the machine learning techniques to assess that every sequence of rational and irrational numbers has a monotonic subsequence and that every such sequence has a finite bound. To evaluate the imputation of missing data, standard machine learning repository datasets are used. Experimental results show that the proposed approach achieves good accuracy, with accuracy measured as a percentage. Keywords: Bolzano Classifier, Maximum Likelihood, NBI classifier, posterior probability, predictor probability, prior probability, Supervised ML, Unsupervised ML.
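As a rough illustration of the multiple-imputation-then-combine workflow the abstract describes (generate several completed datasets, run the same analysis on each, pool the results), the sketch below uses scikit-learn's IterativeImputer with posterior sampling and a Gaussian Naive Bayes classifier as stand-ins; the toy dataset, the number of imputations M, and the NB variant are assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
X = np.where(rng.random(X.shape) < 0.15, np.nan, X)   # inject missingness
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

M = 5                      # number of imputed datasets
probas = []
for m in range(M):
    # sample_posterior=True draws each fill-in value from a predictive
    # distribution, so the M completed datasets differ from one another.
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    Xtr_m, Xte_m = imp.fit_transform(X_tr), imp.transform(X_te)
    probas.append(GaussianNB().fit(Xtr_m, y_tr).predict_proba(Xte_m))

pooled = np.mean(probas, axis=0)           # combine the M analyses
accuracy = (pooled.argmax(axis=1) == y_te).mean()
print(f"pooled NB accuracy over {M} imputations: {accuracy:.3f}")
```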
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit... (csandit)
Feature selection is a problem closely related to dimensionality reduction. A commonly used approach in feature selection is to rank the individual features according to some criterion and then search for an optimal feature subset using an evaluation criterion to test optimality. The objective of this work is to predict more accurately the presence of Learning Disability (LD) in school-aged children using a reduced number of symptoms. For this purpose, a novel hybrid feature selection approach is proposed that integrates a popular Rough Set based feature ranking process with a modified backward feature elimination algorithm. The approach first ranks the symptoms of LD according to their importance in the data domain; each symptom's significance or priority value reflects its relative importance for predicting LD across the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage of the process, an optimal feature subset is generated. The experimental results show that the proposed method removes redundant attributes efficiently from the LD dataset without sacrificing classification performance.
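A hedged sketch of the rank-then-backward-eliminate loop follows. The paper's Rough Set based ranking is not reproduced; mutual information is used here purely as a stand-in criterion, and the breast-cancer data replaces the LD symptom dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
order = np.argsort(scores)            # least important feature first

selected = list(range(X.shape[1]))
best = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Backward elimination: drop the least significant remaining feature and keep
# the reduced subset whenever cross-validated accuracy does not degrade.
for f in order:
    trial = [i for i in selected if i != f]
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, trial], y, cv=5).mean()
    if acc >= best:
        selected, best = trial, acc

print(f"kept {len(selected)} of {X.shape[1]} features, CV accuracy {best:.3f}")
```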
Benchmarks for Evaluating Anomaly Based Intrusion Detection Solutions (IJNSA Journal)
Anomaly-based Intrusion Detection Systems (IDS) have gained increased popularity over time. Many anomaly-based systems using different Machine Learning (ML) algorithms and techniques have been proposed; however, there is no standard benchmark for comparing them on quantifiable measures. In this paper, we propose a benchmark that measures both accuracy and performance to produce objective metrics that can be used in the evaluation of each algorithm implementation. We then use this benchmark to compare the accuracy as well as the performance of four Anomaly-based IDS solutions based on different ML algorithms: Naive Bayes, Support Vector Machines, Neural Networks, and K-means Clustering. The benchmark evaluation is performed on the popular NSL-KDD dataset. The experimental results show the differences in accuracy and performance between these Anomaly-based IDS solutions on the dataset, and demonstrate how the benchmark can be used to create useful metrics for such comparisons.
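A minimal sketch of such a benchmark harness is shown below: each candidate model is trained and timed, and accuracy plus wall-clock time are reported. The scikit-learn estimators and the synthetic two-class data are illustrative assumptions; the paper's NSL-KDD preprocessing is not shown.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=4000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def benchmark(name, fit_predict):
    # Measure both accuracy and end-to-end (train + predict) time.
    start = time.perf_counter()
    y_pred = fit_predict()
    elapsed = time.perf_counter() - start
    print(f"{name:15s} acc={accuracy_score(y_te, y_pred):.3f} time={elapsed:.2f}s")

benchmark("Naive Bayes", lambda: GaussianNB().fit(X_tr, y_tr).predict(X_te))
benchmark("SVM", lambda: SVC().fit(X_tr, y_tr).predict(X_te))
benchmark("Neural Network",
          lambda: MLPClassifier(max_iter=300, random_state=0).fit(X_tr, y_tr).predict(X_te))

def kmeans_predict():
    # K-means is unsupervised: map each cluster to the majority training label.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
    label_of = {c: np.bincount(y_tr[km.labels_ == c]).argmax() for c in (0, 1)}
    return np.array([label_of[c] for c in km.predict(X_te)])

benchmark("K-means", kmeans_predict)
```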
The document discusses the differences between machine learning (ML), statistical learning, data mining (DM), and automated learning (AL). It argues that while ML and statistical learning developed similar techniques starting in the 1960s, DM emerged in the 1990s from a merging of database research and automated learning. However, industry was much more enthusiastic about adopting DM techniques compared to AL techniques, even though many DM systems are just friendly interfaces of AL systems. The document aims to explain the key differences between DM and AL that led to DM's greater commercial success.
Software Cost Estimation Using Clustering and Ranking Scheme (Editor IJMTER)
Software cost estimation is an important task in the software design and development process. Planning and budgeting are carried out with reference to the estimated software cost values. A variety of software properties are used in the cost estimation process, including hardware, product, technology and methodology factors. The quality of a software cost estimate is measured with reference to its accuracy.
Software cost estimation is carried out using three types of techniques: regression-based models, analogy-based models and machine learning models. Each model family has its own set of techniques for the cost estimation process. Eleven cost estimation techniques under these three categories are used in the system. The Attribute-Relation File Format (ARFF) is used to maintain the software product property values, and the ARFF file is the main input to the system.
The proposed system is designed to perform clustering and ranking of software cost estimation methods. A non-overlapping clustering technique is enhanced with an optimal centroid estimation mechanism. The system improves the accuracy of the clustering and ranking process and produces efficient ranking results for software cost estimation methods.
This document discusses cognitive automation of data science tasks. It proposes that a cognitive system would incorporate knowledge from various structured and unstructured sources, past experiences, and user interactions to guide the machine learning process. It provides examples of how such a system could reason about issues like overfitting and user preferences to select appropriate algorithms and configurations. Key challenges for building such a cognitive system include knowledge representation, knowledge acquisition from multiple sources, and performing probabilistic reasoning on the knowledge to guide the automation process.
1. The document discusses the process of preparing quantitative data for analysis, which includes editing data, handling blank responses, coding responses, categorizing variables, and entering data into software for analysis.
2. It then discusses objectives and methods for analyzing the data, including getting a feel for the data through descriptive statistics, testing the reliability and validity of measures, and testing hypotheses through appropriate statistical tests.
3. Finally, it recommends several software packages that can be used to facilitate data collection, entry, and analysis, and describes how expert systems can help choose the most appropriate statistical tests.
This document summarizes quantitative data analysis methods for hypothesis testing including measures of central tendency, variability, relative standing, and linear relationships. It also discusses data warehousing, data mining, and operations research techniques. Finally, it covers ethics and security considerations for handling information technology including protecting individual privacy and ensuring data accuracy.
There is an increasing interest in exploiting mobile sensing technologies and machine learning techniques for mental health monitoring and intervention. Researchers have effectively used contextual information, such as mobility, communication and mobile phone usage patterns, for quantifying individuals' mood and wellbeing. In this paper, we investigate the effectiveness of neural network models for predicting users' level of stress by using the location information collected by smartphones. We characterize the mobility patterns of individuals using the GPS metrics presented in the literature and employ these metrics as input to the network. We evaluate our approach on the open-source StudentLife dataset. Moreover, we discuss the challenges and trade-offs involved in building machine learning models for digital mental health and highlight potential future work in this direction.
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS (IJCI JOURNAL)
This paper predicts bankruptcy using different Machine Learning algorithms. Whether a company will go bankrupt is one of the most challenging questions to answer in the 21st century. Bankruptcy is defined as the final stage of failure for a firm: a company declares bankruptcy when, at that moment, it does not have enough funds to pay its creditors. It is a global problem. This paper provides a unique methodology to classify companies as bankrupt or healthy by applying predictive analytics. The prediction model stated in this paper yields better accuracy with the standard parameters used for bankruptcy prediction than previously applied prediction methodologies.
A Formal Machine Learning or Multi Objective Decision Making System for Deter... (Editor IJCATR)
Decision-making typically needs mechanisms to compromise among opposing criteria. When multiple objectives are involved in machine learning, a vital step is to relate the weights of the individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has often been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have some obvious limitations: when the determination of weights is treated as a single optimization problem, the result supporting such an optimization is limited, and it can even be unreliable when knowledge about the multiple objectives is incomplete, for example because of poor data. The constraints on weights are also discussed. Variable weights are natural in decision-making processes, so we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making and machine learning are analyzed, and the weights are determined with the help of a standard neural network.
Implementation of Prototype Based Credal Classification approach For Enhanced... (IRJET Journal)
The document discusses a prototype-based credal classification approach for classifying patterns with incomplete data. It aims to provide accurate and efficient classification results for patterns with missing attribute values. The approach uses hierarchical clustering to identify class prototypes from the training data. It then applies an evidential reasoning technique called credal classification to determine the most likely class for each pattern after imputing missing values. Experimental results show the proposed approach performs better in terms of time and memory efficiency compared to other classification methods for incomplete patterns.
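A rough sketch of the prototype idea described above follows, under stated assumptions: agglomerative (hierarchical) clustering within each class yields a few prototypes, missing attribute values are mean-imputed, and a test pattern is assigned to the class of its nearest prototype. The credal (evidential) combination step of the actual paper is not reproduced, and the wine data and the choice of three prototypes per class are arbitrary.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simulate missing values in the test patterns and impute them.
rng = np.random.default_rng(0)
X_te = np.where(rng.random(X_te.shape) < 0.2, np.nan, X_te)
X_te = SimpleImputer(strategy="mean").fit(X_tr).transform(X_te)

# Build a few prototypes per class via agglomerative clustering.
prototypes, proto_labels = [], []
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(Xc)
    for k in range(3):
        prototypes.append(Xc[labels == k].mean(axis=0))
        proto_labels.append(c)
prototypes, proto_labels = np.array(prototypes), np.array(proto_labels)

# Classify each test pattern by its nearest prototype.
dists = np.linalg.norm(X_te[:, None, :] - prototypes[None, :, :], axis=2)
y_pred = proto_labels[dists.argmin(axis=1)]
print("nearest-prototype accuracy:", (y_pred == y_te).mean())
```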
A survey of random forest based methods for... (Nikhil Sharma)
This document summarizes a research paper that surveys random forest based methods for intrusion detection systems. It begins with an introduction describing the increasing threats to information security with growing network and data usage. It then reviews 35 papers applying random forest techniques to intrusion detection and compares their approaches. These include using random forest for classification, feature selection, and clustering. The document concludes that while random forest methods generally perform well on imbalanced data like intrusion detection, open challenges remain around high data throughput, unlabeled data, and limited benchmark datasets.
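One recurring pattern from the surveyed papers, using a random forest both to rank features and to classify with the reduced set, can be sketched briefly; the synthetic imbalanced data below is only a stand-in, since the NSL-KDD/KDD'99 loading step is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced two-class data, loosely imitating attack-vs-normal traffic.
X, y = make_classification(n_samples=3000, n_features=40, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: rank features by random-forest importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:10]   # keep the 10 best

# Step 2: compare minority-class F1 with all features versus the reduced set.
full = cross_val_score(rf, X, y, cv=5, scoring="f1").mean()
reduced = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                          X[:, top], y, cv=5, scoring="f1").mean()
print(f"F1 all 40 features: {full:.3f}  F1 top-10 features: {reduced:.3f}")
```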
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ... (ijsc)
This document discusses challenges in effectively splitting a dataset into training and test sets for machine learning models. It proposes using k-means clustering followed by decision tree analysis to improve the split. K-means clustering groups the data points into clusters to ensure each cluster is well-represented in both the training and test sets. Then a decision tree is used to split the clustered data, aiming to maximize purity within each subset and minimize overlap between training and test data. This approach aims to capture the full domain of the dataset and avoid underrepresenting any parts of the data in either the training or test sets.
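A hedged sketch of the cluster-aware split described above: k-means groups the data, then the split is stratified on cluster membership so every cluster is represented in both the training and the test set. The blob data and the choice of five clusters are assumptions, and the paper's follow-on decision tree step is only hinted at through the cluster labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# Group the data points so each region of the domain is identified.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Stratify on the cluster assignments rather than the class labels, so the
# test set covers the full domain of the data.
X_tr, X_te, c_tr, c_te = train_test_split(
    X, clusters, test_size=0.2, stratify=clusters, random_state=0)

for k in range(5):
    print(f"cluster {k}: {np.mean(c_tr == k):.2f} of train, "
          f"{np.mean(c_te == k):.2f} of test")
```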
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION (IJDKP)
Many classification algorithms are available in the data mining literature for solving the same kind of problem, yet there is little guidance for recommending the most appropriate algorithm, the one that gives the best results for the dataset at hand. As a way of improving the chances of recommending the most appropriate classification algorithm for a dataset, this paper focuses on the different factors considered by data miners and researchers when selecting the classification algorithms that will yield the desired knowledge from the dataset at hand. The paper divides the factors affecting classification algorithm recommendation into business and technical factors. The proposed technical factors are measurable and can be exploited by recommendation software tools.
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA (IJCI JOURNAL)
This paper proposes a new method that aims to reduce the size of a high-dimensional dataset by identifying and removing irrelevant and redundant features. Dataset reduction is important in machine learning and data mining. A measure of dependence is used to evaluate the relationship between each feature and the target concept, and between features, for irrelevant and redundant feature removal. The proposed work first removes all irrelevant features; a minimum spanning tree over the relevant features is then constructed using Prim's algorithm. Splitting the minimum spanning tree based on the dependency between features produces a forest, and a representative feature from each tree of the forest is taken to form the final feature subset.
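A sketch of that pipeline under loose assumptions: the paper's dependence measure is replaced by absolute Pearson correlation, SciPy's minimum spanning tree routine stands in for Prim's algorithm, weak edges are cut to form a forest, and the feature most correlated with the target represents each group. The 0.5 threshold is arbitrary.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

dep = np.abs(np.corrcoef(X, rowvar=False))        # feature-feature dependence
relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n)])

# MST routines minimise total weight, so use (1 - dependence) as the edge cost.
mst = minimum_spanning_tree(1.0 - dep).toarray()

# Split the tree: remove edges whose dependence falls below a threshold.
edge = mst > 0
weak = edge & ((1.0 - mst) < 0.5)
mst[weak] = 0.0

# The remaining forest defines groups of mutually dependent features.
graph = ((mst + mst.T) > 0).astype(int)
n_groups, group = connected_components(graph, directed=False)

# Keep one representative per group: the feature most relevant to the target.
selected = [int(np.where(group == g)[0][np.argmax(relevance[group == g])])
            for g in range(n_groups)]
print(f"{n} features reduced to {len(selected)} representatives: {selected}")
```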
Unsupervised Feature Selection Based on the Distribution of Features Attribut... (Waqas Tariq)
Since dealing with high-dimensional data is computationally complex and sometimes even intractable, several feature reduction methods have recently been developed to reduce the dimensionality of the data and simplify analysis in applications such as text categorization, signal processing, image retrieval and gene expression analysis. Among feature reduction techniques, feature selection is one of the most popular methods because it preserves the original features. However, most current feature selection methods do not perform well on imbalanced data sets, which are pervasive in real-world applications. In this paper, we propose a new unsupervised feature selection method suited to imbalanced data sets, which removes redundant features from the original feature space based on the distribution of features. To show the effectiveness of the proposed method, popular feature selection methods have been implemented and compared. Experimental results on several imbalanced data sets derived from the UCI repository illustrate the effectiveness of our proposed method in comparison with the other methods in terms of both accuracy and the number of selected features.
Accounting for variance in machine learning benchmarks (Devansh16)
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51 times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
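A small sketch of the practice the paper recommends: repeat the comparison over several sources of variation (here only the data split and the model seed) and look at the spread of scores rather than a single run. This is a toy stand-in, not the authors' benchmarking framework.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def trial(model_cls, seed):
    # Vary both the train/test split and the model's own randomness.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    return model_cls(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)

for name, cls in [("logreg", LogisticRegression),
                  ("random forest", RandomForestClassifier)]:
    scores = np.array([trial(cls, s) for s in range(20)])
    print(f"{name:14s} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Comparing the two algorithms by their score distributions, rather than by one lucky seed each, is the spirit of the paper's recommendation.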
This document compares using genetic algorithm (GA) optimization with artificial neural networks (ANN) and support vector machines (SVM) for intrusion detection. It first describes ANN, SVM, and GA techniques. It then applies GA to optimize the feature selection and classification performed by ANN and SVM on the KDD Cup 99 intrusion detection dataset. The results show that GA improved the performance of both ANN and SVM classifiers, achieving 100% detection rates. Specifically, GA-ANN achieved the highest detection rate using the fewest number of features (100% detection using only 18 features), demonstrating GA's greater effectiveness at optimizing ANN compared to SVM.
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I... (Devansh16)
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that this synthetic data generation pipeline can be used as an alternative that bypasses privacy concerns and as a way to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since the SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
A review robot fault diagnosis part ii qualitative models and search strategi... (Siva Samy)
This document discusses qualitative models and search strategies used in fault diagnosis for industrial robots. It reviews two main types of diagnostic search strategies - topographic search, which uses a template of normal operation to identify faults, and symptomatic search, which looks for symptoms to direct the search to the fault location. It also outlines different forms of qualitative models, including causal models, abstraction hierarchies, fault trees, and qualitative physics, discussing their representation and advantages. The role of fault diagnosis for robot operations is highlighted, along with technical challenges to developing practical supervisory control systems.
An Empirical Comparison and Feature Reduction Performance Analysis of Intrusi... (ijctcm)
This document summarizes a study that empirically compares the performance of five machine learning algorithms (J48, BayesNet, OneR, NB, and ZeroR) for intrusion detection on the KDD Cup 99 dataset. The study evaluates the algorithms based on 10 performance criteria and finds that the J48 decision tree algorithm performs best for intrusion detection. It also compares the performance of intrusion detection classifiers using seven feature reduction techniques.
When deep learners change their mind: learning dynamics for active learning (Devansh16)
Abstract:
Active learning aims to select samples to be annotated that yield the largest performance improvement for the learning algorithm. Many methods approach this problem by measuring the informativeness of samples and do this based on the certainty of the network predictions for samples. However, it is well-known that neural networks are overly confident about their prediction and are therefore an untrustworthy source to assess sample informativeness. In this paper, we propose a new informativeness-based active learning method. Our measure is derived from the learning dynamics of a neural network. More precisely we track the label assignment of the unlabeled data pool during the training of the algorithm. We capture the learning dynamics with a metric called label-dispersion, which is low when the network consistently assigns the same label to the sample during the training of the network and high when the assigned label changes frequently. We show that label-dispersion is a promising predictor of the uncertainty of the network, and show on two benchmark datasets that an active learning algorithm based on label-dispersion obtains excellent results.
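A minimal sketch of the label-dispersion idea, assuming a scikit-learn MLP trained incrementally with partial_fit so the predicted label of each unlabeled sample can be recorded after every epoch. Dispersion is computed here as the fraction of epochs on which the prediction differed from the final one, which is one simple way to capture how often the assignment changed; the paper's exact metric may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled, unlabeled = np.arange(200), np.arange(200, 2000)

clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
history = []
for epoch in range(30):
    clf.partial_fit(X[labeled], y[labeled], classes=np.unique(y))
    history.append(clf.predict(X[unlabeled]))   # track label assignments
history = np.array(history)                      # shape (epochs, n_unlabeled)

# Dispersion: how often the assigned label changed during training.
final = history[-1]
dispersion = (history != final).mean(axis=0)

# Active learning step: query the most "undecided" samples first.
query = unlabeled[np.argsort(dispersion)[::-1][:10]]
print("indices to annotate next:", query)
```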
GROUP FUZZY TOPSIS METHODOLOGY IN COMPUTER SECURITY SOFTWARE SELECTION (ijfls)
This document presents a group fuzzy TOPSIS methodology for selecting antivirus software. It discusses the risks of malware and importance of antivirus software. It then reviews fuzzy sets, fuzzy numbers, TOPSIS, and group fuzzy TOPSIS decision making approaches. The document applies these techniques to evaluate 7 popular antivirus alternatives based on 7 criteria from expert opinions. Sensitivity analysis is conducted on the results to provide insights for different user needs.
The document discusses data analytics solutions involving machine learning and statistical modeling. It proposes splitting the solution into two parts: 1) applying algorithm techniques and statistical tests to data, and 2) making data-driven decisions using insights, metrics, and innovations. It then provides more details on machine learning techniques like training/testing data sets, and determining the optimal number of neurons in neural networks. Statistical modeling techniques like logistic regression, decision trees, and neural networks are recommended. The document emphasizes comparing different model results to identify ways to improve performance.
IRJET - An Overview of Machine Learning Algorithms for Data Science (IRJET Journal)
This document provides an overview of machine learning algorithms that are commonly used for data science. It discusses both supervised and unsupervised algorithms. For supervised algorithms, it describes decision trees, k-nearest neighbors, and linear regression. Decision trees create a hierarchical structure to classify data, k-nearest neighbors classifies new data based on similarity to existing data, and linear regression finds a linear relationship between variables. Unsupervised algorithms like clustering are also briefly mentioned. The document aims to familiarize data science enthusiasts with basic machine learning techniques.
1. To be successful as an artist, you need to focus on visibility, networking, and identifying goals.
2. Developing an artist statement, CV, and elevator pitch can help with applications and opportunities.
3. Organizing paperwork like applications, correspondence, and tax documents is important for managing your career.
This document contains details of Arushini Padayachee's qualifications, experience, and responsibilities in her role as a Civil Design Technologist at SiVEST Consulting Engineers. She has a Bachelor's Degree in Civil Engineering and over 5 years of experience designing various civil infrastructure projects including stormwater systems, sewer networks, roads, and bulk earthworks. Her duties involve design, drafting, administration, construction management, and providing cost estimates using software such as AutoCAD, Pipemate, and Roadmate. She has experience with a variety of project types and sizes across multiple sectors.
It is a globally observed fact that a 'school working day' makes students' faces shrink, while a 'school holiday' makes their faces cheerful. Does holiday simply mean jolly day?...
This document provides information about Delphi, a tier 1 automotive parts manufacturing company, and SAP systems used at Delphi. It discusses Delphi's 5 divisions that manufacture different automotive components. It also summarizes the global common template and non-global common templates used in Delphi's SAP systems, including different production systems, development environments, and SAP versions used. Finally, it mentions plans to consolidate various SAP systems under the main PN1 system and lists some functional and technical SAP streams.
Three sentences summarizing the document:
The document discusses the life of students living in boarding houses, covering both the positive and negative sides, from their independence, financial problems and eating patterns to negative issues such as an unrestrained social life. It also suggests ways for boarding students to take care of themselves.
When parents strive to give their children the best of everything at an early age, they are sowing seeds for materially insatiable monsters that are prone to sloth, apathy, avarice and fear.
Don't stand in self-defense just yet. I have proof. As I sit in my counselor's chair day after day, I encounter an altogether new disorder that I have come to label Parent-Induced Wastefulness (PIW).
- Choudhury Pritam Das is seeking a career opportunity and provides his contact information and career objective.
- He has a B.Tech degree from Ajay Binay Institute of Technology with over 6 years of experience in industrial automation.
- He has expertise in programming PLCs, SCADAs, HMIs, and other automation tools and has experience interfacing these systems with devices like sensors, valves and drives.
- He also provides details on his academic qualifications, projects, training, software skills, strengths and personal information to support his job application.
This document discusses a new technique called "hedging predictions" that provides quantitative measures of accuracy and reliability for predictions made by machine learning algorithms. Hedging predictions complement predictions from algorithms like support vector machines with measures that are provably valid under the assumption that training data is generated independently from the same probability distribution. Validity is achieved automatically, and the goal of hedged prediction is to produce accurate predictions by leveraging powerful machine learning techniques.
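As a loose illustration of what a hedged prediction can look like in practice, the sketch below uses split (inductive) conformal prediction with scikit-learn: a held-out calibration set turns nonconformity scores into prediction sets with a chosen coverage level, under the assumption of exchangeable data. This is a generic conformal recipe standing in for the article's construction; the dataset, the score, and the 90% level are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

clf = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

# Nonconformity score: 1 minus the probability assigned to the true class.
cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]

alpha = 0.1                                   # target 90% coverage
q = np.quantile(cal_scores,
                np.ceil((len(y_cal) + 1) * (1 - alpha)) / len(y_cal))

# Prediction set: every label whose nonconformity does not exceed the quantile.
test_scores = 1.0 - clf.predict_proba(X_te)
pred_sets = test_scores <= q
coverage = pred_sets[np.arange(len(y_te)), y_te].mean()
print(f"empirical coverage: {coverage:.2f}, "
      f"average set size: {pred_sets.sum(axis=1).mean():.2f}")
```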
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat... (CSCJournals)
Missing data are often encountered in data sets and a common problem for researchers in different fields of research. There are many reasons why observations may have missing values. For instance, some respondents may not report some of the items for some reason. The existence of missing data brings difficulties to the conduct of statistical analyses, especially when there is a large fraction of data which are missing. Many methods have been developed for dealing with missing data, numeric or categorical. The performances of imputation methods on missing data are key in choosing which imputation method to use. They are usually evaluated on how the missing data method performs for inference about target parameters based on a statistical model. One important parameter is the expected imputation accuracy rate, which, however, relies heavily on the assumptions of missing data type and the imputation methods. For instance, it may require that the missing data is missing completely at random. The goal of the current study was to develop a two-step algorithm to evaluate the performances of imputation methods for missing categorical data. The evaluation is based on the re-imputation accuracy rate (RIAR) introduced in the current work. A simulation study based on real data is conducted to demonstrate how the evaluation algorithm works.
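A rough sketch of the two-step idea, under stated assumptions: cells that are actually observed are masked a second time, re-imputed (here with simple mode imputation), and the fraction recovered exactly gives a re-imputation accuracy style score. The paper's precise RIAR definition and simulation design are not reproduced; the toy categorical data and helper names below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500, p=[0.5, 0.3, 0.2]),
    "size":  rng.choice(["S", "M", "L"], size=500),
})

def mode_impute(frame):
    # Fill each column's missing values with that column's most frequent value.
    return frame.apply(lambda col: col.fillna(col.mode()[0]))

def reimputation_accuracy(frame, impute, n_mask=200):
    # Step 1: hide a sample of observed cells.  Step 2: re-impute and score.
    masked = frame.copy()
    rows = rng.choice(len(frame), size=n_mask, replace=False)
    cols = rng.choice(frame.columns, size=n_mask)
    truth = [frame.at[r, c] for r, c in zip(rows, cols)]
    for r, c in zip(rows, cols):
        masked.at[r, c] = np.nan
    filled = impute(masked)
    return np.mean([filled.at[r, c] == t
                    for (r, c), t in zip(zip(rows, cols), truth)])

print("mode imputation re-imputation accuracy:",
      reimputation_accuracy(df, mode_impute))
```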
Machine Learning Approaches and its Challengesijcnes
Real world data sets considerably is not in a proper manner. They may lead to have incomplete or missing values. Identifying a missed attributes is a challenging task. To impute the missing data, data preprocessing has to be done. Data preprocessing is a data mining process to cleanse the data. Handling missing data is a crucial part in any data mining techniques. Major industries and many real time applications hardly worried about their data. Because loss of data leads the company growth goes down. For example, health care industry has many datas about the patient details. To diagnose the particular patient we need an exact data. If these exist missing attribute values means it is very difficult to retain the datas. Considering the drawback of missing values in the data mining process, many techniques and algorithms were implemented and many of them not so efficient. This paper tends to elaborate the various techniques and machine learning approaches in handling missing attribute values and made a comparative analysis to identify the efficient method.
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...ijiert bestjournal
The issue of incomplete data exists across the enti re field of data mining. In this paper,Mean Imputation,Median Imputation and Standard Dev iation Imputation are used to deal with challenges of incomplete data on classifi cation problems. By using different imputation methods converts incomplete dataset in t o the complete dataset. On complete dataset by applying the suitable Imputatio n Method and comparing the percentage error of Imputation Method and comparing the result
A study and survey on various progressive duplicate detection mechanismseSAT Journals
Abstract One of the serious problems faced in several applications with personal details management, customer affiliation management, data mining, etc is duplicate detection. This survey deals with the various duplicate record detection techniques in both small and large datasets. To detect the duplicity with less time of execution and also without disturbing the dataset quality, methods like Progressive Blocking and Progressive Neighborhood are used. Progressive sorted neighborhood method also called as PSNM is used in this model for finding or detecting the duplicate in a parallel approach. Progressive Blocking algorithm works on large datasets where finding duplication requires immense time. These algorithms are used to enhance duplicate detection system. The efficiency can be doubled over the conventional duplicate detection method using this algorithm. Severa
A SURVEY OF METHODS FOR HANDLING DISK DATA IMBALANCEIJCI JOURNAL
Class imbalance exists in many classification problems, and since the data is designed for accuracy,
imbalance in data classes can lead to classification challenges with a few classes having higher
misclassification costs. The Backblaze dataset, a widely used dataset related to hard discs, has a small
amount of failure data and a large amount of health data, which exhibits a serious class imbalance. This
paper provides a comprehensive overview of research in the field of imbalanced data classification. The
discussion is organized into three main aspects: data-level methods, algorithmic-level methods, and hybrid
methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas,
strengths, and weaknesses. Additionally, the challenges of unbalanced data classification are discussed,
along with strategies to address them. It is convenient for researchers to choose the appropriate method
according to their needs.
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...Editor IJCATR
In this paper we focus on some techniques for solving data mining tasks such as: Statistics, Decision Trees and Neural
Networks. The new approach has succeed in defining some new criteria for the evaluation process, and it has obtained valuable results
based on what the technique is, the environment of using each techniques, the advantages and disadvantages of each technique, the
consequences of choosing any of these techniques to extract hidden predictive information from large databases, and the methods of
implementation of each technique. Finally, the paper has presented some valuable recommendations in this field.
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IJDKP
This document summarizes an algorithm called Principal Component Outlier Detection (PrCmpOut) for identifying outliers in high-dimensional molecular descriptor datasets. PrCmpOut uses principal component analysis to transform the data into a lower-dimensional space, where it can more efficiently detect outliers using robust estimators of location and covariance. The properties of PrCmpOut are analyzed and compared to other robust outlier detection methods through simulation studies using a dataset of oxazoline and oxazole molecular descriptors. Numerical results show PrCmpOut performs well at outlier detection in high-dimensional data.
1. The document discusses machine learning and provides an overview of the seven steps of machine learning including gathering data, preparing data, choosing a model, training the model, evaluating the model, tuning hyperparameters, and making predictions.
2. It describes tips for data preparation such as exploring data for trends and issues, formatting data consistently, and handling missing values, outliers, and imbalanced data.
3. Techniques for outlier removal are discussed including clustering-based, nearest-neighbor based, density-based, graphical, and statistical approaches. Limitations and challenges of outlier removal are noted.
The document provides an overview of machine learning and discusses several key concepts:
1) It outlines the seven steps of machine learning including gathering data, preparing data, choosing a model, training the model, evaluating the model, tuning hyperparameters, and making predictions.
2) It discusses different machine learning tasks such as classification, regression, clustering, and dimensionality reduction. Common classification algorithms like decision trees, random forests, neural networks, and support vector machines are also covered.
3) The concepts of overfitting and underfitting are explained as well as techniques for preparing data, such as handling missing data, removing outliers, and feature engineering.
Top 20 Data Science Interview Questions and Answers in 2023.pdfAnanthReddy38
Here are the top 20 data science interview questions along with their answers:
What is data science?
Data science is an interdisciplinary field that involves extracting insights and knowledge from data using various scientific methods, algorithms, and tools.
What are the different steps involved in the data science process?
The data science process typically involves the following steps:
a. Problem formulation
b. Data collection
c. Data cleaning and preprocessing
d. Exploratory data analysis
e. Feature engineering
f. Model selection and training
g. Model evaluation and validation
h. Deployment and monitoring
What is the difference between supervised and unsupervised learning?
Supervised learning involves training a model on labeled data, where the target variable is known, to make predictions or classify new instances. Unsupervised learning, on the other hand, deals with unlabeled data and aims to discover patterns, relationships, or structures within the data.
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns the training data too well, resulting in poor generalization to new, unseen data. To prevent overfitting, techniques like cross-validation, regularization, and early stopping can be employed.
What is feature engineering?
Feature engineering involves creating new features from the existing data that can improve the performance of machine learning models. It includes techniques like feature extraction, transformation, scaling, and selection.
Explain the concept of cross-validation.
Cross-validation is a resampling technique used to assess the performance of a model on unseen data. It involves partitioning the available data into multiple subsets, training the model on some subsets, and evaluating it on the remaining subset. Common types of cross-validation include k-fold cross-validation and holdout validation.
What is the purpose of regularization in machine learning?
Regularization is used to prevent overfitting by adding a penalty term to the loss function during model training. It discourages complex models and promotes simpler ones, ultimately improving generalization performance.
What is the difference between precision and recall?
Precision is the ratio of true positives to the total predicted positives, while recall is the ratio of true positives to the total actual positives. Precision measures the accuracy of positive predictions, whereas recall measures the coverage of positive instances.
Explain the term “bias-variance tradeoff.”
The bias-variance tradeoff refers to the relationship between a model’s bias (error due to oversimplification) and variance (error due to sensitivity to fluctuations in the training data). Increasing model complexity reduces bias but increases variance, and vice versa. The goal is to find the right balance that minimizes overall error.
Running Head Data Mining in The Cloud .docxhealdkathaleen
Running Head: Data Mining in The Cloud 1
Data Mining in The Cloud 13
Data Mining in The Cloud
Student’s Name:
Institution:
Instructor:
Big data mining on the cloud
Big data mining techniques
Abstract.
Management and analysis of data is becoming a nightmare in every organization day by day. This is because there is flooding of data. This data can only be analyzed by using Information Governance and big data mining techniques. This paper aims to look at some of the big data mining techniques which can be used to analyze data in organizations with flooding of data. It will also show how information governance support big data. The paper begins with an overview of data mining, narrows down to the big data mining techniques and then finally the ways in which Information governance support big data.
Introduction
Data mining is the way toward looking at tremendous amounts of information so as to make a factually likely expectation. Data mining can be utilized, for example, to recognize when high going through clients connect with your business, to figure out which advancements succeed, or investigate the effect of the climate on your business. Information mining standards have been around for a long time related to information distribution centers, and have now taken on more noteworthy pervasiveness with the appearance of Enormous Information. Information examination and the development in both organized and unstructured information has likewise incited information mining strategies to change, since organizations are currently managing bigger informational collections with progressively fluctuated substance (Khan, Anjum, Soomro and Tahir, 2015). Also, man-made brainpower and AI are mechanizing the procedure of data mining.
Despite the methods applied, data mining involves three steps. These steps include exploration, modelling and deployment. The data must first be prepared and sorted out to is needed and what is not needed. This helps one to do away with useless data or even duplicates and ensuring that the final data that is sampled is the only one that is crucial and needed the most. Creating the statistical models with the aim of determining the one which will give the best and most accurate forecasting. This however can consume a lot of time as there are various and different models to the same data set which is applied severally to the sets of data respectively and finally analysis of data should be done. Lastly, in the last step, the model has to be tested against the old and the current data (Milani & Navimipour, 2017). This helps an individual to determine the results which he or she should expect in future.
Big data mining techniques
Data mining is a very significant and effective method when proper techniques are ap ...
This document discusses random forest machine learning algorithms and their use in predictive modeling. It provides context on random forests, including that they perform well for both classification and regression tasks, are less prone to overfitting than decision trees, and provide good predictive accuracy while also being interpretable. The document then discusses preprocessing methods like stemming, removing punctuation and stop words that can be applied before using natural language processing algorithms. It highlights the advantages of random forests, such as their ability to handle different data types, parallelizability, and stability. It also notes limitations like lack of interpretability for some users and potential for overfitting on some data sets.
T OWARDS A S YSTEM D YNAMICS M ODELING M E- THOD B ASED ON DEMATELijcsit
This document proposes a new method for constructing system dynamics models that combines the Decision Making Trial and Evaluation Laboratory (DEMATEL) technique with system dynamics modeling. DEMATEL is first used to systematically define and quantify causal relationships between variables in a system. The results from DEMATEL, including impact relation maps and a total influence matrix, are then used to derive the causal loop diagram and define variable weights in the stock-flow chart equations of the system dynamics model. This combined method aims to overcome limitations and subjectivity in traditional system dynamics modeling that relies solely on decision makers' mental models.
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Jin Young Kim
검색 및 추천 시스템의 사회적 역할이 커지면서, 그 결과의 공정성 역시 최근 관심사로 대두되었다. 본 발표에서는 검색 및 추천시스템의 공정성 이슈 및 그 해법을 다룬다. 공정한 검색 및 추천 결과를 정의하는 다양한 방법, 공정성의 결여가 미치는 자원 배분 및 스테레오타이핑 문제, 그리고 검색 및 추천시스템 개발의 각 단계별로 어떤 해결책이 있는지를 최신 연구 중심으로 살펴본다. 마지막으로 실제 공정한 시스템 개발을 위한 실무적인 고려사항을 다룬다.
New hybrid ensemble method for anomaly detection in data science IJECEIAES
Anomaly detection is a significant research area in data science. Anomaly detection is used to find unusual points or uncommon events in data streams. It is gaining popularity not only in the business world but also in different of other fields, such as cyber security, fraud detection for financial systems, and healthcare. Detecting anomalies could be useful to find new knowledge in the data. This study aims to build an effective model to protect the data from these anomalies. We propose a new hyper ensemble machine learning method that combines the predictions from two methodologies the outcomes of isolation forest-k-means and random forest using a voting majority. Several available datasets, including KDD Cup-99, Credit Card, Wisconsin Prognosis Breast Cancer (WPBC), Forest Cover, and Pima, were used to evaluate the proposed method. The experimental results exhibit that our proposed model gives the highest realization in terms of receiver operating characteristic performance, accuracy, precision, and recall. Our approach is more efficient in detecting anomalies than other approaches. The highest accuracy rate achieved is 99.9%, compared to accuracy without a voting method, which achieves 97%.
A Novel Performance Measure For Machine Learning ClassificationKarin Faust
This document provides a comprehensive overview of performance measures used to evaluate machine learning classification models. It discusses measures based on confusion matrices (e.g. accuracy, recall, precision), probability deviations (e.g. mean absolute error, brier score, cross entropy), and discriminatory power (e.g. ROC curves, AUC). The document highlights limitations of individual measures and proposes a framework to construct a new composite measure based on voting results from existing measures to better evaluate model performance.
A NOVEL PERFORMANCE MEASURE FOR MACHINE LEARNING CLASSIFICATIONIJMIT JOURNAL
Machine learning models have been widely used in numerous classification problems and performance measures play a critical role in machine learning model development, selection, and evaluation. This paper covers a comprehensive overview of performance measures in machine learning classification. Besides, we proposed a framework to construct a novel evaluation metric that is based on the voting results of three performance measures, each of which has strengths and limitations. The new metric can be proved better than accuracy in terms of consistency and discriminancy.
A Novel Performance Measure for Machine Learning ClassificationIJMIT JOURNAL
Machine learning models have been widely used in numerous classification problems and performance measures play a critical role in machine learning model development, selection, and evaluation. This paper covers a comprehensive overview of performance measures in machine learning classification. Besides, we proposed a framework to construct a novel evaluation metric that is based on the voting results of three performance measures, each of which has strengths and limitations. The new metric can be proved better than accuracy in terms of consistency and discriminancy.
Correlation of artificial neural network classification and nfrs attribute fi...eSAT Journals
Abstract
Mostly 5 to 15% of the women in the stage of reproduction face the disease called Polycystic Ovarian Syndrome (PCOS) which is the multifaceted, heterogeneous and complex. The long term consequences diseases like endometrial hyperplasia, type 2 diabetes mellitus and coronary disease are caused by the polycystic ovaries, chronic anovulation and hyperandrogenism are characterized with the resistance of insulin and the hypertension, abdominal obesity and dyslipidemia and hyperinsulinemia are called as Metabolic syndrome (frequent metabolic traits) The above cause the common disease called Anovulatory infertility. Computer based information along with advanced Data mining techniques are used for appropriate results. Classification is a classic data mining task, with roots in machine learning. Naïve Bayesian, Artificial Neural Network, Decision Tree, Support Vector Machines are the classification tasks in the data mining. Feature selection methods involve generation of the subset, evaluation of each subset, criteria for stopping the search and validation procedures. The characteristics of the search method used are important with respect to the time efficiency of the feature selection methods. PCA (Principle Component Analysis), Information gain Subset Evaluation, Fuzzy rough set evaluation, Correlation based Feature Selection (CFS) are some of the feature selection techniques, greedy first search, ranker etc are the search algorithms that are used in the feature selection. In this paper, a new algorithm which is based on Fuzzy neural subset evaluation and artificial neural network is proposed which reduces the task of classification and feature selection separately. This algorithm combines the neural fuzzy rough subset evaluation and artificial neural network together for the better performance than doing the tasks separately.
Keywords: ANN, SVM, PCA, CFS
Este documento analiza el modelo de negocio de YouTube. Explica que YouTube y otros sitios de video online representan un nuevo modelo de negocio para contenidos audiovisuales debido al cambio en los hábitos de consumo causado por las nuevas tecnologías. Describe cómo YouTube aprovecha la participación de los usuarios para mejorar continuamente y atraer una audiencia diferente a la de los medios tradicionales.
The defense was successful in portraying Michael Jackson favorably to the jury in several ways:
1) They dressed Jackson in ornate costumes that conveyed images of purity, innocence, and humility.
2) Jackson was shown entering the courtroom as if on a red carpet, emphasizing his celebrity status.
3) Jackson appeared vulnerable, childlike, and in declining health during the trial, eliciting sympathy from jurors.
4) Defense attorney Tom Mesereau effectively presented a coherent narrative of Jackson as a victim and portrayed Neverland as a place of refuge, undermining the prosecution's arguments.
Michael Jackson was born in 1958 in Gary, Indiana and rose to fame in the 1960s as the lead singer of The Jackson 5, topping music charts in the 1970s. As a solo artist in the 1980s, his album Thriller broke music records. In the 1990s and 2000s, Jackson faced several legal issues related to child abuse allegations while continuing to release music. He married Lisa Marie Presley and Debbie Rowe and had two children before his death in 2009.
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
This document appears to be a list of popular books from various authors. It includes over 150 book titles across many genres such as fiction, non-fiction, memoirs, and novels. The books cover a wide range of topics from politics to cooking to autobiographies.
The prosecution lost the Michael Jackson trial due to several key mistakes and weaknesses in their case:
1) The lead prosecutor, Thomas Sneddon, was too personally invested in the case against Jackson, having pursued him for over a decade without success.
2) Sneddon's opening statement was disorganized and weak, failing to effectively outline the prosecution's case.
3) The accuser's mother was not credible and damaged the prosecution's case through her erratic testimony, history of lies and con artist behavior.
4) Many prosecution witnesses were not credible due to prior lawsuits against Jackson, debts owed to him, or having been fired by him. Several witnesses even took the Fifth Amendment.
Here are three examples of public relations from around the world:
1. The UK government's "Be Clear on Cancer" campaign which aims to raise awareness of cancer symptoms and encourage early diagnosis.
2. Samsung's global brand marketing and sponsorship activities which aim to increase brand awareness and favorability of Samsung products worldwide.
3. The Brazilian government's efforts to improve its international image and relations with other countries through strategic communication and diplomacy.
The three most important functions of public relations are:
1. Media relations because the media is how most organizations reach their key audiences. Strong media relationships are crucial.
2. Writing, because written communication is at the core of public relations and how most information is
Michael Jackson Please Wait... provides biographical information about Michael Jackson including his birthdate, birthplace, parents, height, interests, idols, favorite foods, films, and more. It discusses his background, career highlights including influential albums like Thriller, and films he appeared in such as The Wiz and Moonwalker. The document contains photos and details about Jackson's life and illustrious music career.
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
The document discusses the process of manufacturing celebrity and its negative byproducts. It argues that celebrities are rarely the best in their individual pursuits like singing, dancing, etc. but become famous due to being products of a system controlled by wealthy elites. This system stifles opportunities for worthy artists and creates feudalism. The document also asserts that manufactured celebrities should not be viewed as role models due to behaviors like drug abuse and narcissism that result from the celebrity-making process.
Michael Jackson was a child star who rose to fame with the Jackson 5 in the late 1960s and early 1970s. As a solo artist in the 1970s and 1980s, he had immense commercial success with albums like Off the Wall, Thriller, and Bad, which featured hit singles and groundbreaking music videos. However, his career and public image were plagued by controversies related to allegations of child sexual abuse in the 1990s and 2000s. He continued recording and performing but faced ongoing media scrutiny into his private life until his death in 2009.
Social Networks: Twitter Facebook SL - Slide 1butest
The document discusses using social networking tools like Twitter and Facebook in K-12 education. Twitter allows students and teachers to share short updates and can be used to give parents a window into classroom activities. Facebook allows targeted advertising that could be used to promote educational activities. Both tools could help facilitate communication between schools and communities if used properly while managing privacy and security concerns.
Facebook has over 300 million active users who log on daily, and allows brands to create public profile pages to interact with users. Pages are for brands and organizations only, while groups can be made by any user about any topic. Pages do not show admin names and have no limits on fans, while groups display admin names and are limited to 5,000 members. Content on pages should aim to provoke action from subscribers and establish a regular posting schedule using a conversational tone.
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Hare Chevrolet is a car dealership located in Noblesville, Indiana that has successfully used social media platforms like Twitter, Facebook, and YouTube to create a positive brand image. They invest significant time interacting directly with customers online to foster a sense of community rather than overtly advertising. As a result, Hare Chevrolet has built a large, engaged audience on social media and serves as a model for how brands can use online presences strategically.
Welcome to the Dougherty County Public Library's Facebook and ...butest
This document provides instructions for signing up for Facebook and Twitter accounts. It outlines the sign up process for both platforms, including filling out forms with name, email, password and other details. It describes how the platforms will then search for friends and suggest people to connect with. It also explains how to search for and follow the Dougherty County Public Library page on both Facebook and Twitter once signed up. The document concludes by thanking participants and providing a contact for any additional questions.
Paragon Software announces the release of Paragon NTFS for Mac OS X 8.0, which provides full read and write access to NTFS partitions on Macs. It is the fastest NTFS driver on the market, achieving speeds comparable to native Mac file systems. Paragon NTFS for Mac 8.0 fully supports the latest Mac OS X Snow Leopard operating system in 64-bit mode and allows easy transfer of files between Windows and Mac partitions without additional hardware or software.
This document provides compatibility information for Olympus digital products used with Macintosh OS X. It lists various digital cameras, photo printers, voice recorders, and accessories along with their connection type and any notes on compatibility. Some products require booting into OS 9.1 for software compatibility or do not support devices that need a serial port. Drivers and software are available for download from Olympus and other websites for many products to enable use with OS X.
To use printers managed by the university's Information Technology Services (ITS), students and faculty must install the ITS Remote Printing software on their Mac OS X computer. This allows them to add network printers, log in with their ITS account credentials, and print documents while being charged per page to funds in their pre-paid ITS account. The document provides step-by-step instructions for installing the software, adding a network printer, and printing to that printer from any internet connection on or off campus. It also explains the pay-in-advance printing payment system and how to check printing charges.
The document provides an overview of the Mac OS X user interface for beginners, including descriptions of the desktop, login screen, desktop elements like the dock and hard disk, and how to perform common tasks like opening files and folders. It also addresses frequently asked questions for Windows users switching to Mac OS X, such as where documents are stored, how to save or find documents, and what the equivalent of the C: drive is in Mac OS X. The document concludes with sections on file management tasks like creating and deleting folders, organizing files within applications, using Spotlight search, and an overview of the Dashboard feature.
This document provides a checklist for securing Mac OS X version 10.5, focusing on hardening the operating system, securing user accounts and administrator accounts, enabling file encryption and permissions, implementing intrusion detection, and maintaining password security. It describes the Unix infrastructure and security framework that Mac OS X is built on, leveraging open source software and following the Common Data Security Architecture model. The checklist can be used to audit a system or harden it against security threats.
This document summarizes a course on web design that was piloted in the summer of 2003. The course was a 3 credit course that met 4 times a week for lectures and labs. It covered topics such as XHTML, CSS, JavaScript, Photoshop, and building a basic website. 18 students from various majors enrolled. Student and instructor evaluations found the course to be very successful overall, though some improvements were suggested like ensuring proper software and pairing programming/non-programming students. The document also discusses implications of incorporating web design material into existing computer science curriculums.
An Empirical Comparison of Techniques for Handling Incomplete Data When Using Decision Trees
BHEKISIPHO TWALA,
Brunel Software Engineering Research Centre,
School of Information Systems, Computing and Mathematics,
Brunel University, Uxbridge, Middlesex UB8 3PH
___________________________________________________________________________________
OBJECTIVE: Increasing the awareness of how incomplete data affects learning and classification
accuracy has led to increasing numbers of missing data techniques. This paper investigates the
robustness and accuracy of seven popular techniques for tolerating incomplete training and test data on
a single attribute and on all attributes for different proportions and mechanisms of missing data on
resulting tree-based models.
METHOD: The seven missing data techniques were compared by artificially simulating different
proportions, patterns, and mechanisms of missing data using twenty one complete (i.e. with no missing
values) datasets obtained from the UCI repository of machine learning databases. A 4-way repeated
measures design was employed to analyze the data.
RESULTS: The simulation results suggest important differences. All methods have their strengths and
weaknesses. However, listwise deletion is substantially inferior to the other six techniques while
multiple imputation, which utilizes the Expectation Maximization algorithm, represents a superior
approach to handling incomplete data. Decision tree single imputation and surrogate variables splitting
are more severely impacted by missing values distributed among all attributes compared to when they
are only on a single attribute. Otherwise, the imputation-based versus model-based procedures gave
reasonably good results, although some discrepancies remained.
CONCLUSIONS: Different techniques for addressing missing values when using decision trees can
give substantially diverse results, and must be carefully considered to protect against biases and
spurious findings. Multiple imputation should always be used, especially if the data contain many
missing values. If few values are missing, any of the missing data techniques might be considered. The
choice of technique should be guided by the proportion, pattern and mechanisms of missing data.
However, the use of older techniques like listwise deletion and mean or mode single imputation is no
longer justifiable given the accessibility and ease of use of more advanced techniques such as multiple
imputation.
Keywords: incomplete data, machine learning, decision trees, classification accuracy
___________________________________________________________________________________
1. INTRODUCTION
Machine learning (ML) algorithms have proven to be of great practical value in a variety of application
domains. Unfortunately, such algorithms generally operate in environments that are fraught with
imperfections. One form of imperfection is incompleteness of data. This is currently an issue faced by
machine learning and statistical pattern recognition researchers who use real-world databases. One
primary concern of classifier learning is prediction accuracy. Handling incomplete data (data unavailable
or unobserved for any number of reasons) is an important issue for classifier learning since incomplete
data in either the training data or test (unknown) data may not only impact interpretations of the data or
the models created from the data but may also affect the prediction accuracy of learned classifiers. Rates
of less than 1% missing data are generally considered trivial, 1-5% manageable. However, 5-15% require
sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation
[Pyle, 1999].
There are two common solutions to the problem of incomplete data that are currently applied by
researchers. The first is omitting the instances having missing values (i.e. listwise deletion),
which not only seriously reduces the sample sizes available for analysis but also ignores the
mechanism causing the missingness. The problem with a smaller sample size is that it gives greater
possibility of a non-significant result, i.e., the larger the sample the greater the statistical power of the
test. The second solution imputes (or estimates) missing values from the existing data. The major
weakness of single imputation methods is that they underestimate uncertainty and so yield invalid tests
and confidence intervals, since the estimated values are derived from the ones actually present [Little
and Rubin, 1987].
The two most common tasks when dealing with missing values, and thus when choosing a missing data
technique, are to investigate the pattern and mechanism of missingness to get an idea of the process that
could have generated the missing data, and to produce sound estimates of the parameters of interest,
despite the fact that the data are incomplete. In other words, the potential impact missing data can have
is dependent on the pattern and mechanism leading to the nonresponse. In addition, the choice of how to
deal with missing data should also be based on the percentage of data that are missing and the size of
the sample.
Robustness has twofold meaning in terms of dealing with missing values when using decision trees.
The toleration of missing values in training data is one, and the toleration of missing data in test data is
the other. Although the problem of incomplete data has been treated adequately in various real world
datasets, there are rather few published works or empirical studies concerning the task of assessing
learning and classification accuracy of missing data techniques (MDTs) using supervised ML
algorithms such as decision trees [Breiman et al., 1984; Quinlan, 1993].
The following section briefly discusses missing data patterns and mechanisms that lead to the
introduction of missing values in datasets. Section 3 presents details of seven MDTs that are used in this
paper. Section 4 empirically evaluates the robustness and accuracy of the seven MDTs on twenty one
machine learning domains. We close with a discussion and conclusions, and then directions for future
research.
2. PATTERNS AND MECHANISMS OF MISSING DATA
The pattern simply defines which values in the data set are observed and which are missing. The three
most common patterns of nonresponse in data are univariate, monotonic and arbitrary. When missing
values are confined to a single variable we have a univariate pattern; a monotonic pattern occurs when,
if a variable, say Y_j, is missing, then the subsequent variables Y_{j+1}, ..., Y_p are missing as well,
or when the data matrix can be divided into observed and missing parts with a “staircase” line dividing
them; arbitrary patterns occur when any set of variables may be missing for any unit.
Identifying the mechanism generating the missing values seems to be the most important task, since it
informs how the missing values could be estimated more efficiently. If data are missing completely at random (MCAR) or
missing at random (MAR), we say that missingness is ignorable. For example, suppose that you are
modelling software defects as a function of development time. If missingness is not related to the missing
values of defect rate itself and also not related to the values of development time, such data are considered
to be MCAR. For example, there may be no particular reason why some project managers told you their
defect rates and others did not. Furthermore, software defects may not be identified or detected due to a
given specific development time. Such data are considered to be MAR. MAR essentially says that the
cause of missing data (software defects) may be dependent on the observed data (development time) but
must be independent of the missing value that would have been observed. It is a less restrictive model
than MCAR, which says that the missing data cannot be dependent on either the observed or the missing
data. MAR is also a more realistic assumption for data to meet, but not always tenable. The more relevant
and related attributes one can include in statistical models, the more likely it is that the MAR assumption
will be met. For data that are informatively missing (IM) or not missing at random (NMAR), the
mechanism is not only non-random and not predictable from the other variables in the dataset but also cannot
be ignored, i.e., we have non-ignorable missingness [Little and Rubin, 1987; Schafer, 1997]. In contrast
to the MAR condition outlined above, IM arise when the probability that defect rate is missing depends
on the unobserved value of defect rate itself. For example, software project managers may be less likely
to reveal projects with high defect rates. Since the pattern of IM data is not random, it is not amenable to
common MDTs and there are no statistical means to alleviate the problem.
MCAR is the most restrictive of the three conditions and in practice it is usually difficult to meet the
MCAR assumption. Generally you can test whether MCAR conditions can be met by comparing the
distribution of the observed data between the respondents and non-respondents. In other words, data can
provide evidence against MCAR. However, data cannot generally distinguish between MAR and IM
without distributional assumptions, unless the mechanism is well understood. For example, right
censoring (or suspensions) is IM but is in some sense known. An item or unit that is removed from a
reliability test prior to failure, or that is still operating in the field at the time its reliability is
to be determined, is called a suspended item or right-censored instance.
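To make the three mechanisms concrete, the following minimal sketch simulates MCAR, MAR and IM/NMAR missingness on a synthetic defect-rate variable, continuing the software defects versus development time example above. The variable names, sample size and cut-off values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
dev_time = rng.normal(10, 2, n)                        # observed covariate
defects = 2 + 0.5 * dev_time + rng.normal(0, 1, n)     # variable to receive missing values

# MCAR: missingness independent of both observed and unobserved values
mcar_mask = rng.random(n) < 0.2

# MAR: probability of missingness depends only on the observed covariate
mar_mask = rng.random(n) < 1 / (1 + np.exp(-(dev_time - 12)))

# IM/NMAR: probability depends on the (unobserved) value itself,
# e.g. high defect rates are less likely to be reported
nmar_mask = rng.random(n) < 1 / (1 + np.exp(-(defects - 9)))

defects_mcar = np.where(mcar_mask, np.nan, defects)
defects_mar  = np.where(mar_mask,  np.nan, defects)
defects_nmar = np.where(nmar_mask, np.nan, defects)
```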
3. DECISION TREES AND MISSING DATA TECHNIQUES
Decision trees (DTs) are one of the most popular approaches for both classification and regression type
predictions. A DT is a classifier organised in a tree structure and built from a set of splitting rules. A leaf
node holds the outcome (class) assigned to the instances that reach it, while a decision node tests an attribute
and branches on each possible outcome of that test. One approach to create
a DT is to use the entropy, which is a fundamental quantity in information theory. The entropy value
determines the level of uncertainty. The degree of uncertainty is related to the success rate of predicting
the result. Often the training dataset used for constructing a DT may not be a proper representative of the
real-life situation and may contain noise and the DT is said to over-fit the training data. To overcome the
over-fitting problem DTs use a pruning strategy that minimizes the output variable variance in the
validation data by not only selecting a simpler tree than the one obtained when the tree building
algorithm stopped, but one that is equally as accurate for predicting or classifying "new" instances.
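As an illustration of the entropy-based splitting criterion mentioned above, the following minimal sketch (not taken from the paper) computes the entropy of a class distribution and the information gain of a candidate categorical split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Reduction in entropy obtained by splitting on a categorical attribute."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# toy example: a perfectly informative attribute recovers the full class entropy
labels = ['yes', 'yes', 'no', 'no']
print(information_gain(labels, ['a', 'a', 'b', 'b']))  # 1.0 bit
```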
Several methods have been proposed in the literature to treat missing data when using DTs. Missing
values can cause problems at two points when using DTs; 1) when deciding on a splitting point (when
growing the tree), and 2) when deciding into which daughter node each instance goes (when classifying
an unknown instance). Methods for taking advantage of unlabelled classes can also be developed,
although we do not deal with them in this paper, i.e., we are assuming that the class labels are not
missing.
This section describes several MDTs that have been proposed in the literature to treat missing data
when using DTs. These techniques are also the ones used in the simulation study in the next section.
They are divided into three categories: ignoring and discarding data, imputation and machine learning.
3.1 Ignoring and Discarding Missing Data
Over the years, the most common approach to dealing with missing data has been to pretend there are
no missing data. The method used when you simply omit any instances that are missing data on the
relevant attributes and go through with the statistical analysis with only complete data is called
complete-case analysis or listwise deletion (LD). Due to its simplicity and ease of use, in many
statistical packages, LD is the default analysis. LD is also based on the assumption that data are MCAR.
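In code, listwise deletion simply amounts to keeping the complete cases; a minimal pandas sketch on toy data (purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, np.nan, 3.0],
                   "x2": [4.0, 5.0, np.nan],
                   "y":  [0, 1, 0]})
complete_cases = df.dropna()   # listwise deletion: only fully observed rows remain
print(complete_cases)          # keeps just the first row in this toy example
```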
3.2 Imputation
Imputation methods involve replacing missing values with estimated ones based on information
available in the dataset. Imputation methods can be divided into single and multiple imputation
methods. In single imputation the missing value is replaced with one imputed value, and in multiple
imputation, several values are used. Most imputation procedures for missing data are single imputation.
In the following section we briefly describe how each of the imputation techniques works.
3.2.1 Single Imputation Techniques
3.2.1.1 Decision Tree Approach
One approach suggested by Shapiro and described by Quinlan (1993) is to use a decision tree approach
to impute the missing values of an attribute. Let S be the training set and X1 an attribute with missing
values. The method considers the subset S' of S containing only the instances where the attribute X1 is known.
In S' the original class is regarded as another attribute, while the value of X1 becomes the class to be
determined. A classification tree is built using S' for predicting the value of X1 from the other
attributes and the class. Then the tree is used to fill in the missing values. Decision tree single imputation
(DTSI), which can also be considered as a ML technique, is suitable for domains in which strong
relation between attributes exist, especially between the class attribute and the attributes with unknown
or missing values.
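A minimal sketch of this decision tree single imputation idea is given below, assuming scikit-learn is available. The function name and the one-hot encoding step are illustrative assumptions; for simplicity the other predictors are taken to be categorical or already complete.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def dtsi_impute(df: pd.DataFrame, target_col: str) -> pd.Series:
    """Decision tree single imputation for one attribute.

    The class attribute is simply left among the predictors, as in the
    description above."""
    predictors = [c for c in df.columns if c != target_col]
    known = df[df[target_col].notna()]      # S': instances where target_col is observed
    unknown = df[df[target_col].isna()]
    # one-hot encode predictors so the tree can handle categorical attributes
    X_known = pd.get_dummies(known[predictors])
    X_unknown = pd.get_dummies(unknown[predictors]).reindex(
        columns=X_known.columns, fill_value=0)
    tree = DecisionTreeClassifier().fit(X_known, known[target_col])
    filled = df[target_col].copy()
    filled.loc[unknown.index] = tree.predict(X_unknown)
    return filled
```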
3.2.1.2 Expectation Maximization
In brief, expectation maximization (EM) is an iterative procedure where a complete dataset is created
by filling-in (imputing) one or more plausible values for the missing data by repeating the following
steps: 1.) In the E-step, one reads in the data, one instance at a time. As each case is read in, one adds to
the calculation of the sufficient statistics (sums, sums of squares, sums of cross products). If values
are observed for the instance, they contribute to these sums directly. If a variable is missing for
the instance, then the best guess is used in place of the missing value. 2.) In the M-step, once all the
sums have been collected, the covariance matrix can simply be calculated. This two step process
continues until the change in covariance matrix from one iteration to the next becomes trivially small.
Details of the EM algorithm for covariance matrices are given in [Dempster et al., 1977; Little and
Rubin, 1987]. EM requires that data are MAR. As mentioned earlier, the EM algorithm (and its
simulation-based variants) could be utilised to impute only a single value for each missing value, which
from now on we shall call EM single imputation (EMSI). The single imputations are drawn from the
predictive distribution of the missing data given the observed data and the EM estimates for the model
parameters. A DT is then grown using the complete dataset. The tree obtained depends on the values
imputed.
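A simplified numerical sketch of this E-step/M-step cycle for continuous attributes is shown below. It is only an approximation of EM single imputation: the correction term that the full algorithm adds to the sufficient statistics for the conditional covariance of the missing entries is omitted, and the function name is illustrative.

```python
import numpy as np

def em_impute(X, max_iter=50, tol=1e-6):
    """Simplified EM-style imputation for a numeric matrix X (np.nan = missing).

    E-step (simplified): replace each missing entry with its conditional mean
    given the observed entries of that row and the current mean/covariance.
    M-step: re-estimate the mean vector and covariance matrix from the
    completed data."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)           # start from mean single imputation
    X_filled = np.where(missing, col_means, X)

    for _ in range(max_iter):
        mu = X_filled.mean(axis=0)
        sigma = np.cov(X_filled, rowvar=False)
        X_new = X_filled.copy()
        for i in range(X.shape[0]):
            m = missing[i]
            o = ~m
            if not m.any() or not o.any():
                continue
            # conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            X_new[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X_filled[i, o] - mu[o])
        if np.max(np.abs(X_new - X_filled)) < tol:
            X_filled = X_new
            break
        X_filled = X_new
    return X_filled
```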
3.2.1.3 Mean or Mode
Mean or mode single imputation (MMSI) is one of the most common and simplest methods of
imputing missing values. In MMSI, whenever a value is missing for one instance on a particular
attribute, the mean (for a continuous or numerical attribute) or modal value (for a nominal or categorical
attribute), based on all non-missing instances, is used in place of the missing value. Although this
approach permits the inclusion of all instances in the final analysis, it leads to invalid results. Use of
MMSI will lead to valid estimates of mean or modal values from the data only if the missing values are
MCAR, but the estimates of the variance and covariance parameters (and hence correlations, regression
coefficients, and other similar parameters) are invalid because this method underestimates the
variability among missing values by replacing them with the corresponding mean or modal value. In
fact, the failure to account for the uncertainty behind imputed data seems to be the general drawback for
single imputation methods.
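In code, MMSI is a single pass over the columns; a minimal pandas sketch (the function name is illustrative):

```python
import pandas as pd

def mean_mode_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Mean/mode single imputation: numeric columns receive the column mean,
    categorical columns receive the most frequent value (mode)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```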
3.2.2 Multiple Imputation
Multiple imputation (MI) is one of the most attractive methods for general purpose handling of missing
data in multivariate analysis. Rubin (1987) described MI as a three-step process. First, sets of plausible
values for missing instances are created using an appropriate model that reflects the uncertainty due to
the missing data. Each of these sets of plausible values can be used to “fill-in” the missing values and
create a “completed” dataset. Second, each of these datasets can be analyzed using complete-data
methods. Finally, the results are combined. For example, replacing each missing value with a set of five
plausible values or imputations would result in building five DTs, and the predictions of the five trees
would be averaged to produce a single prediction, i.e., the combined result is obtained by multiple imputation.
There are various ways to generate imputations. Schafer (1997) has written a set of general purpose
programs for MI of continuous multivariate data (NORM), multivariate categorical data (CAT), mixed
categorical and continuous data (MIX), and multivariate panel or clustered data (PAN). These programs
were initially created as functions operating within the statistical languages S and SPLUS [SPLUS,
2003]. NORM includes an EM algorithm for maximum likelihood estimation of means, variances and
covariances. NORM also adds regression-prediction variability by a procedure known as data
augmentation [Tanner and Wong, 1987]. Although not absolutely necessary, it is almost always a good
idea to run the EM algorithm before attempting to generate MIs. The parameter estimates from EM
provide convenient starting values for data augmentation (DA). Moreover, the convergence behaviour
of EM provides useful information on the likely convergence behaviour of DA. This is the approach we
follow in this paper, which we shall from now on call EMMI.
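A hedged sketch of the overall multiple-imputation-plus-trees workflow is given below. It is not Schafer's NORM itself; scikit-learn's IterativeImputer with sample_posterior=True is used here only as a stand-in source of M plausible completed datasets, and the function name is illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

def mi_tree_predict(X_train, y_train, X_test, M=5, seed=0):
    """Impute M times, fit one decision tree per completed dataset,
    and average the predicted class probabilities."""
    prob_sum = None
    for m in range(M):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed + m)
        Xtr = imputer.fit_transform(X_train)   # m-th completed training set
        Xte = imputer.transform(X_test)        # impute the test data the same way
        tree = DecisionTreeClassifier(random_state=seed).fit(Xtr, y_train)
        proba = tree.predict_proba(Xte)
        prob_sum = proba if prob_sum is None else prob_sum + proba
    avg = prob_sum / M                         # combine the M analyses
    return tree.classes_[np.argmax(avg, axis=1)]
```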
3.3 Machine Learning Techniques
ML algorithms have been successfully used to handle incomplete data. The ML techniques
investigated in this paper involve the use of decision trees [Breiman et al., 1984; Quinlan, 1993]. These
non-parametric techniques deal with missing values during the training (learning) or testing
(classification) process. A well-known benefit of nonparametric methods is their ability to achieve
estimation optimality for any input distribution as more data are observed, a property that no model
with a parametric assumption can have. In addition, tree-based models do not make any assumptions on
the distributional form of the data and do not require a structured specification of the model; thus they are
not influenced by data transformations, nor are they influenced by outliers [Breiman et al., 1984].
3.3.1 Fractional Cases
Quinlan (1993) borrows the probabilistic complex approach by Cestnik et al., (1987) by “fractioning”
cases or instances based on a priori probability of each value determined from the cases at that node that
have specified values. Quinlan starts by penalising the information gain measure by the proportion of
unknown cases and then splits these cases to both subnodes of the tree.
The learning phase requires that the relative frequencies from the training set be observed. Each case of,
say, class C with an unknown value for attribute A is replaced by fractional cases. The next step is to distribute the unknown
examples according to the proportion of occurrences in the known instances, treating an incomplete
observation as if it falls down all subsequent nodes. For example, if an internal node t has ten known
examples (six examples in t_L and four in t_R), then we would say the probability of t_L is 0.6 and
the probability of t_R is 0.4. Hence, a fraction of 0.6 of instance x is distributed down the branch for t_L
and a fraction of 0.4 of instance x down the branch for t_R. This is carried out throughout the tree construction process. The
evaluation measure is weighted with the fraction of known values to take into account that the
information gained from that attribute will not always be available (but only in those cases where the
attribute value is known). During training, instance counts used to calculate the evaluation heuristic
include the fractional counts of instances with missing values. Instances with multiple missing values can
be fractioned multiple times into numerous smaller and smaller “portions”.
For classification, Quinlan's (1993) technique is to explore all branches below the node in question and
then take into account that some branches are more probable than others. Quinlan further borrows
Cestnik et al.'s strategy of summing the weights of the instance fragments classified in different ways at
the leaf nodes of the tree and then choosing the class with the highest probability or the most probable
classification. Basically, when a test attribute has been selected, the cases with known values are divided
on the branches corresponding to these values. The cases with missing values are, in a way, passed down
all branches, but with a weight that corresponds to the relative frequency of the value assigned to a
branch. Both strategies for handling missing attribute values are used for the C4.5 system.
Despite its strengths, the fractional cases technique can be quite a slow, computationally intensive
process because several branches must do the calculation simultaneously. So, if K branches do the
calculation, then the central processing unit (CPU) time spent is K times the individual branch
calculation.
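The classification step can be sketched as follows. This is a hedged illustration of the "fractioning" idea rather than C4.5's actual implementation: when the split attribute is missing, the instance's weight is divided across all branches in proportion to the training frequencies, and the class distributions reached at the leaves are summed with those weights. The node dictionary structure is an assumption made for the example.

```python
def classify_fractional(node, x, weight=1.0):
    """node: dict with either {'leaf': {class: prob, ...}} or
    {'attr': name, 'branches': {value: child}, 'freq': {value: fraction}}.
    x: dict mapping attribute names to values (None or absent = missing)."""
    if 'leaf' in node:
        return {c: weight * p for c, p in node['leaf'].items()}
    value = x.get(node['attr'])
    if value is not None and value in node['branches']:
        branches = [(node['branches'][value], weight)]
    else:
        # missing value: pass the instance down every branch with fractional weight
        branches = [(child, weight * node['freq'][v])
                    for v, child in node['branches'].items()]
    totals = {}
    for child, w in branches:
        for c, p in classify_fractional(child, x, w).items():
            totals[c] = totals.get(c, 0.0) + p
    return totals  # choose the class with the highest total weight
```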
3.3.2 Surrogate Variable Splitting
Surrogate variable splitting (SVS) has been used for the CART system and further pursued by Therneau
and Atkinson (1997) in RPART. CART handles missing values in the database by substituting "surrogate
splitters". Surrogate splitters are predictor variables that are not as good at splitting a group as the
primary splitter but which yield similar splitting results; they mimic the splits produced by the primary
splitter; the second does second best, and so on. The surrogate splitter contains information that is
typically similar to that which would be found in the primary splitter, and the surrogates are used at
tree nodes when values are missing. Both values for the dependent variable (response) and at
least one of the independent attributes take part in the modelling. The surrogate variable used is the one
that has the highest correlation with the original attribute (observed variable most similar to the missing
variable or a variable other than the optimal one that best predicts the optimal split). The surrogates are
ranked. Any observation missing on the split variable is then classified using the first surrogate variable,
or if missing that, the second is used, and so on. The CART system only handles missing values in the
testing case but RPART handles them on both the training and testing cases.
The idea of surrogate splits is conceptually excellent. Not only does it solve the problem of missing
values but it can help identify the nodes where masking or disguise (when one attribute hides the
importance of another attribute) of specific attributes occurs. This is due to its ability to make use of
all the available data, i.e., involving all the attributes when there is any observation missing the split
attribute. By using surrogates, CART handles each instance individually, providing a far more accurate
analysis. Also, other incomplete data techniques treat all instances with missing values as if the
instances all had the same unknown value; with that technique all such "missings" are assigned to the
same bin. For surrogate splitting, each instance is processed using data specific to that instance; and this
allows instances with different data patterns to be handled differently, which results in a better
characterisation of the data (Breiman et al., 1984). However, practical difficulties can affect the way
surrogate splitting is implemented. Surrogate splitting ignores the quantity of missing values. For
example, a variable taking a unique value for exactly one case in each class and missing on all other
cases yields the largest decrease in impurity (Wei-Yin, 2001). In addition, the idea of surrogate splitting
is reasonable if high correlations among the predictor variables exist. Since the “problem” attribute (the
attribute with missing values) is crucially dependent on the surrogate attribute in terms of a high
correlation, when the correlation between the “problem” attribute and the surrogate is low, surrogate
splitting becomes very clumsy and unsatisfactory. In other words, the method is highly dependent on
the magnitude of the correlation between the original attribute and its surrogate.
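The core of surrogate selection can be sketched as follows: among candidate attributes, choose the split that agrees most often with the primary split on the node's training instances. This is a simplified, agreement-based version of CART's procedure under stated assumptions (numeric candidate attributes with no missing values at this node); the function and variable names are illustrative.

```python
import numpy as np

def best_surrogate(X, primary_goes_left, candidate_cols):
    """Pick the surrogate (column, threshold) whose split agrees most often
    with the primary split on the node's training instances.

    X: 2-D numeric array of the node's instances.
    primary_goes_left: boolean array, the primary split's left/right assignment."""
    best, best_agree = None, -1.0
    for j in candidate_cols:
        for thr in np.unique(X[:, j]):
            goes_left = X[:, j] <= thr
            # also consider the mirrored split (surrogate may send cases the other way)
            agree = max(np.mean(goes_left == primary_goes_left),
                        np.mean((~goes_left) == primary_goes_left))
            if agree > best_agree:
                best_agree, best = agree, (j, thr)
    return best, best_agree
```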
3.4 Missing data programs and codes
The LD, SVS and FC procedures are the only three embedded methods that do not estimate the missing
values, i.e., they are not based on “filling in” a value for each missing datum, whether handling incomplete
training data, incomplete test data, or both.
Programs and code that were used for the methods are briefly described below:
No software or code was used for LD. Instead, all instances with missing values on that particular
attribute were manually excluded or dropped, and the analysis was applied only to the complete
instances.
For the SVS method, a recursive partitioning (RPART) routine, which implements within S-PLUS
many of the ideas found in the CART book and programs of Breiman et al. (1984) was used for both
training and testing decision trees. This programme, which handles both incomplete training and test
data, is by Therneau and Atkinson (1997).
The decision tree learner C4.5 was used as a representative of the FC or probabilistic technique for
handling missing attribute values in both the training and test samples. This technique is probabilistic in
the sense that it constructs a model of the missing values, which depends only on the prior distribution
of the attribute values for each attribute tested in a node of the tree. The main idea behind the technique
is to assign probability distributions at each node of the tree. These probabilities are estimated based on
the observed frequencies of the attribute values among the training instances at that particular node.
The remaining four methods are pre-replacing methods, which use estimation as a technique of
handling missing values, i.e., the process of “filling in” missing values in instances using some
estimation procedure.
The DTSI method uses a decision tree for estimating the missing values of an attribute and then uses the
data with filled values to construct a decision tree for estimating or filling in the missing values of other
attributes. This method makes sense when building a decision tree with incomplete data, the class
variable (which plays a major role in the estimation process) is always present. For classification
purposes (where the class variable is not present), first, imputation for one attribute (the attribute highly
correlated with class) was done using the mean (for numerical attributes) or mode (for categorical
attributes), and then the attribute was used to impute missing values of the other attributes using the
decision tree single imputation technique. In other words, two single imputation techniques were used
to handle incomplete test data. S-PLUS code was developed to estimate missing attribute values using a
decision tree for both incomplete training and test data.
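The original code was written in S-PLUS and is not reproduced here; the sketch below is a simplified Python analogue of a single DTSI pass for one numeric attribute (column names are placeholders), assuming the predictor columns have already been completed.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def dtsi_numeric(df, target_col, predictor_cols):
    """Fill missing values of one numeric attribute with the predictions of a
    decision tree trained on the rows where that attribute is observed.
    The predictor columns are assumed to be complete (or already filled in)."""
    observed = df[target_col].notna()
    if observed.all():
        return df
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    tree.fit(df.loc[observed, predictor_cols], df.loc[observed, target_col])
    df.loc[~observed, target_col] = tree.predict(df.loc[~observed, predictor_cols])
    return df

# For training data the class label can be included among the predictors, since it
# is always present; for test data (class unknown) one attribute is first filled by
# its mean or mode and then used to help impute the remaining attributes.
```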
S-PLUS code was also developed for the MMSI approach. The code replaces the missing data for a given
attribute by the mean (for a numerical or quantitative attribute) or the mode (for a nominal or qualitative
attribute) of all known values of that attribute.
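For reference, the same rule is only a few lines in Python with pandas (a sketch, not the S-PLUS code used in the study):

```python
import pandas as pd

def mean_mode_impute(df):
    """Replace each missing entry by the column mean (numeric attributes)
    or the column mode (nominal attributes)."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled
```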
There are many implementations of MI. Schafer's (1997) set of algorithms (headed by the NORM
program), which use iterative Bayesian simulation to generate imputations, was an excellent option. NORM
was used for datasets with only continuous attributes. A program called MIX was used for mixed
categorical and continuous data. MIX is an extension of the well-known general location model. It
combines a log-linear model for the categorical variables with a multivariate normal regression for the
continuous ones. For strictly categorical data, CAT was used. All three programs are available as S-PLUS
routines. Schafer (1997) and Schafer and Olsen (1998) give details of the general location model and
other models that could be used for imputation tasks. One critical part of MI is to assess the convergence
of data augmentation. Data augmentation (DA) was used to create the multiple imputations, following the
rule of thumb of Schafer (1997) of first running the EM algorithm prior to running DA to get a feel for
how many iterations may be needed. This is due to experience showing that DA (an iterative simulation
technique that combines features of the EM algorithm and MI, with M imputations of each missing value
at each iteration) nearly always converges in fewer iterations than EM. Therefore, EM
estimates of the parameters were computed and the number of iterations required, say t, was recorded.
Then, we performed a single run of the data augmentation algorithm of length tM, using the EM estimates as
starting values, where M is the number of imputations required. The convergence of the EM algorithm is
linear and is determined by the fraction of missing information. Thus, when the fraction of missing
information was large, convergence was very slow due to the number of iterations required. However, for
small missing value proportions convergence was obtained much more rapidly with less strenuous
convergence criteria. We used the completed datasets from iterations 2t, 4t, …, 2Mt. In our experiments
we used MI with M=5, and averaged the predictions of the 5 resulting trees.
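The sketch below is a rough Python analogue of this combine step (it is not Schafer's NORM/MIX/CAT and does not implement data augmentation; scikit-learn's IterativeImputer with posterior sampling stands in as the source of the M completed datasets):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

def mi_tree_predict(X_train, y_train, X_test, m=5):
    """Draw M completed training/test sets, grow one tree per completed
    training set, and combine by averaging the trees' class probabilities."""
    probas = []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        X_tr = imputer.fit_transform(X_train)   # imputation model fitted on the training data
        X_te = imputer.transform(X_test)        # the same model applied to the test data
        tree = DecisionTreeClassifier(random_state=i).fit(X_tr, y_train)
        probas.append(tree.predict_proba(X_te))
    avg = np.mean(probas, axis=0)               # single inference combined from the M analyses
    return tree.classes_[avg.argmax(axis=1)]    # classes_ is identical across the M trees
```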
Due to the limit of the dynamic memory in S-PLUS for Windows [S-PLUS, 2003] when using the EM
approach, all the big datasets were partitioned into subsets, and S-PLUS was run on one subset at a time.
Our partitioning strategy was to put highly correlated variables with similar scales (for continuous
attributes) into the same subset. This strategy made the convergence criteria in the iterative methods
easier to set up and very likely to produce more accurate results. The number of attributes in each subset
depended on the number of instances and the number of free parameters to be estimated in the model,
which included cell probabilities, cell means and variance-covariances. The number of attributes in each
subset was determined in such a way that the size of the data matrix and the dynamic memory
requirement was under the S-PLUS limitation and the number of instances was large relative to the
number of free parameters. Separate results from each subset were then averaged to produce an
approximate EM-based method, which is substituted for (and continues to be called) EM in our investigation.
To measure the performance of methods, the training set/test set methodology is employed. For each
run, each dataset is split randomly into 80% training and 20% testing, with different percentages of
missing data (0%, 15%, 30%, and 50%) in the covariates for both the training and testing sets. A
classifier was built on the training data, and its predictive accuracy, measured by the smoothed error
rate of the tree, was estimated on the test data.
Trees on complete training data were grown using the Tree function in S-PLUS [Becker et al., 1988,
Venables and Ripley, 1994]. The function uses the GINI index of impurity [Breiman et al., 1984] as a
splitting rule and cross validation cost-complexity pruning as pruning rule. Accuracy of the tree, in the
form of a smoothed error rate, was predicted using the test data.
4. EXPERIMENTS
4.1. Experimental Set-Up
The objective of this paper is to investigate the robustness and accuracy of methods for tolerating
incomplete data using tree-based models. This section describes experiments that were carried out in
order to compare the performance of the different approaches previously proposed for handling missing
values in both the training set and test (unseen) set. The effects of different proportions of missing
values when building or learning the tree (training) and when classifying new instances (testing) are
further examined, experimentally. Finally, the impact of the nature of different missing data
mechanisms on the classification accuracy of resulting trees is examined. A combination of small and
large datasets, with a mixture of both nominal and numerical attribute variables, was used for these
tasks. All datasets have no missing values. The main reason for using datasets with no missing values is
to have total control over the missing data in each dataset.
To perform the experiment each dataset was split randomly into 5 parts (Part I, Part II, Part III, Part IV,
Part V) of equal (or approximately equal) size. 5-fold cross validation was used for the experiment. For
each fold, four of the parts of the instances in each category were placed in the training set, and the
remaining one was placed in the corresponding test. The same splits of the data were used for all the
methods for handling incomplete data.
Since the distribution of missing values among attributes and the missing data mechanism were two of
the most important dimensions of this study, three suites of data were created corresponding to MCAR,
MAR and IM. In order to simulate missing values on attributes, the original datasets were run through a
random generator (for MCAR) or a quintile attribute-pair approach (for MAR and IM). Both procedures
take the percentage of missing values as a parameter.
These two approaches were also run to get datasets with four levels of proportion of missingness p, i.e.,
0%, 15%, 30% and 50% missing values. The experiment consists of having p% of data missing from
both the training and test sets. This was carried out for each dataset and 5-fold cross validation was
used. Note that the modelling of the three suites was carried out after the training-testing split for each
of the 5 iterations of cross validation. In other words, missingness was injected after the splitting of the
data into training and test sets for each fold.
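A schematic Python skeleton of one such suite is sketched below under simplifying assumptions (MCAR injection only, mean imputation as a stand-in for a missing-data technique, and plain misclassification error instead of the smoothed error rate); it is meant only to show the ordering of the steps, in particular that missingness is injected after the train/test split.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

def inject_mcar(X, p, rng):
    """Set each covariate entry to NaN independently with probability p."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < p] = np.nan
    return X

def run_suite(X, y, proportions=(0.0, 0.15, 0.30, 0.50), seed=0):
    """X, y are numpy arrays of covariates and class labels."""
    rng = np.random.default_rng(seed)
    errors = {p: [] for p in proportions}
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        for p in proportions:
            X_tr = inject_mcar(X[train_idx], p, rng)   # missingness injected after the split,
            X_te = inject_mcar(X[test_idx], p, rng)    # separately for training and test folds
            imp = SimpleImputer(strategy="mean").fit(X_tr)
            tree = DecisionTreeClassifier(random_state=seed)
            tree.fit(imp.transform(X_tr), y[train_idx])
            err = np.mean(tree.predict(imp.transform(X_te)) != y[test_idx])
            errors[p].append(err)
    return {p: float(np.mean(e)) for p, e in errors.items()}
```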
The missing data mechanisms were constructed by generating a missing value template (1 = present, 0
= missing) for each attribute and multiplying that attribute by a missing value template vector. Our
assumption is that the instances are independent selections.
For each dataset, two suites were created. First, missing values were simulated on only one attribute.
Second, missing values were introduced on all the attribute variables. For the second suite, the
missingness was evenly distributed across all the attributes. This was the case for the three missing data
mechanisms, which from now on shall be called MCARuniva, MARuniva, IMuniva (for the first suite)
and MCARunifo, MARunifo, IMunifo (for the second suite). These procedures are described below.
For MCAR, each vector in the template (values of 1's for non-missing and 0's for missing) was
generated using a random number generator utilising the Bernoulli distribution. The missing value
template is then multiplied by the attribute of interest, thereby causing missing values to appear as zeros
in the modified data.
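As a sketch (in Python rather than the S-PLUS used in the study), generating and applying such a template for one attribute might look like this; the final recoding of the zeros to NaN is an added step so that downstream routines can recognise the entries as missing:

```python
import numpy as np

def mcar_template(attribute_values, p, rng):
    """Bernoulli missing-value template (1 = present, 0 = missing) applied to
    one attribute; the masked positions are then recoded as NaN."""
    template = rng.binomial(1, 1.0 - p, size=len(attribute_values))
    masked = attribute_values * template           # missing values appear as zeros
    return np.where(template == 1, masked, np.nan)

rng = np.random.default_rng(0)
values = np.array([3.1, 4.7, 2.2, 5.9, 1.4, 6.3])
print(mcar_template(values, p=0.30, rng=rng))
```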
Simulating MAR values was more challenging. The idea is to condition the generation of missing values
based upon the distribution of the observed values. Attributes of a dataset are separated into pairs, say,
(A_X, A_Y), where A_Y is the attribute into which missing values are introduced and A_X is the attribute
on whose distribution the missingness of A_Y is conditioned, i.e., P(A_Y missing | A_X observed). Pair
selection of attributes was based on high correlations among the attributes. Since we want to keep the
percentage of missing values at the same level overall, we had to alter the percentage of missing values of
the individual attributes. Thus, in the case of k% of missing values over the whole dataset, 2k% of
missing values were simulated on A_Y. For example, having 10% of missing values on two attributes is
equivalent to having 5% of missing values on each attribute. Thus, for each of the A_X attributes its 2k
quintile was estimated. Then all the instances were examined and, whenever the A_X attribute had a value
lower than its 2k quintile, a missing value on A_Y was imputed with probability 1, and with probability 0
otherwise. More formally, P(A_Y missing | A_X < q_2k) = 1 and P(A_Y missing | A_X >= q_2k) = 0, where
q_2k denotes the 2k quintile of A_X. This technique generates a missing value template which is then
multiplied with A_Y. Once again, the attribute chosen to have missing values was
the one highly correlated with the class variable. Here, the same levels of missing values are kept. For
multi-attributes, different pairs of attributes were used to generate the missingness. Each attribute is
paired with the one it is highly correlated to. For example, to generate missingness in half of the attributes
for a dataset with, say, 12 attributes (i.e., A_1, ..., A_12), the pairs (A_1, A_2), (A_3, A_4) and
(A_5, A_6) could be utilised. We assume that A_1 is highly correlated with A_2, A_3 with A_4, and so on.
In contrast to the MAR situation outlined above where data missingness is explainable by other measured
variables in a study, IM data arise due to the data missingness mechanism being explainable, and only
explainable by the very variable(s) on which the data are missing. For conditions with IM data, a
procedure identical to MAR was implemented; however, for IM the missing values template was created
using the same attribute variable from which values were deleted in different proportions.
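The quintile attribute-pair idea can be sketched in a few lines of Python (an illustration under the assumptions above; the direction of the rule, missing below the 2k quintile of the conditioning attribute, is the choice consistent with the 2k% target, and the IM variant simply conditions on the attribute itself):

```python
import numpy as np

def mar_pair(a_x, a_y, k):
    """Make roughly 2k% of A_Y missing, conditioned on the observed values of A_X."""
    cutoff = np.quantile(a_x, 2 * k)        # the "2k quintile" of A_X
    a_y = a_y.astype(float).copy()
    a_y[a_x < cutoff] = np.nan              # missing where A_X falls below the cutoff
    return a_y

def im_single(a_y, k):
    """Informatively missing: condition on the very attribute whose values are deleted."""
    return mar_pair(a_y, a_y, k)

rng = np.random.default_rng(1)
a_x = rng.normal(size=1000)
a_y = 0.9 * a_x + rng.normal(scale=0.3, size=1000)   # a highly correlated attribute pair
print(np.isnan(mar_pair(a_x, a_y, k=0.15)).mean())    # roughly 0.30 of A_Y is missing
```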
For consistency, missing values were generated on the same attributes for each of the three missing data
mechanisms. This was done for each dataset. For split selection, the impurity approach was used. For
pruning, a combination of 10-fold cross validation cost-complexity pruning and the 1 Standard Error (1-SE)
rule (Breiman et al., 1984) was used to determine the optimal value of the complexity parameter. The same
splitting and pruning rules were applied when growing the tree for each of the twenty one datasets.
It was reasoned that the condition with no missing data should be used as a baseline and what should be
analysed is not the error rate itself but the increase or excess error induced by the combination of
conditions under consideration. Therefore, for each combination of method for handling incomplete data,
number of attributes with missing values, and proportion of missing values, the error rate with all data
present was subtracted from the error rate at each of the three non-zero proportions of missingness. This
is the justification for the use of differences in error rates in some of the experimental results.
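In other words, the quantity analysed is the excess error relative to the complete-data baseline; the small Python sketch below (with made-up error rates) shows the subtraction:

```python
def excess_error(error_rates):
    """Subtract the complete-data (0% missing) error rate from the error rate
    at each non-zero proportion of missingness."""
    baseline = error_rates[0]
    return {p: e - baseline for p, e in error_rates.items() if p > 0}

# Hypothetical error rates for one method on one dataset
print(excess_error({0: 0.12, 15: 0.15, 30: 0.19, 50: 0.26}))
# approximately {15: 0.03, 30: 0.07, 50: 0.14}
```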
All statistical tests were conducted using the MINITAB statistical software program (MINITAB, 2002).
Analyses of variance, using the general linear model (GLM) procedure [Kirk, 1982] were used to
examine the main effects and their respective interactions. This was done using a 4-way repeated
measures design (where each effect was tested against its interaction with datasets). The fixed effect
factors were: the missing data technique; the number of attributes with missing values (missing data
pattern); the missing data proportion; and the missing data mechanism. A 1% level of significance was
used because of the large number of effects. The twenty one datasets were used to estimate the smoothed
error. Results were averaged across the five folds of the cross-validation process before carrying out the
statistical analysis; the averaging was done to reduce error variance. A summary of all the main effects
and their respective interactions is provided in the Appendix in the form of an Analysis of Variance
(ANOVA) table.
4.2. Datasets
This section describes the twenty one datasets that were used in the experiments to explore the impact
of missing values on the classification accuracy of resulting decision trees. All twenty one datasets were
obtained from the Machine Learning Repository maintained by the Department of Information and
Computer Science at the University of California at Irvine [Murphy and Aha, 1992]. They are
summarized in Table 1.
Table 1 Datasets used for the experiments

Dataset          Instances   Ordered attributes   Nominal attributes   Classes
Two classes:
german           1000        7                    13                   2
glass (G2)       163         9                    0                    2
heart-statlog    270         13                   0                    2
ionosphere       351         31                   1                    2
kr-vs-kp         3196        0                    36                   2
labor            57          8                    8                    2
pima-indians     768         8                    0                    2
sonar            208         60                   0                    2
More than two classes:
balance scale    625         4                    0                    3
iris             150         4                    0                    3
waveform         5000        40                   0                    3
lymphography     148         3                    15                   4
vehicle          846         18                   0                    4
anneal           898         6                    32                   5
glass            214         9                    0                    6
satimage         6435        36                   0                    6
image            2310        19                   0                    7
zoo              101         1                    15                   7
LED 24           1500        0                    24                   10
vowel            990         10                   3                    11
letter           20000       16                   0                    26
The first eight involve datasets with only two classes and the last thirteen involve datasets with more
than two classes.
As shown in Table 1, the selected twenty one datasets cover a comprehensive range for each of the
following characteristics:
- the size of datasets, expressed in terms of the number of instances, ranges between 57 and 20000
- the number of attributes ranges between 4 and 60
- the number of classes ranges between 2 and 26
- the type of attributes (numerical, nominal or both)
In general, the datasets were selected in order to assure reasonable comprehensiveness of the results.
4.3. Experimental Results
Experimental results on the effects of current methods for handling both incomplete training and test
data on the predictive accuracy of resulting decision trees for different proportions, patterns and
mechanisms of missing data are summarised in Figure 1.
[Figure 1: six panels of line plots showing excess error (%) against the % of missing values in the training and test sets (0, 15, 30, 50) for the methods LD, DTSI, EMSI, MMSI, EMMI, FC and SVS, averaged over the 21 domains.]
Fig. 1. Effects of missing values in training and test data on the excess error for methods over the 21
domains. A) MCARuniva, B) MCARunifo, C) MARuniva, D) MARunifo, E) IMuniva, F) IMunifo
The error rates of each method under the introduced missing values are averaged over the 21 datasets. Also,
the increases in error rate over the complete-data case were formed by taking differences. From these
experiments the following results are observed.
For MCARuniva data, EMMI has on average the best accuracy while LD exhibits one of the biggest
increases in error (Figure 1A). From Figure 1B, most of the methods achieve slightly bigger error rate
increases for MCARunifo data compared with MCARuniva data. In the MARuniva suite, the behaviour
of methods is not very different from the one observed in the MCARuniva case (Figure 1C). From
Figure 1D, the performance of methods for MARunifo data is very similar to the one observed for
MCARunifo data. Figure 1E shows bigger increases in error rates for all the methods for IMuniva data
compared with MCARuniva and MARuniva data, individually. The performance of all the methods, on
average, worsens for IMunifo data. Once again, EMMI proves to be the best method at all levels of
missingness, with LD exhibiting the worst performance with an excess error rate of 23.9% at the 50% level
(Figure 1F).
Main Effects
All the main effects (training and testing methods, number of attributes with missing values, missing
data proportions and missing data mechanisms) were found to be significant at the 1% level.
From Figure 2, EMMI represents a superior approach to missing data while LD is substantially inferior
to the other techniques. The second best method is FC, closely followed by EMSI, DTSI, MMSI and
SVS, respectively. However, there appears to be no clear 'winner' between DTSI, EMSI, MMSI, FC and
SVS in terms of classification accuracy.
[Figure 2: individual confidence intervals of the mean error rate for each method (LD, DTSI, EMSI, MMSI, EMMI, FC, SVS), plotted against a pooled-standard-deviation scale running from about 0.055 to 0.144.]
Fig. 2. Comparison for training and testing methods: confidence intervals of mean error rates
From Figure 3, it appears that missing values have a greater effect when they are distributed among all
the attributes compared with when missing values are on a single attribute variable.
The results for proportion of missing values in both the training and test sets show increases in missing
data proportions being associated with increases in error rates (Fig. 4). In fact, the error rate increase
when 50% of values are missing in both the training and test sets is about one and a half times as big as
the error rate increase when 15% of values are missing on both sets.
[Figures 3 and 4: bar charts of excess error (%).]
Fig. 3. Overall means for number of attributes with missing values (ATTRuniva vs ATTRunifo)
Fig. 4. Overall means for missing data proportions (15%, 30% and 50% missing values in the training and test sets)
[Figure 5: bar chart of excess error (%) by missing data mechanism.]
Fig. 5. Overall means for missing data mechanisms (MCAR, MAR, IM)
From the results presented in Figure 5, IM values entail more serious deterioration in predictive
accuracy compared with randomly missing data (i.e. MCAR or MAR data). Overall, MCAR data have a
lesser impact on classification accuracy with an error rate increase difference of about 4% (when
compared with MAR data) and a much bigger difference of about 10% (when compared with IM data).
Interaction effects
The interaction effect between methods for handling incomplete training and test data and the number
of attributes with missing values is displayed in Figure 6. From the figure, it follows that all methods
perform differently from each other with bigger error rate increases observed when missing values are
in all the attributes compared with when they are on a single attribute variable. A severe difference
between having missing values in only one attribute and having missing values in all the attributes is
observed for DTSI, LD and SVS, while for the remaining methods the difference is not clear.
From Figure 7, the performances of the methods do not differ much at lower levels of missing values but
vary noticeably as the amount of missing values increases.
[Figures 6 and 7: interaction plots of excess error rate against method (LD, DTSI, EMSI, MMSI, EMMI, FC, SVS).]
Fig. 6. Interaction between methods and number of attributes with missing values (ATTRuniva vs ATTRunifo)
Fig. 7. Interaction between methods and proportion of missing values (15%, 30%, 50%)
As observed earlier, all the methods are more severely impacted by IM data compared with MCAR or
MAR data (Figure 8). In addition, all the methods are more effective for dealing with MCAR data.
[Figures 8 and 9: interaction plots of excess error rate.]
Fig. 8. Interaction between methods and missing data mechanisms (MCAR, MAR, IM)
Fig. 9. Interaction between number of attributes with missing values and proportion of missing values (15%, 30% and 50% missing values in the training set)
The results of the interaction between the number of attributes with missing values and missing data
proportions show increases in missing data proportions being associated with slightly greater increases
in excess error rate (Figure 9). Once again, missing values appear to have more impact when they are
distributed in all the attributes than when they are in only one attribute.
5. DISCUSSION AND CONCLUSIONS
This paper examined the accuracy of several commonly used techniques to handle incomplete data when
using DTs. To date, there have been very few studies examining the effects of MDTs for different
patterns and mechanisms of missing data. Each of the techniques reviewed in this paper has strengths
and limitations. However, the use of imputations has become common in many fields in the last few
years. This is due to the increase of incomplete data in research and to advances in imputation
methodology. The selection of a particular MDT can have important consequences on the data analysis
and subsequent interpretation of findings in models of supervised learning, especially if the amount of
missing data is large. The impact of missing data can not only lead to bias in results (especially
parameter estimates derived from the model) but to loss of information and thus to a decrease in
statistical power of hypothesis tests. This may also result in misleading conclusions drawn from a
research study and limit generalizability of the research findings.
Significant advances have been made in the past few decades regarding methodologies which handle
responses to problems and biases which can be caused by incomplete data. Unfortunately, these
methodologies are often not available to many researchers for a variety of reasons (for example, lack of
familiarity, computational challenges) and researchers often resort to ad-hoc approaches to handling
incomplete data, ones which may ultimately do more harm than good [Little and Rubin, 1987; Schafer
and Graham, 2002]. Several researchers have examined various techniques to solve the problem of
incomplete data. One popular approach includes discarding instances with missing values and
restricting the attention to the completely observed instances, which is also known as listwise deletion
(LD). This is the default in commonly used statistical packages [Little and Rubin, 1987]. Another
common way uses imputation (estimation) approaches that fill in a missing value with a single
replacement value, such as the mean, mode, hot deck (using data from other observations in the
sample at hand), and approaches that take advantage of multivariate regression and k-nearest neighbour
models. Another technique for treating incomplete data is to model the distribution of incomplete data
and estimate the missing values based on certain parameters. Specific results are discussed below.
Lakshminarayan et al. (1999) performed a simulation study comparing two ML methods for missing
data imputation. Their results show that for the single imputation task, the supervised learning
algorithm C4.5 [Quinlan, 1993], which utilizes the fractional cases (FC) strategy, performed better than
Autoclass [Cheeseman et al., 1988], a strategy based on unsupervised Bayesian probability. For the
multiple imputation (MI) task, both methods perform comparably.
Kalousis and Hilario (2000) evaluated seven classification algorithms with respect to missing values: two
rule inducers (C5.0-rules and Ripper), one nearest neighbour method, one orthogonal (C5.0-tree), one
oblique decision tree algorithm, a naïve Bayes algorithm and a linear discriminant. Various patterns and
mechanisms of missingness (MCAR and MAR) were simulated in currently complete datasets. Their
results indicate that naïve Bayes (NB) is most resilient to missing values while the k-nearest neighbour
single imputation (kNNSI) and FC are more sensitive to missing values. Their results further show that
for a given proportion of missing values, the distribution of missing values among attributes is at least as
important as the mechanism of missingness.
Another comparative study of LD, MMSI, similar response pattern imputation (SRPI) and full
information maximum likelihood (FIML) in the context of software cost estimation was carried out by
Myrtveit et al. (2001). The simulation study was carried out using 176 projects. Their results show
FIML performing well for missing completely at random (MCAR) data. Also, LD, MMSI and SRPI are
shown to yield biased results for missing data mechanisms other than MCAR.
Fujikawa and Ho (2002) evaluated theoretically several methods of dealing with missing values. The
methods evaluated were MMSI, linear regression, standard deviation method, kNNSI, DTSI, auto-
associative neural network, LD, lazy decision tree, FC and SVS. kNNSI and DTSI showed good results.
In terms of computation cost, MMSI and FC were found to be reasonably good.
Batista and Monard (2003) investigated the effects of four methods of handling missing data at different
proportions of missing values. The methods investigated were kNNSI, MMSI, and the internal algorithms
used by FC and CN2 to treat missing data. Missing values were artificially simulated at different rates
and in different attributes of the datasets. kNNSI showed a superior performance compared with MMSI
when missing values were in one attribute. However, both methods compared favourably when missing
values were in more than one attribute. Otherwise, FC achieved a performance as good as kNNSI.
Song and Shepperd (2004) evaluated kNNSI and class mean imputation (CMI) for different patterns and
mechanisms of missing data. Their results show kNNSI slightly outperforming CMI with the missing
data mechanisms having no impact on either of the two imputation methods.
Twala et al. (2005) evaluated the impact of seven MDTs (LD, EMSI, kNNSI, MMSI, EMMI, FC and
SVS) on 8 industrial datasets by artificially simulating three different proportions, two patterns and
three mechanisms of missing data. Their results show EMMI achieving the highest accuracy rates with
other notably good performances by methods such as FC and EMSI. The worst performance was by
LD. Their results further show that MCAR data are easier to deal with than IM data. Twala (2005)
found missing values to be more damaging when they are in the test sample than in the training sample.
According to the above studies, among the single imputation techniques, the results are not so clear,
especially for small amounts of missing data. However, the performance of each technique differs with
increases in the amount of missing data. Also, despite the fact that the LD procedure involves an
efficiency cost due to the elimination of a large amount of valuable data, most researchers have
continued to use it due to its simplicity and ease of use. There are other problems caused by using the
LD technique. For example, elimination of instances with missing information decreases the error
degrees of freedom in statistical tests such as the student t distribution. This decrease leads to reduced
statistical power (i.e. the ability of a statistical test to discover a relationship in a dataset) and larger
standard errors. Other researchers have shown that randomly deleting 10% of the data from each
attribute in a matrix of five attributes can easily result in eliminating 59% of instances from analysis
[Kim and Curry, 1977]. Furthermore, MI, which overcomes limitations of single imputation, seems not to
have been widely adopted by researchers even though it has been shown to be flexible, and software for
creating multiple imputations is available, some of it downloadable free of charge from the Methodology
Center website at Penn State University (http://methodology.psu.edu/). A great deal of past research
on the effectiveness of MDTs to overcome missing data problems has utilized data that is missing
randomly. Finally, results from previous studies suggest that results achieved using simulated data are
very sensitive to the MAR assumption. Hence, if there is reason to believe that the MAR assumption does
not hold, alternative methods should be used.
The research questions asked which MDTs yielded the least amount of average error when using tree-
based models.
The results of the simulation study show that the proportion of missing data, the missing data
mechanism, the pattern of missing values, and the design of database characteristics (especially the type
of attributes) all have effects on the performance of any MDT.
The effects of missing data have been found to adversely affect DT learning and classification
performance, and this effect is positively correlated with the increasing fractions of missing data.
Another point of discussion is the significance of having missing values in only one attribute, on the
one hand, and allowing missing values in all the attributes, on the other hand. The idea was to see the
impact of pattern over mechanism, or vice versa, at both lower levels and higher levels of missing
values. Our results show the impact on the performance of methods being caused by the pattern and
mechanism of missing values, especially at lower levels of missingness. However, as the proportion of
missing values increases the major determining factor on the performance of methods is how the
missing values are distributed among attributes. All methods yield lower accuracy rates when missing
values are distributed among all the attributes (MCARunifo, MARunifo and IMunifo) compared with
when missing values are on a single attribute (MCARuniva, MARuniva and IMuniva).
The worst performance achieved by the methods is for IM data, followed by MAR and MCAR data,
respectively. This is in accordance with statistical theory which considers MCAR easier to deal with
and IM data as very complex to deal with since it requires assumptions that cannot be validated from
the data at hand [Little and Rubin, 1987]. In addition, in many settings the MAR assumption is more
reasonable than the MCAR. In fact, an MAR method is valid if data are MCAR or MAR, but MCAR
methods are valid only if data are MCAR. This could have contributed to the superior performance of
EMMI (an MAR method) and the substantially inferior performance of LD (an MCAR method).
The results also show that the performance of methods depends on whether missing values are in the
test or training set or in both the training and test sets. Training methods appear to achieve superior
performances compared with testing methods. An explanation for such behaviour will be given later in
the section.
With this experimental set-up, it is easy to say with conviction that, of the eight current techniques
investigated, EMMI is the overall best method for handling both incomplete training and test data.
However, there are competitors like FC and DTSI which performed reasonably well. One important
advantage of FC over DTSI is that it can handle missing values in both the training and test sets while
DTSI struggles as a technique for handling incomplete test data and when all attributes have missing
values. The heavy dependence of DTSI on strong correlations among attributes might have contributed
to its poor performance, as correlations among attributes for some of the datasets were not strong.
However, DTSI performs better when missing values are on a single attribute – a very serious
restriction. The results also indicate that LD is the worst method for handling incomplete data. In
general, it can be seen that model-based methods have better performance than ad hoc methods.
Furthermore, probabilistic methods seem to outperform non-probabilistic methods.
There are several dimensions on which learning methods of handling incomplete data using tree-based
models can be compared. Also, combinations of methods for handling incomplete data while varying
the number of attributes with missing values were not tried. However, prediction accuracy rates of
estimation methods like the EMMI were very impressive. The improvement in accuracy of EMMI over
single imputation methods (CCSI, DTSI, EMSI and MMSI) and other methods (LD, SVS and FC)
could be a result of the reduction in variance obtained by averaging over a number of trees, as is done
in bagging (Breiman, 1996). Even though EMMI emerges as the overall best of the eight techniques, it
has come under fire by critics claiming that proper imputations, necessary for valid inferences, are
difficult to produce, especially in data where multiple factors are deficient (Schafer, 1997), and even
then EMMI is biased in some cases [Robins and Wang, 2000]. Another argument against EMMI is that
it is much more difficult to implement than some of the techniques mentioned. One potential problem
that was encountered in this research is convergence of the EM algorithm [Wu, 1983], especially for big
datasets and datasets with more than 30 attribute variables.
The results also show that the performance of methods depends on whether missing values are in the
test or training set or in both the training and test sets. When looking at the overall performance of
methods, training methods appear to achieve superior performances compared with testing methods.
However, in terms of relative performance, they seem to be about the same.
Each dataset might have more or less its own favourite techniques for processing incomplete data.
However, most of our dataset results are similar to one another. Several factors contribute here: the
methods used; the different types of datasets; the distribution of missing values among attributes, the
magnitude of noise, distortion and source of missingness in each dataset.
There was evidence from our results that how well a method performs depends on the dataset one is
analysing. In fact, all methods were able to handle datasets with numerical attributes better compared
with datasets with only nominal or a mixture of both nominal and numerical attributes.
For small datasets all the techniques seemed to work well with the estimation methods, especially
EMMI which performed better than both the other estimation methods and machine learning methods.
On the other hand, LD gave the worst performance for small datasets. This seems logical since
when using LD you tend to lose a lot of information, especially at higher levels of missing values.
However, for bigger datasets, LD was equally effective compared with methods like CCSI and MMSI
and outperformed them in other situations. The effectiveness of LD probably stems from the fact that
deletion techniques result in data matrices that mirror the true data structure. When data are
systematically missing from a study, the imputation techniques create a “reproduced” data matrix with a
structure somewhat different from that of the true data matrix.
Following EMMI as the second best overall method for handling both incomplete training and test data
is FC. In general, it can be seen that probabilistic methods perform better than non-probabilistic
methods. However, one important disadvantage of FC, just like EMMI, is that it takes a long time to
process (especially for big trees) due to the way it handles missing values. Due to its reliance
on the number of branches to do the calculation simultaneously, if K branches do the calculation, then
the CPU time spent is K times the individual branch calculation.
Of all the single imputation methods, EMSI seems to perform the best for datasets with numerical
attributes. However, despite its strengths, EMSI suffers (like all the single imputation methods) from
biased and sometimes inefficient estimates. DTSI and MMSI achieved good results when missing
values were on categorical attributes. For DTSI, this was the case when the missing values were in only
one attribute while MMSI gave good results generally and in some cases was a good method for
datasets with mixed attributes. Overall, the differences in results between single imputation methods are
relatively small. The similarity of results raises an important question: when and why should we choose
one single imputation method over the other?
Like DTSI, for all the datasets where the correlations among attributes were found to be quite high,
SVS (which relies heavily on strong concordance between a primary splitter and its surrogate(s)),
achieved good results. However, SVS also struggles when missing values are distributed among all the
attributes. In fact, for a few datasets SVS collapsed completely when an instance was missing all the
surrogates. However, some strategies when simulating the missingness among the attributes were used
when the technique collapsed. In addition, the performance of SVS improves as a method for handling
incomplete test data compared with handling incomplete training data.
Another point of discussion is why missing values are more damaging when they are in the test sample
than training sample. If you have a lot of training data then missing values do not make much impact on
the parameter estimates but missing data in the test set refers to only individual cases. That is, the
training data yield statistical summaries, but the test data are concerned with individuals. In other
words, missing data will tend to cancel each other out when training the model. On a new test case, the
investigator must still suffer accuracy effects, inevitably. What is really happening here is that
the increased error in test cases is to be expected and the significantly reduced error when training is a
pleasant surprise and this is due to the averaging.
Furthermore, it is worthwhile mentioning that the performance of some methods could have been
slightly affected by other factors like errors in some datasets. For example, the Pima Indians diabetes
database had quite a number of observations with "zero" values, which are most likely to indicate
missing values although the data was described as being complete. Nonetheless, the prediction that the
impact of certain types of missing data mechanisms on both the testing and training cases should differ
by dataset, by mechanism and by the proportion of missing values is confirmed.
Some experimental results from Section 4.3 support previous findings in the literature and other results
extend the literature. The relative superiority of model-based methods over ad hoc methods is consistent
with past results [Kim and Curry, 1977; Little and Rubin, 1987; Rubin, 1987].
Overall, the performance of each MDT under more complex forms of systematic missingness is
unknown and likely to be problematic [Little and Rubin, 1987]. Systematic missingness in this
simulation was always based on the variables that were in the model rather than unmeasured variables
or combinations of variables. In addition, it is impossible, in practice, to demonstrate whether data are
MAR versus IM, because the values of the missing data are not available for comparison. IM is still a
problem for the methods reviewed here.
While this study confirms statistical insight on the importance of the pattern of missingness, it also
reveals that the distribution of unknown values among the predictors plays an equally (if not more)
decisive role, depending on the percentage of missing data. In a subsequent study we plan to distribute
missing values over as many attributes as possible (other than being restricted to one).
So far, we have restricted our experiments to only tree-based models. It would be interesting to carry
out a comparative study of tree based models with other (non-tree) methods which can handle missing
values. Furthermore, it would be possible to explore the different patterns and levels of missing values.
This paper also addresses the problem of tolerating missing values in both the training and test set.
Breiman et al. (1984) argue that missing values have more impact when they occur in the test set. An
interesting topic would be to assess the impact of missing values when they only occur in either the
training or test set.
We leave the above issues to be investigated in the future.
In sum, this paper provides the beginnings of a better understanding of the relative strengths and
weaknesses of MDTs that use decision trees as their component classifier. It is hoped that it will
motivate future theoretical and empirical investigations into incomplete data and decision trees, and
perhaps reassure those who are uneasy regarding the use of imputed data in prediction.
6. REFERENCES
ARBUCKLE, J.L. AND WOTHKE, W. (1999). Amos 4.0 user’s guide. Chicago, IL: Smallwaters.
BATISTA, G. AND MONARD, M.C. (2003). An Analysis of Four Missing Data Treatment Methods for
Supervised Learning, Applied Artificial Intelligence, 17, 519-533.
BECKER, R., CHAMBERS, J., AND WILKS, A. (1988). The New S Language. Wadsworth
International Group.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., AND STONE, C. (1984). Classification and Regression
Trees, Wadsworth.
BREIMAN, L. (1996). Bagging predictors. Machine Learning, 26 (2), 123-140.
CARTWRIGHT, M., SHEPPERD, M.J., AND SONG, Q. (2003). Dealing with Missing Software
Project Data. In Proceedings of the 9th International Symposium on Software Metrics 2003, 154-165.
CESTNIK, B., KONONENKO, I. AND BRATKO, I. (1987). Assistant 86: a knowledge-elicitation tool
for sophisticated users. In I. Bratko and N. Lavrac, editors, European Working Session on Learning –
EWSL87. Sigma Press, Wilmslow, England, 1987.
CHEESEMAN, P., KELLY, J., SELF, M., STUTZ, J., TAYLOR, W., AND FREEMAN, D. (1988).
Bayesian Classification. In Proceedings of American Association of Artificial Intelligence (AAAI),
Morgan Kaufmann Publishers: San Meteo, CA, 607-611.
CHEN, J. AND SHAO, J. (2000). Nearest Neighbour Imputation for Survey Data. Journal of Official
Statistics, 16, 2, 113-131.
DEMPSTER, A.P., LAIRD, N.M., AND RUBIN, D.B. (1977). Maximum likelihood estimation from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
EL-EMAM, K. AND BIRK, A. (2000). Validating the ISO/IEC 15504 Measures of Software
Development Process Capability. Journal of Systems and Software, 51 (2), 119-149.
FUJIKAWA, Y., HO, T.B. (2002). Cluster-based Algorithms for Filling Missing Values, 6th Pacific-
Asia Conf. Knowledge Discovery and Data Mining, Taiwan, 6-9 May, Lecture Notes in Artificial
Intelligence 2336, Springer Verlag, 549-554.
GRAHAM, J.W. AND HOFER, S.M. (1993). EMCOV.EXE user’s guide (unpublished manuscript).
University Park, PA: Pennsylvania State University Department of Behavioral Health.
JÖNSSON, P. AND WOHLIN, C. (2004). An Evaluation of k-Nearest Neighbour Imputation Using
Likert Data. In Proceedings of the 10th International Symposium on Software Metrics, 108-118,
September 11-17, 2004.
KALOUSIS, A. AND HILARIO, M. (2000). Supervised knowledge discovery from incomplete data. In
Proceedings of the 2nd International Conference on Data Mining 2000. Cambridge, England. July
2000.WIT Press.
KALTON, G. (1983). Compensating for Missing Survey Data, Michigan.
KIM, J.-O. AND CURRY, J. (1977). The treatment of missing data in multivariate analysis.
Sociological Methods and Research, 6, 215-240.
KIRK, R.E. (1982). Experimental design (2nd Ed.). Monterey, CA: Brooks, Cole Publishing Company.
LAKSHMINARAYAN, K., HARP, S.A., AND SAMAD, T. (1999). Imputation of Missing Data in
Industrial Databases. Applied Intelligence, 11, 259-275.
LITTLE, R.J.A. AND RUBIN, D.B. (1987). Statistical Analysis with missing data. New York: Wiley.
MINITAB. (2002). MINITAB Statistical Software for Windows 9.0. MINITAB, Inc., PA, USA.
MULTIPLE IMPUTATION SOFTWARE. Available from <http:/www.stat.psu.edu/jls/misoftwa.html,
http:/methcenter.psu.edu/EMCOV.html>
MUTHÉN, L.K AND MUTHÉN, B.O. (1998). Mplus user’s guide. Los Angeles: Muthén & Muthén.
MURPHY, P. AND AHA, D. (1992). UCI Repository of machine learning databases [Machine-
readable data repository]. University of California, Department of Information and Computer
Science, Irvine, CA.
MYRTVEIT, I., STENSRUD, E., AND OLSSON, U. (2001). Analyzing Data Sets with Missing Data:
An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods. IEEE Transactions
on Software Engineering, 27 (11), 999-1013.
PYLE, D. (1999). Data Preparation for Data Mining. Morgan Kauffman, San Francisco.
QUINLAN, J.R. (1993). C4.5: Programs for Machine Learning. Los Altos, California: Morgan
Kaufmann Publishers, Inc.
ROBINS, J.M. AND WANG, N. (2000). Inference for imputation estimators. Biometrika, 87, 113-124.
RUBIN, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
RUBIN, D.B. (1996). Multiple Imputation After 18+ Years. Journal of the American Statistical
Association, 91, 473-489.
SCHAFER, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall, London.
SCHAFER, J.L. AND OLSEN, M.K. (1998). Multiple Imputation for multivariate missing data
problems: a data analyst's perspective. Multivariate Behavioral Research, 33 (4), 545-571.
SCHAFER, J.L. AND GRAHAM, J.W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7 (2), 147-177.
SENTAS, P., LEFTERIS, A., AND STAMELOS, I. (2004). Multiple Logistic Regression as Imputation
method Applied on Software Effort prediction. In Proceedings of the 10th International Symposium
on Software Metrics, Chicago, 14-16 September 2004.
STRIKE, K., EL-EMAM, K.E., AND MADHAVJI, N. (2001). Software Cost Estimation with
Incomplete Data. IEEE Transaction on Software Engineering, 27 (10), 890-908.
SONG, Q. AND SHEPPERD, M. (2004). A Short Note on Safest Default Missingness Mechanism
Assumptions. In Empirical Software Engineering. (Accepted for publication in 2004).
S-PLUS. (2003). S-PLUS 6.2 for Windows. MathSoft, Inc., Seattle, Washington, USA.
TANNER, M.A. AND WONG, W.H. (1987). The Calculation of Posterior Distributions by Data
Augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550
THERNEAU, T.M., AND ATKINSON, E.J. (1997). An Introduction to Recursive Partitioning Using
the RPART Routines. Technical Report, Mayo Foundation.
TWALA, B. (2005). Effective Techniques for Handling Incomplete Data Using Decision Trees.
Unpublished PhD thesis, Open University, Milton Keynes, UK
TWALA, B., CARTWRIGHT, M., AND SHEPPERD, M. (2005). Comparison of Various Methods for
Handling Incomplete Data in Software Engineering Databases. 4th International Symposium on
Empirical Software Engineering, Noosa Heads, Australia, November 2005.
VENABLES, W.N. AND RIPLEY, B.D. (1994). Modern Applied Statistics with S-PLUS. New York:
Springer.
WU, C.F.J. (1983). On the convergence of the EM algorithm. The Annals of Statistics, 11, 95-103.