This document describes the analysis of a dataset containing attributes of donors to identify probable future donors. It discusses the attributes in the dataset and the data preprocessing steps, including the handling of missing data. It then describes the methodology used for attribute selection, which combined a logic-based selection of attributes with validation using attribute selection methods in Weka. Various models were built using algorithms such as Naive Bayes, J48, Decision Stump, OneR, and ZeroR to predict probable donors. The models' performance was evaluated and compared using metrics such as accuracy, precision, and F-measure. Areas of error in the models were also analyzed.
Classification and Clustering Analysis using Weka Ishan Awadhesh
This Term Paper demonstrates classification and clustering analysis on bank data using Weka. Classification analysis is used to determine whether a particular customer would purchase a Personal Equity Plan, while clustering analysis is used to analyze the behavior of various customer segments.
This document demonstrates clustering and regression techniques using the Weka data mining software. It shows how Weka can be used to cluster 600 bank customer records into 6 groups based on attributes like age, income, family status, etc. It also uses Weka to create a linear regression model to predict house prices based on attributes like size, number of bedrooms, lot size, and more. Overall, the document shows how Weka allows easy implementation of common data mining algorithms and visualization of results.
This document provides an overview of Google Cloud Platform (GCP) services. It discusses computing services like App Engine and Compute Engine for hosting applications. It covers storage options like Cloud Storage, Cloud Datastore and Cloud SQL. It also mentions big data services like BigQuery and machine learning services like Prediction API. The document provides brief descriptions of each service and highlights their key features. It includes code samples for using Prediction API to train a model and make predictions on new data.
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...cscpconf
In today’s world, a gigantic amount of data is available in science, industry, business, and many other areas. This data can provide valuable information that management can use to make important decisions, but the problem is how to find that valuable information. The answer is data mining. Data mining is a popular topic among researchers, and much in the field remains to be explored. This paper focuses on a fundamental concept of data mining: classification techniques. The BayesNet, NaiveBayes, NaiveBayes Updateable, Multilayer Perceptron, Voted Perceptron, and J48 classifiers are used to classify a dataset. The performance of these classifiers is analyzed using Mean Absolute Error, Root Mean Squared Error, and the time taken to build each model, and the results are presented both statistically and graphically. The WEKA data mining tool is used for this purpose.
Classification and Prediction Based Data Mining Algorithm in Weka ToolIRJET Journal
The document discusses different classification algorithms in the Weka data mining tool for predicting data, including J48, SMO, Naive Bayes, REPTree, and Multilayer Perceptron. It analyzes their performance on a housing dataset, finding that the Multilayer Perceptron had the highest accuracy at 95.45%. The algorithms are evaluated using metrics like accuracy, precision, and recall calculated from a confusion matrix. The Multilayer Perceptron is identified as the best performing algorithm for this classification task.
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdfNeha Singh
In 2023, aspiring data analysts can expect comprehensive data analytics course curriculums covering essential topics like statistical analysis, data visualization, machine learning, and big data processing. To prepare for the course, brushing up on basic mathematics, programming, and data handling skills would be beneficial.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
This document presents an analysis of a dataset containing 200,000 mortgage loan applications to predict the interest rate spread. Key findings include:
- The most important predictive features were loan amount, loan type, property type, preapproval status, loan purpose, median family income, applicant income, and minority population percentage.
- A boosted decision tree regression model achieved the highest prediction accuracy with an R-squared of 0.77 on test data, outperforming linear regression and random forest models.
- The analysis included data exploration of relationships between numerical features, feature selection, model training, tuning, and validation.
The objective of this investigation is to predict a customer's purchase decision on a car model based on six given features: buying price, maintenance price, number of doors, seating capacity, luggage space, and safety.
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET Journal
This document discusses classification techniques for data mining. It provides an overview of common classification algorithms including decision trees, k-nearest neighbors (kNN), and Naive Bayes. Decision trees use a top-down approach to classify data based on attribute tests at each node. kNN identifies the k nearest training examples to classify new data points. Naive Bayes assumes independence between attributes and uses Bayes' theorem for classification. The document also discusses how these techniques are used for data cleaning, integration, transformation and knowledge representation in the data mining process.
This document describes a data warehouse and business intelligence project for analyzing Starbucks store data. It discusses extracting data from various structured, semi-structured, and unstructured sources, transforming the data using SQL and R, and loading it into a star schema data warehouse with fact and dimension tables. The data warehouse is then used for business queries and analysis in Tableau, with case studies examining city revenue, visitor and beverage sales by city, and city ratings based on food and beverage counts. The analysis finds that New York City generally has the highest revenue, visitor counts, and ratings.
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
This document describes a study that used machine learning to develop an e-healthcare monitoring system for diagnosing heart disease. The researchers used a modified support vector machine (SVM) algorithm to analyze cardiovascular disease data and predict whether patients have heart disease. They evaluated the performance of their modified SVM against other machine learning models like random forest, gradient boosting, and AdaBoost. The modified SVM achieved the highest accuracy of 88.8%, outperforming the other models. The study concludes that machine learning and deep learning methods can help enable early detection, classification, and prediction of cardiovascular disease.
This document discusses using machine learning to predict stock prices based on historical data. Specifically, it uses a random forest regression model to predict stock prices for the NSE nifty 50 index over the next year, month, and five days. It collects stock price data, preprocesses the data by feature selection, scaling, and splitting into train and test sets. It then trains the random forest regressor on the training set and evaluates the model's performance on the test set using various metrics like RMSE, MAE, and R-squared. The model is able to generate predicted stock prices and identify buy, sell, and hold prices for investors over different time periods.
The document provides an overview of data analysis. It discusses the core components of data analysis, including descriptive, diagnostic, predictive, prescriptive, and cognitive analysis, and describes the tasks of a data analyst: preparing data, modeling it, visualizing results, analyzing the visualizations, and managing the information. Descriptive statistics, Excel, and Power BI are highlighted as important tools for data analysts. The document is an introductory lecture on data analysis concepts and the data analyst's job.
The document discusses the six main steps for building machine learning models: 1) data access and collection, 2) data preparation and exploration, 3) model build and train, 4) model evaluation, 5) model deployment, and 6) model monitoring. It describes each step in detail, including exploring and cleaning the data, choosing a model type, training the model, evaluating model performance on test data, deploying the trained model, and monitoring the model after deployment. The process is iterative, with steps like data preparation and model training often repeated to improve the model.
This document is a machine learning class assignment submitted by Trushita Redij to their supervisor Abhishek Kaushik at Dublin Business School. The assignment discusses data preprocessing techniques, decision trees, the Chinese Restaurant algorithm, and building supervised learning models. Specifically, linear regression and KNN classification models are implemented on population data from Ireland to predict total population and classify countries.
The document discusses using k-means clustering on a life insurance customer dataset to predict customer preferences. It first provides background on k-means clustering and its application in data mining. It then describes applying k-means to a dataset of 14,180 customer records with 10 attributes from an Albanian insurance company. This identified 5 clusters characterizing different customer segments based on attributes like gender, age, and preferred insurance product type and amount. The results help the insurance company better understand customer preferences to improve performance.
Normalization (NORM.docx) drennanmicah
Normalization
Charles Williams
CS352 Unit 3 IP
Professor Jeffery Karlberg
1/26/2019
Table of Contents: The Database Models, Languages, and Architecture; Database System Development Life Cycle; Database Management Systems; Advanced SQL; Web and Data Warehousing and Mining in the Business World; References
Database management
It is important to use a formal design methodology because it provides a structured, mathematical approach to building a reliable database that consolidates all the environments that use it. A design methodology allows the entire design and development effort to proceed with minimal errors, and it helps identify the requirements, specifications, and design levels of the database and data warehouse under development. The planning stage of the consolidated database is very important, as it produces the plans that guide the database's development (Mabogunje, 2015). These plans help manage quality, time, risk, and other issues that might affect the design and development of the database and, eventually, the data warehouse.
The three layers of the 3-level ANSI-SPARC architecture are the physical schema, which defines how data is stored; the conceptual schema, which indexes and relates the data; and the external schema, which defines how information is presented. The architecture is designed to guard and guide data change. Because the physical schema's only job is to define storage, it can change without affecting how external applications interact with the stored data (Pokorný, 2018). The conceptual schema provides a consolidated view of the database, while the external schema can expose richer APIs without requiring changes to the underlying storage mechanisms. The 3-level ANSI-SPARC architecture thus promotes data independence, which saves time in the long run through the conceptual schema's emphasis on data mapping.
Data administrator and database administrator
A data administrator is the person who gathers data requirements, analyzes and designs data, and classifies data types. The two primary roles of a data administrator are establishing the data standards to be applied in databases and setting the policies that govern data security, access, usage, flow, and authorization in an organization. Lesser duties include assisting by developing data resources and enabling the sharing of data across applications.
This document describes a proposed smart health guide app that would allow users to scan food product barcodes and receive guidance on whether that product is suitable for their health condition. The app is intended to help people with common diseases like diabetes, cholesterol, and jaundice make informed choices by checking food nutrients. It would use a machine learning decision tree model trained on product data to analyze barcodes scanned by users and provide consumption recommendations based on their registered health details. The proposed system aims to improve on traditional shopping that does not consider nutritional information. It would retrieve product data from a Firebase database and allow authorized admins to add, update or delete product entries as needed.
Data mining is a significant field in today’s data-driven world, and understanding and implementing its concepts can lead to the discovery of useful insights. This paper discusses the main concepts of data mining, focusing on two in particular: Association Rule Mining and Time Series Analysis.
The proposal begins with an information paper, covering the importance of data management. This includes important concepts related to data quality (validity). It’s followed by a strategic plan to create a Data Management Section/Directorate.
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...IRJET Journal
The document presents a proposed method for robust outsourcing of multi-party datasets while preserving privacy. The method utilizes supermodularity and perturbation techniques. It first pre-processes the dataset to remove unnecessary data. It then replaces attribute values with hierarchies using supermodularity to balance data utility and risk. Association rules are generated and sensitive rules are separated and hidden by decreasing their support levels. Patterns are generated from the encrypted datasets of different parties. Experimental results show the proposed method improves over previous works in terms of lower risk, higher utility, fewer rules, and lower space costs.
The document discusses data preprocessing techniques. It covers why preprocessing is important by addressing issues like incomplete, inaccurate, or inconsistent data. It then describes major tasks in preprocessing like data cleaning, integration, reduction, transformation. Data cleaning techniques discussed include handling missing values, removing noise, and resolving inconsistencies. The goal of preprocessing is to improve data quality and prepare it for data mining.
The document describes a data science project conducted on streaming log data from Cloudera Movies, an online streaming video service. The goals of the project were to understand which user accounts are used most by younger viewers, segment user sessions to improve site usability, and build a recommendation engine. Key steps included exploring and cleaning the data, classifying users as children or adults using a SimRank approach, clustering user sessions to identify behavior patterns, and predicting user ratings through user-user and item-item similarity models to build a recommendation system. Accuracy of 99.64% was achieved in classifying users.
Data Mining on SpamBase,Wine Quality and Communities and Crime DatasetsAnkit Ghosalkar
Application of data mining techniques like Linear Discriminant Analysis (LDA), k-means clustering, Multiple Linear Regression, Principal Component Analysis (PCA), and Logistic Regression on datasets.
This document provides an introduction to data mining concepts including definitions, tasks, challenges, and techniques. It discusses data mining definitions, the data mining process including data preprocessing steps like cleaning, integration, transformation and reduction. It also covers common data mining tasks like classification, clustering, association rule mining and the Apriori algorithm. Overall, the document serves as a high-level overview of key data mining concepts and methods.
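The Apriori idea mentioned above (keep itemsets whose support meets a threshold) can be shown with a deliberately naive enumeration. Real Apriori prunes candidates using the frequent sets of the previous size; this toy version skips that pruning, and the baskets are invented:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return itemsets whose support (fraction of transactions containing
    them) is at least min_support. Naive enumeration, for illustration only."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        any_frequent = False
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(1 for t in transactions if itemset <= t) / n
            if support >= min_support:
                frequent[itemset] = support
                any_frequent = True
        if not any_frequent:  # no frequent set of this size, so none larger
            break
    return frequent

baskets = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk", "butter"},
)]
for itemset, support in frequent_itemsets(baskets, 0.5).items():
    print(sorted(itemset), support)
```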
Data Mining on SpamBase,Wine Quality and Communities and Crime DatasetsAnkit Ghosalkar
Application of Data Mining Techniques like Linear Discriminant Analysis(LDA), k-means clustering, Multiple Linear Regression, Principle Component Analysis(PCA) and Logistic Regression on Datasets
This document provides an introduction to data mining concepts including definitions, tasks, challenges, and techniques. It discusses data mining definitions, the data mining process including data preprocessing steps like cleaning, integration, transformation and reduction. It also covers common data mining tasks like classification, clustering, association rule mining and the Apriori algorithm. Overall, the document serves as a high-level overview of key data mining concepts and methods.
Donor Datamining | Jalaj Nautiyal
Table of Contents
1.0 Executive Summary..............................................................................................................................3
2.0 Data Set Description.............................................................................................................................4
2.1 Attributes.............................................................................................................................................4
2.1.1 Location ..................................................................................................................................4
2.1.2 Income Level ..........................................................................................................................5
2.1.3 Education ................................................................................................................................6
2.1.4 Median Home Value...............................................................................................................6
2.1.5 Number of Donors.................................................................................7
2.1.6 Dollar Amount of Gift ............................................................................7
2.1.7 Average Dollar Amounts of Gifts............................................................................8
2.1.8 Military Association................................................................................................................9
2.1.9 Type of Donor and RFA .........................................................................................................9
2.1.10 Dollar Gift in 97NK............................................................................................................10
2.1.11 Per Capita............................................................................................................................11
2.1.12 Correlation Matrix ..............................................................................................................11
2.1.13 Regression Coefficients ......................................................................................................12
2.2 Attribute Data-Type...........................................................................................................................13
3.0 Missing Data.......................................................................................................................................14
4.0 Attribute Selection..............................................................................................................................15
4.1 Methodology .....................................................................................................................................15
Step-1: Logic Based Selection.......................................................................................................15
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.................................26
5.0 Models ................................................................................................................................................31
5.1 Model-1.............................................................................................................................................31
5.2 Model-2 ............................................................................................32
5.3 Model-3 .............................................................................................................................................34
5.4 Model-4 .............................................................................................................................................37
5.5 Model-5 .............................................................................................................................................38
5.6 Model-6 .............................................................................................................................................40
5.8 Model-8 ............................................................................................43
5.9 Model-9 .............................................................................................................................................44
P a g e 1 | 100
6.0 Different number of attributes but same number of records using NaiveBayes Model........................46
6.1 5 Attributes ....................................................................................................................................46
6.2 10 Attributes.....................................................................................................................................49
6.3 15 Attributes......................................................................................................................................52
6.4 20 Attributes......................................................................................................................................56
6.5 25 Attributes......................................................................................................................................59
6.6 30 Attributes......................................................................................63
6.7 35 Attributes......................................................................................66
6.8 40 Attributes......................................................................................69
7.0 Performance Metrics.............................................................................................................................74
7.1 Calculations for Each Model – Precision and Sensitivity..................................................................74
7.2 Calculations for Each Model – Specificity and NPV ........................................................................77
7.3 Calculations for Each Model – Accuracy and F-Measure.................................................................80
7.4 Comparison of performance of different models based on different algorithms and settings...........83
7.5 Comparison of performance of NaiveBayes Algorithm with different number of attributes............90
8.0 Error Analysis.......................................................................................................................................97
8.1 NaiveBayes .......................................................................................................................................97
8.2 OneR Model......................................................................................................................................98
1.0 Executive Summary
As part of this project we were provided with a veterans dataset consisting of many different attributes collected from various sources, such as the census and mailing lists. The objective of the project is to use the dataset to identify probable donors.
Various data mining concepts taught in the course were applied to analyze the dataset, build models, and identify probable donors. The dataset was first examined for data types and missing data; Weka was used to preprocess, analyze, and interpret the data.
Some of the important attributes were selected from the target dataset and analyzed in depth to understand the relationships and cross-correlations among them, and how much of the variation in the selected attributes explains the probability of a future donor.
A two-pronged approach was used to select attributes that can predict probable donors. In the first iteration, a logic-based selection of attributes from the target dataset was conducted. In addition, Weka's ChiSquaredAttributeEval, GainRatioAttributeEval, and InfoGainAttributeEval evaluators were used to identify the top-ranked attributes. I then compared the results of the two approaches and included or excluded a few attributes from the final list. The methodology and results are described in detail in this paper.
The final list of attributes was used to build NaiveBayes, J48Graft, DecisionStump, OneR, and ZeroR models, to confirm that the attributes selected by the above methodology are useful for predicting probable donors.
The generated models were then applied to the final dataset, and various statistics were computed in Weka to assess how well each model fit the data. A list of control numbers was selected from this final dataset and submitted.
2.0 Data Set Description
We have a dataset with 47,705 records, each with 441 columns. The different attributes and their associated statistics are described below.
2.1 Attributes
2.1.1 Location
The geographic location of a person is important for identifying probable future donors. Out of the many location-related attributes in the dataset, I analyzed states and the impact of various income levels to obtain the distribution of family and household incomes; a few of the states analyzed are shown below.
Figure 1 Income Distribution-California
Figure 2 Income Distribution - Colorado
Figure 3 Income Distribution - Florida
2.1.2 Income Level
Expendable income is an important consideration when estimating future donors, so the number of people below the poverty line in each state also gives an idea of which states probable donors may come from.
Figure 4 Income Level
2.1.3 Education
Education is an important parameter in my analysis, and the number of magazine subscriptions serves as a proxy for education level: the higher the education, the higher the probability of the person becoming a donor.
Figure 5 Magazine Subscription
2.1.4 Median Home Value
Income is an important parameter for identifying probable donors. The following chart shows the distribution of HV1 (median home value) across the 50 states; the higher the HV1 value, the more probable donors a state is likely to yield.
Figure 6 Median Home Value
2.1.5 Number of Donors
The current number of donors in each state, derived from the target dataset, is important information for identifying probable future donors; the following chart provides it.
Figure 7 Number of Donors
2.1.6 Dollar Amount of Gift
The dollar amount of lifetime gifts to date by current donors is an important attribute for estimating the probability of a future donor. RAMNTALL is the dollar amount of lifetime gifts to date; the following chart shows its average value in each state.
Figure 8 Life Time Dollar Amount
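The per-state averaging behind these charts can be sketched with a pandas groupby. The frame below is a tiny made-up sample; in the actual dataset each record carries a state code and its RAMNTALL value.

```python
# Hedged sketch of the per-state averaging behind the charts (toy data).
import pandas as pd

df = pd.DataFrame({
    "STATE": ["CA", "CA", "FL", "FL", "CO"],
    "RAMNTALL": [120.0, 80.0, 60.0, 40.0, 90.0],
})
# mean lifetime gift amount per state, as plotted in the charts
avg_by_state = df.groupby("STATE")["RAMNTALL"].mean()
print(avg_by_state)  # CA 100.0, CO 90.0, FL 50.0
```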
LASTGIFT is the dollar amount of the most recent gift. It is important for understanding how recently a current donor gave. The following chart shows the average LASTGIFT for each state.
Figure 9 Average Gift Amount of most recent gift
2.1.7 Average Dollar Amounts of Gifts
The average dollar amount of gifts to date is also an important attribute for estimating future donors. The following chart shows the state-by-state average of AVGGIFT, the average dollar amount of gifts to date.
Figure 10 Avg Dollar Amt of gifts to date
2.1.8 Military Association
Past or current association with the military is an important attribute for estimating probable donors. WWIIVETS gives the percentage of WWII veterans; the following chart shows the average of this percentage for each state.
Figure 11 Avg World War II Vets
2.1.9 Type of Donor and RFA
The following chart provides information on Super donors who participated in RFA_3 through RFA_23 in each state; RFA is an important attribute for our objective of identifying probable donors.
Figure 12 Number of Super Donors for RFA_3 to RFA_23
The following chart provides the same information for Active donors who participated in RFA_3 through RFA_23 in each state.
Figure 13 Number of Active Donors for RFA_3 to RFA_23
2.1.10 Dollar Gift in 97NK
TARGET_D is the dollar amount associated with the response to the 97NK mailing. The chart below shows the sum of TARGET_D for each state. This information is important because it indicates the spending of current donors.
Figure 14 Sum of Dollar Gift Amt to 97 Mailing list
2.1.11 Per Capita
The per capita income of donors (IC5) is an important attribute for identifying future donors. The following chart shows the average per capita income of donors across states.
Figure 15 Per Capita Income
2.1.12 Correlation Matrix
The following table shows the correlation matrix for some of the attributes, showing how different attributes affect whether a person is a probable donor. It guided my choice of attributes worth considering for final model selection.
Figure 16 Correlation Matrix
2.1.13 Regression Coefficients
Based on the correlation coefficients, I ran a regression in Excel to determine how well these attributes explain TARGET_B and which variables have explanatory power, as a starting point for filtering the number of attributes.
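The correlation-then-regression check can be sketched as follows, with numpy least squares in place of the Excel regression. AVGGIFT, LASTGIFT, and TARGET_B are attribute names from the dataset, but the values here are synthetic.

```python
# Hedged sketch: correlation matrix plus an OLS fit of TARGET_B on two
# predictors, using synthetic data in place of the veteran dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AVGGIFT": rng.normal(15.0, 5.0, 200),
    "LASTGIFT": rng.normal(17.0, 6.0, 200),
})
df["TARGET_B"] = (0.03 * df["AVGGIFT"] + 0.02 * df["LASTGIFT"]
                  + rng.normal(0.0, 0.2, 200) > 0.8).astype(int)

print(df.corr())  # pairwise correlation matrix, as in Figure 16

# ordinary least squares: TARGET_B ~ intercept + AVGGIFT + LASTGIFT
X = np.column_stack([np.ones(len(df)), df["AVGGIFT"], df["LASTGIFT"]])
coef, *_ = np.linalg.lstsq(X, df["TARGET_B"].to_numpy(), rcond=None)
print("coefficients (intercept, AVGGIFT, LASTGIFT):", coef)
```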
2.2 Attribute Data-Type
• I changed the datatype of attributes IC6 through IC23 from varchar to int, as I wanted to perform calculations such as sum and average on their values.
• I changed the datatype of attribute HHAS4 from varchar to int for the same reason.
• Table 1 below lists a few more attributes whose datatype I changed from varchar to int so that such calculations could be performed.
Table 1
Attribute   Original Datatype   Changed Datatype
MBGARDEN    varchar             int
MBCRAFT     varchar             int
MBBOOKS     varchar             int
MBCOLECT    varchar             int
MAGFAML     varchar             int
MAGFEM      varchar             int
MAGMALE     varchar             int
PUBGARDN    varchar             int
PUBHLTH     varchar             int
PUBDOITY    varchar             int
PUBNEWFN    varchar             int
PUBPHOTO    varchar             int
PUBOPP      varchar             int
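The varchar-to-int conversions in Table 1 can be sketched in pandas, assuming the data has been loaded into a DataFrame; the sample values below are made up.

```python
# Hedged sketch of the Table 1 datatype conversions (toy values).
import pandas as pd

df = pd.DataFrame({"MBGARDEN": ["0", "1", "2"], "MBCRAFT": ["3", "", "1"]})

to_int = ["MBGARDEN", "MBCRAFT"]  # extend with MBBOOKS, MAGFAML, PUB..., etc.
for col in to_int:
    # empty strings become NaN first, then nullable integers
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

print(df["MBGARDEN"].sum())  # numeric aggregations now work: 3
```

The nullable Int64 dtype keeps blank entries as missing values instead of failing the conversion, so sums and averages simply skip them.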
3.0 Missing Data
Analyzing the target dataset, I uncovered the following missing data, listed together with the preprocessing performed so the data works with the different algorithms.
3.1 Gender
I used the TCODE value to infer gender for records with missing gender values.
3.2 Zip
Some zip values had a trailing '-'. I corrected these values by removing the trailing '-'.
3.3 Age and DOB
There were many null values for the age attribute. The dataset also contains a DOB (date of birth) attribute, which I consulted to fill in the missing ages; however, for every record with a missing age, the DOB value was also missing.
I then considered replacing the missing ages with a statistic, such as the average age of current donors per state. Because the number of records with a null age was very large, I decided that imputing them could bias the dataset significantly, so I left them missing.
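The cleaning steps above can be sketched on a toy frame. The TCODE-to-gender mapping shown here (1 for a male title, 2 for a female title) is an assumption for illustration, not the paper's exact rule; AGE is deliberately left missing, matching the decision above.

```python
# Hedged sketch of the missing-data handling (toy values; assumed TCODE map).
import pandas as pd

df = pd.DataFrame({
    "GENDER": ["M", None, None],
    "TCODE": [1, 2, 1],
    "ZIP": ["61081-", "30738", "94117-"],
    "AGE": [54.0, None, None],
})

tcode_to_gender = {1: "M", 2: "F"}                      # assumed mapping
df["GENDER"] = df["GENDER"].fillna(df["TCODE"].map(tcode_to_gender))
df["ZIP"] = df["ZIP"].str.rstrip("-")                   # drop trailing '-'
# AGE stays NaN: with this many missing values, imputation would bias the data

print(df["GENDER"].tolist())  # ['M', 'F', 'M']
print(df["ZIP"].tolist())     # ['61081', '30738', '94117']
```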
4.0 Attribute Selection
4.1 Methodology
Step-1: Logic Based Selection
For attribute selection, I analyzed the target dataset with the objective of identifying possible donors. The assumptions I used when selecting attributes from the list of candidates are listed below; these became the evaluating conditions for my logic-based selection:
1. Higher education implies a higher probability of the person being a future donor.
2. Higher income (per capita income, income level, number of vehicles, median income, number of employed persons, etc.) implies a higher probability of the person being a future donor.
3. A house in a good locality implies a higher probability of the person being a future donor.
4. A bigger house implies a higher probability of the person being a future donor.
5. A person renting and paying a high rent implies a higher probability of the person being a future donor.
6. A person on active duty, or who served in the military in the past, implies a higher probability of the person being a future donor.
7. The dollar amount, frequency, and recency of gifts all give a good indication of the probability of future donations.
8. The number, type, and recency of promotions, and a person's responsiveness to these efforts, give a good indication of the probability of future donations.
9. I also identified some attributes that are negatively correlated with the possibility of future donations, making these variables equally important for estimating future donors:
a. Persons living on social security will probably not be future donors.
b. Persons working in professions without much expendable income will probably not be future donors.
c. Persons living in rural areas are less likely to donate.
In total, I identified 122 attributes out of the 448 possible attributes.
After selecting these 122 attributes, I applied Weka's Chi-Squared Attribute Evaluator to cross-check my logic-based selection and apply learnings from the class. Its output was analyzed and compared with my subset of 122 attributes; the following 95 attributes matched both my logic-based selection and the Chi-Squared evaluator output.
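The Chi-Squared ranking used for this cross-check can be sketched outside Weka as well; the snippet below uses scikit-learn's chi-squared scorer on synthetic non-negative features. In Weka this corresponds to ChiSquaredAttributeEval with the Ranker search method.

```python
# Hedged sketch of Chi-Squared attribute ranking (synthetic data).
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(300, 5)).astype(float)   # chi2 needs X >= 0
# class label driven by attributes 0 and 3, plus noise
y = (X[:, 0] + X[:, 3] + rng.normal(0.0, 1.0, 300) > 9).astype(int)

scores, p_values = chi2(X, y)
ranking = np.argsort(scores)[::-1]                     # best attribute first
print("chi2 scores:", np.round(scores, 2))
print("attribute ranking:", ranking.tolist())
```

The ranked list can then be intersected with a hand-picked, logic-based subset, mirroring the two-pronged selection described above.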
4.2 Included Attributes
Attribute Name Attribute Description Reason
TARGET_D
Donation Amount (in $) associated with the
Response to 97NK Mailing
Donor amount gives indication of the
amount of donation a donor can
provide based on the 1997 donation
history
IC5 Per Capita Income
Per capita income indicates the wealth of the donor base; the higher it is, the higher the likely probability of donation
ZIP Zipcode
Zip code groups donors with similar income ranges and hence similar probability of donation
POP901
Number of Persons in donor’s neighborhood, as
collected from the 1990 US Census.
The number of persons in the donor's neighborhood is indicative of income range, as richer neighborhoods tend to have fewer persons.
AVGGIFT Average dollar amount of gifts to date
Average amount of gift to date is an
attribute which provides good
indication of probability of donation.
HV1
Median Home Value in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median home value is indication of
income and higher the median home
value higher probability of donation
HV2
Average Home Value in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average home value is an indication of income; the higher the average home value, the higher the probability of donation
RAMNTALL Dollar amount of lifetime gifts to date
Total Dollar amount of gift to date is
an attribute which provides good
indication of probability of donation.
IC3
Average Household Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average household income is
indication of income and higher the
average household income higher
probability of donation
IC2
Median Family Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median Family income is indication
of income and higher the median
family income higher probability of
donation
IC4
Average Family Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average Family income is indication
of income and higher the average
family income higher probability of
donation
IC1
Median Household Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median household income is
indication of income and higher the
median household income higher
probability of donation
OSOURCE
Code indicating which mailing list the donor was
originally acquired from
This attribute indicates the chances of
donation based on the marketing
approach to the donor.
LASTGIFT
Dollar amount of most recent gift from giving
history file
The dollar amount of the most recent gift is a good indicator of donation probability: the higher the recent gift, the more likely the donor is to repeat the donation
MAXRAMNT
Dollar amount of largest gift to date from giving
history file
The dollar amount of the largest gift is a good indicator of donation probability: the larger the gift, the more likely the donor is to repeat the donation
RFA_3 Donor's RFA status as of 96NK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_4
Donor's RFA status as of 96TK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_6
Donor's RFA status as of 96LL promotion date from
promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_8
Donor's RFA status as of 96GK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2
Donor's RFA status as of 97NK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
FISTDATE Date of first gift from giving history file
The date of the first gift indicates how long the donor has been involved in donating, making it a good attribute for estimating future donations.
RFA_12
Donor's RFA status as of 96XK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
MINRAMNT
Dollar amount of smallest gift to date from giving
history file
Dollar amount of the smallest gift is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
RFA_11
Donor's RFA status as of 96X1 promotion date from
promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
NGIFTALL
Number of lifetime gifts to date from promotion
history File
The number of lifetime gifts to date is a good measure for estimating future donations: the higher it is, the higher the probability of future donations.
RFA_2F Frequency code for RFA_2
The frequency component of the RFA measure indicates how frequent the donor's past donations have been, and is hence a good estimator of future donations
RFA_7 Donor's RFA status as of 96G1 promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2A Donation Amount code for RFA_2
The amount component of the RFA measure indicates how large the donor's past donations have been, and is hence a good estimator of future donations
RFA_9 Donor's RFA status as of 96CC promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
MAXRDATE Date associated with the largest gift to date
The date of the largest gift is important for understanding how long ago the donor gave their largest gift, and is a good estimator of future donations
NUMPROM Lifetime number of promotions received to date
Lifetime number of promotions is
good estimator for future donations as
it shows how much the donor is
responsive to the marketing effort for
donation
CARDGIFT Number of lifetime gifts to card promotions to date
Number of lifetime gifts to card
promotions is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
RFA_16 Donor's RFA status as of 95LL promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
ODATEDW Date of donor's first gift
The date of the first gift indicates how long the donor has been involved in donating, making it a good attribute for estimating future donations.
NEXTDATE Date of second gift
The date of the second gift likewise indicates how long the donor has been donating, making it a good attribute for estimating future donations.
RFA_14 Donor's RFA status as of 95NK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
HVP1
Percent Home Value >= $200,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_18 Donor's RFA status as of 95GK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_5 Donor's RFA status as of 96SK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
MINRDATE Date associated with the smallest gift to date
Date of the smallest gift is negatively
correlated to the future donation and
hence is a good attribute to estimate
future donation
HVP2
Percent Home Value >= $150,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
HVP6
Percent Home Value >= $300,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_19 Donor's RFA status as of 95CC promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_10 Donor's RFA status as of 96WL promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RP1 Percent Renters Paying >= $500 per Month
Rent of home is indication of income
and higher the home rent higher
probability of donation
CARDPROM
Lifetime number of card promotions received to
date.
Number of lifetime gifts to card
promotions is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
RFA_17 Donor's RFA status as of 95G1 promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
HHAS3
Percent Households w/ Interest, Rental or Dividend
Income in donor’s neighborhood, as collected from
the 1990 US Census.
Households with interest, rental or
dividend income is indication of
income and higher the attribute higher
probability of donation
HVP3
Percent Home Value >= $100,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_13 Donor's RFA status as of 95FS promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
SEC5
Percent Persons in College in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent persons in college is
indication of education and better the
education higher will be probability
of donation
LFC3
Percent Females in Labor Force in donor’s
neighborhood, as collected from the 1990 US
Census.
The percentage of females in the labor force is an indication of income; the higher it is, the higher the probability of donation
HVP5
Percent Home Value >= $50,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
NUMPRM12
Number of promotions received in the last 12
months
Number of promotions received in
last 12 months is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
EC4
Percent Adults 25+ Completed High School or
Equivalency in donor’s neighborhood, as collected
from the 1990 US Census.
Percent adults completing high school
or equivalent education is indication
of education and better the education
higher will be probability of donation
HU5
Percent Seasonal/Recreational Vacant Units in
donor’s neighborhood, as collected from the 1990
US Census.
Percent Seasonal/Recreational vacant
unit is indication of income and
higher the attribute higher probability
of donation
HUR2
Percent >= 6 Room Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 6+ room houses is indication
of income and higher the attribute
higher probability of donation
HVP4
Percent Home Value >= $75,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
LFC5
Percent Adult Females Employed in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent female in labor is indication
of income and higher the percent in
labor force higher will be probability
of donation
DW1
Percent Single Unit Structure in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Single unit structure is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
HU1
Percent Owner Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
AFC1
Percent Adults in Active Military Service in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Adults in active military
service is indication of association
with military and higher the percent
in active military service higher will
be probability of donation
VC3
Percent WW2 Veterans Age 16+ in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent WW2 veterans is indication
of association with military and
higher the percent of WW2 veterans
higher will be probability of donation
RP3
Percent Renters Paying >= $300 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Rent of home is indication of income
and higher the home rent higher
probability of donation
DW2
Percent Detached Single Unit Structure in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent detached single unit structure
is negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
RFA_22 Donor's RFA status as of 95XK promotion date
RFA is a good measure of the
probability that the donor will repeat
the donation.
WWIIVETS % WWII Vets
Percent WW2 veterans is indication
of association with military and
higher the percent of WW2 veterans
higher will be probability of donation
HHN2
Percent 2 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 person household is
indication of no future liability on
part of donor and hence the person is
more likely to be future donor
VOC2
Percent Households w/ 2+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent household with 2+ vehicles is
indication of income and higher the
number of vehicles higher probability
of donation
HU2
Percent Renter Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied housing is
positively correlated to the future
donation and hence is a good attribute
to estimate future donation provided
the rent paid is high
AGE Overlay Age
Higher the age higher the probability
of future donations.
HC4
Percent Owner Occupied Structures Built Since
1985 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
LFC7
Percent 2 Parent Earner Families in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 parent earner family is
indication of income and higher the
percent higher probability of donation
RP2
Percent Renters Paying >= $400 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Rent of home is indication of income
and higher the home rent higher
probability of donation
AFC2
Percent Males in Active Military Service in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males in active military
service is indication of association
with military and higher the percent
of attribute higher will be probability
of donation
IC23
Percent Families w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent families with income >150k
is indication of income and higher the
attribute higher probability of
donation
OCC9
Percent Farmers in donor’s neighborhood, as
collected from the 1990 US Census.
Percent farmers is negatively
correlated to the future donation and
hence is a good attribute to estimate
future donation as farmers generally
do not have expendable surplus
IC14
Percent Households w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent households with income
>150k is indication of income and
higher the attribute higher probability
of donation
EC1
Median Years of School Completed by Adults 25+
in donor’s neighborhood, as collected from the 1990
US Census.
Median years of school completed is
indication of education and better the
education higher will be probability
of donation
LASTDATE Date associated with the most recent gift
The date of the most recent gift gives
an idea of how long the donor has
been involved in donation, and is a
good attribute for estimating future donations.
LFC6
Percent Mothers Employed Married and Single in
donor’s neighborhood, as collected from the 1990
US Census.
Percent of mothers employed, married
and single, gives an idea about the
income and liability of the neighborhood
and is a good indication of future
donations.
HUPA6
Percent Renter Occupied, 5+ Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied 5+ units is
indication of income and higher the
attribute higher probability of
donation
RFA_24 Donor's RFA status as of 94NK promotion date
RFA is a good measure of the
probability that the donor will repeat
the donation.
HU4
Percent Vacant Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent vacant housing units in donor
neighborhood is good indication of
future donations as higher the vacant
units implies higher number of
second homes.
HHAS1
Percent Households on Social Security in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent households on social security
is negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as person
on social security seldom donate.
HC19
Percent Housing Units w/ Public Sewer Source in
donor’s neighborhood, as collected from the 1990
US Census.
Percent housing with public sewer is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as
housings with public sewer generally
are low income housing
VOC1
Percent Households w/ 1+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent household with 1+ vehicles is
indication of income and higher the
number of vehicles higher probability
of donation
HC7
Percent Owner Occupied Structures Built Since
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
POP90C2
Percent Population Outside Urbanized Area in
donor’s neighborhood, as collected from the 1990
US Census.
Percent outside urbanized area is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as people
outside urbanized area seldom donate
EIC1
Percent Employed in Agriculture in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent employed in agriculture is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as people
employed in agriculture seldom has
expendable income
HC8
Percent Owner Occupied Structures Built Prior to
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
MALEMILI
% Males active in the Military in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males active in military is
indication of association with military
and higher the percent higher will be
probability of donation
LFC2
Percent Adult Males in Labor Force in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent adult males in labor force is
indication of income and higher the
percent higher will be probability of
donation
OEDC5
Percent Private Profit Wage or Salaried Worker in
donor’s neighborhood, as collected from the 1990
US Census.
Percent private profit wage or salaried
worker is indication of income and
higher the percent higher will be
probability of donation as generally
private profit wage is high and person
has expendable income
In analyzing the output from the Chi Squared attribute evaluator, I also eliminated some of
the high-ranked attributes from the output generated by Weka. The rationale behind the elimination
was either irrelevance to the objective of identifying future donors or
multicollinearity among the attributes (more than one attribute conveying the same information). The
purpose is to select an appropriate number of attributes that help predict future donors. Following
is the list of attributes that were eliminated from my final attribute selection, with the reason for
elimination.
4.3 Omitted Attributes
Attribute Name Attribute Description Reason
CONTROLN Control number (unique record identifier)
This is a unique identifier number and
adds no value in identifying future donor.
POP903
Number of Households in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood –
POP901), and thus adds no additional
value in identifying future donor.
POP902
Number of Families in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
DOB Date of birth of Donor
This attribute is multicollinear with an
attribute in our final attribute list (AGE),
and thus adds no additional value in
identifying future donor.
HHP2
Average Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
HHP1
Median Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
MSA MSA Code
This is a geographic area code (Metropolitan
Statistical Area) and adds no value in identifying future donor.
ADI ADI Code
This is a media-market code (Area of Dominant
Influence) and adds no value in identifying future donor.
DMA DMA Code
This is a media-market code (Designated Market
Area) and adds no value in identifying future donor.
TPE13 Percent Traveling 15 - 59 Minutes to Work
This attribute adds no value in identifying
future donor as traveling time doesn’t
decide if a person will be future donor.
ETHC3
Percent White Age 60+ in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as person’s race doesn’t
decide if a person will be future donor.
TCODE Donor title code
This attribute adds no value in identifying
future donor as person’s title doesn’t
decide if a person will be future donor.
DW7
Percent Group Quarters in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as group quarters doesn’t
decide if a person will be future donor.
POBC2
Percent Born in State of Residence in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as person’s affinity to a
place doesn’t decide if a person will be
future donor.
MARR4
Percent Never Married in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
VC4
Percent Veterans Serving After May 1975 Only
in donor’s neighborhood, as collected from the
1990 US Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(Number of military service persons,
active military etc.). Thus adding no
additional value in identifying future
donor.
DW9
Non-Institutional Group Quarters in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as living in group quarters doesn’t
decide if a person will be future donor.
HU3
Percent Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(Number of vacant house, house type,
number of houses etc.). Thus adding no
additional value in identifying future
donor.
ETH1
Percent White in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as person’s race doesn’t
decide if a person will be future donor.
MARR1
Percent Married in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
HHN1
Percent 1 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as living alone doesn’t
decide if a person will be future donor.
HHD3
Percent Married Couple Families in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.
As the large number of attributes (95) resulted in unsatisfactory statistics for the trained model, I
ran the attribute selection methods available in Weka to gain a better understanding of the
impact of attributes on model accuracy.
I ran ChiSquaredAttributeEval, GainRatioAttributeEval and InfoGainAttributeEval; the following are
the results of these three evaluators from Weka.
Chi Squared Gains Ratio Information Gain
Ranked attributes: Ranked attributes: Ranked attributes:
472 TARGET_D 472 TARGET_D 73 IC5
470 CONTROLN 362 ADATE_2 74 ZIP
203 IC5 363 ADATE_3 76 POP901
5 ZIP 470 CONTROLN 79 HV2
76 POP901 203 IC5 78 HV1
469 AVGGIFT 5 ZIP 77 AVGGIFT
146 HV1 14 MDMAUD 83 IC4
147 HV2 76 POP901 82 IC2
78 POP903 469 AVGGIFT 81 IC3
77 POP902 366 ADATE_6 84 IC1
457 RAMNTALL 147 HV2 80 RAMNTALL
201 IC3 146 HV1 85 OSOURCE
200 IC2 78 POP903 93 FISTDATE
202 IC4 77 POP902 90 RFA_6
199 IC1 364 ADATE_4 8 MAXRDATE
8 DOB 95 ETH12 47 VOC2
2 OSOURCE 8 DOB 9 NUMPROM
464 LASTGIFT 41 PUBPHOTO 91 RFA_8
462 MAXRAMNT 457 RAMNTALL 94 RFA_12
386 RFA_3 475 RFA_2F 15 HVP1
136 HHP2 476 RFA_2A 86 LASTGIFT
135 HHP1 380 ADATE_20 88 RFA_3
387 RFA_4 202 IC4 2 RFA_11
389 RFA_6 201 IC3 45 WWIIVETS
391 RFA_8 479 MDMAUD_A 42 RP3
196 MSA 200 IC2 18 MINRDATE
385 RFA_2 199 IC1 67 POP90C2
466 FISTDATE 464 LASTGIFT 89 RFA_4
395 RFA_12 98 ETH15 30 LFC3
460 MINRAMNT 2 OSOURCE 5 RFA_7
394 RFA_11 385 RFA_2 87 MAXRAMNT
458 NGIFTALL 462 MAXRAMNT 13 NEXTDATE
475 RFA_2F 145 DW9 7 RFA_9
390 RFA_7 367 ADATE_7 34 HU5
476 RFA_2A 386 RFA_3 64 HC19
392 RFA_9 387 RFA_4 60 HUPA6
463 MAXRDATE 409 MAXADATE 66 HC7
197 ADI 460 MINRAMNT 69 HC8
410 NUMPROM 80 POP90C2 38 DW1
329 POBC2 81 POP90C3 28 RFA_13
134 MARR4 410 NUMPROM 56 IC14
193 RP2 171 ETHC5 6 RFA_2A
304 AFC2 221 IC23 70 MALEMILI
221 IC23 412 NUMPRM12 55 OCC9
312 VC4 187 HUPA3 12 ODATEDW
262 OCC9 3 TCODE 54 IC23
212 IC14 399 RFA_16 40 AFC1
145 DW9 332 LSC3
152 HU3 397 RFA_14
290 EC1 290 EC1
465 LASTDATE 91 ETH8
84 ETH1 334 VOC1
249 LFC6 356 HC20
190 HUPA6 235 TPE7
407 RFA_24 174 HVP2
153 HU4 465 LASTDATE
222 HHAS1 231 TPE3
355 HC19 339 HC3
334 VOC1 368 ADATE_8
131 MARR1 401 RFA_18
343 HC7 327 ANC15
125 HHN1 87 ETH4
80 POP90C2 35 MAGMALE
267 EIC1 331 LSC2
344 HC8 28 HIT
44 MALEMILI 238 PEC1
245 LFC2 154 HU5
157 HHD3 190 HUPA6
287 OEDC5 177 HVP5
I then conducted an overlap (intersection) analysis of the attributes across the above evaluator
methods and selected the highly ranked attributes common to these results as the final set of
45 attributes, listed below.
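The overlap analysis amounts to a set intersection over the evaluator rankings. A minimal sketch in Python; the lists below are short illustrative excerpts of each evaluator's top-ranked attributes, not the full Weka output (TARGET_D and CONTROLN are omitted here since they were eliminated):

```python
# Illustrative excerpts of the top-ranked attributes from each evaluator.
chi_squared = ["IC5", "ZIP", "POP901", "AVGGIFT", "HV1", "HV2", "RAMNTALL", "DOB"]
gain_ratio  = ["ADATE_2", "IC5", "ZIP", "POP901", "AVGGIFT", "HV2", "HV1", "RAMNTALL"]
info_gain   = ["IC5", "ZIP", "POP901", "HV2", "HV1", "AVGGIFT", "IC4", "RAMNTALL"]

# Attributes ranked highly by all three evaluators are strong candidates
# for the final attribute list.
common = set(chi_squared) & set(gain_ratio) & set(info_gain)
print(sorted(common))
```

Attributes that appear in only one or two rankings (DOB, ADATE_2, IC4 above) then need a judgment call before inclusion.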
Final Attributes Attribute Description
IC5 Per Capita Income
ZIP Zipcode
POP901 Number of Persons in donor’s neighborhood, as collected from the 1990 US Census.
AVGGIFT Average dollar amount of gifts to date
HV1 Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
HV2 Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC4 Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC3 Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC2 Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC1 Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
RAMNTALL Dollar amount of lifetime gifts to date
DOB Date of birth of Donor
OSOURCE Code indicating which mailing list the donor was originally acquired from
RFA_4 Donor's RFA status as of 96TK promotion date from promotion history File
RFA_6 Donor's RFA status as of 96LL promotion date from promotion history File
RFA_8 Donor's RFA status as of 96GK promotion date from promotion history File
RFA_3 Donor's RFA status as of 96NK promotion date
FISTDATE Date of first gift from giving history file
RFA_12 Donor's RFA status as of 96XK promotion date from promotion history File
MAXRAMNT Dollar amount of largest gift to date from giving history file
RFA_2 Donor's RFA status as of 97NK promotion date from promotion history File
RFA_9 Donor's RFA status as of 96CC promotion date
RFA_7 Donor's RFA status as of 96G1 promotion date
RFA_11 Donor's RFA status as of 96X1 promotion date from promotion history File
RFA_2A Donation Amount code for RFA_2
RFA_2F Frequency code for RFA_2
HVP2 Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
HVP6 Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census.
RFA_18 Donor's RFA status as of 95GK promotion date
WWIIVETS % WWII Vets
HHAS3
Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990
US Census.
HUR2 Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census.
NGIFTALL Number of lifetime gifts to date from promotion history File
HVP5 Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census.
CARDGIFT Number of lifetime gifts to card promotions to date
LASTGIFT Dollar amount of most recent gift from giving history file
AFC1 Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
AFC2 Percent Males in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
IC23 Percent Families w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
IC14 Percent Households w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
EC1
Median Years of School Completed by Adults 25+ in donor’s neighborhood, as collected from the 1990 US
Census.
LASTDATE Date associated with the most recent gift
VOC1 Percent Households w/ 1+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census.
POP90C2 Percent Population Outside Urbanized Area in donor’s neighborhood, as collected from the 1990 US Census.
5.0 Models
5.1 Model-1
NaiveBayes 10 Fold Cross Validation Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
8,843 1,156 a = 0
1,178 275 b = 1
• TP - 8,843 non-donor instances were correctly identified as non-donors by the model.
• TN- 275 donor instances were correctly identified as donors by the model.
• FP- 1,178 donor instances were incorrectly identified as non-donors by the model.
• FN – 1,156 non-donor instances were incorrectly identified as donors by the model.
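The metrics Weka reports (accuracy, precision, recall, F-measure) can be recomputed by hand from this matrix. A minimal Python sketch, treating non-donor as the positive class to match the TP/TN/FP/FN labels used in this report:

```python
# Metrics for Model-1 recomputed from the confusion matrix above.
# Non-donor (class 0) is treated as the positive class here.
tp, fn = 8843, 1156   # actual non-donors: predicted non-donor / donor
fp, tn = 1178, 275    # actual donors:     predicted non-donor / donor

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f={f_measure:.4f}")
```

Note that accuracy alone is flattering here: most instances are non-donors, so a model that rarely predicts "donor" still scores close to 0.80.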
5.2 Model-2:
NaiveBayes 10 Fold Cross Validation with Test Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3265 458 a = 0
374 110 b = 1
• TP – 3,265 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP- 374 donor instances were incorrectly identified as non-donors by the model.
• FN – 458 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3289 434 a = 0
367 117 b = 1
• TP – 3,289 non-donor instances were correctly identified as non-donors by the model.
• TN- 117 donor instances were correctly identified as donors by the model.
• FP- 367 donor instances were incorrectly identified as non-donors by the model.
• FN – 434 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is
not very high.
• Hence the model does not overfit the data.
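This check amounts to comparing the accuracies computed from the two confusion matrices above; a minimal sketch:

```python
# Compare Model-2 accuracy on the supplied test set vs. the evaluation set,
# using the two confusion matrices above.
def accuracy(matrix):
    # matrix rows: actual class; columns: predicted class
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

test_acc = accuracy([[3265, 458], [374, 110]])   # test set
eval_acc = accuracy([[3289, 434], [367, 117]])   # evaluation set

# A small gap between the two accuracies suggests the model generalizes
# rather than memorizing the training data.
print(round(test_acc, 4), round(eval_acc, 4), round(abs(test_acc - eval_acc), 4))
```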
Other Models
5.3 Model-3
J48Graft Training Set Model with Test Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
J48Graft Training Set Model with Evaluation Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Snapshot of Tree
The number of non-donors in the training set for this algorithm was 11,452 and the number of
donors was 1,453. When the J48graft algorithm was run, it used the majority class (non-donors) as the
root node and did not create any further classification beyond what is shown in the figure below.
Conclusion:
The model classified every instance as a non-donor, the majority class in the training set.
The model failed to identify any True Negatives or False Negatives.
5.4 Model-4
Decision Stump 10 Fold Cross Validation Model Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
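A decision stump is a one-level decision tree: a single threshold split on one attribute. A minimal sketch on toy numbers (not the donor data); `fit_stump` is an illustrative helper, not Weka's implementation:

```python
# Decision stump: try every threshold on one numeric attribute and keep
# the split with the fewest misclassifications. Toy data below.
def fit_stump(values, labels):
    best = None
    for t in sorted(set(values)):
        for low, high in ((0, 1), (1, 0)):  # which side predicts which class
            pred = [high if v >= t else low for v in values]
            errors = sum(p != y for p, y in zip(pred, labels))
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    return best

# toy sample: donors (1) have higher values of the attribute
values = [5, 8, 12, 20, 25, 40]
labels = [0, 0, 0, 1, 1, 1]
errors, threshold, low, high = fit_stump(values, labels)
print(errors, threshold)
```

When no single split beats "predict the majority class everywhere", the stump degenerates to exactly the all-non-donor behavior seen in the matrix above.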
5.5 Model-5
Decision Stump 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3732 0 a = 0
484 0 b = 1
• TP - 3732 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
Decision Stump Cross Validation Model with Evaluation Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, the majority class in the dataset.
• The model failed to identify any True Negatives or False Negatives.
5.6 Model-6
OneR 10 Fold Cross Validation Model Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9571 428 a = 0
1406 47 b = 1
• TP – 9,571 non-donor instances were correctly identified as non-donors by the model.
• TN- 47 donor instances were correctly identified as donors by the model.
• FP- 1406 donor instances were incorrectly identified as non-donors by the model.
• FN - 428 non-donor instances were incorrectly identified as donors by the model.
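OneR builds its entire rule from the single attribute whose per-value majority prediction makes the fewest errors, which is why it can pick up a few donors (47 above) that the one-level baselines miss. A minimal sketch on a toy discretized dataset (attribute values are illustrative):

```python
# OneR: for each attribute, predict the majority class per attribute
# value, and keep the attribute whose rule makes the fewest errors.
from collections import Counter, defaultdict

def one_r(rows, labels):
    best = None
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[a]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best

rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = [1, 1, 0, 0]  # attribute 0 separates the classes perfectly
errors, attr, rule = one_r(rows, labels)
print(errors, attr, rule)
```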
5.7 Model-7
OneR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3564 159 a = 0
455 29 b = 1
• TP – 3564 non-donor instances were correctly identified as non-donors by the model.
• TN- 29 donor instances were correctly identified as donors by the model.
• FP- 455 donor instances were incorrectly identified as non-donors by the model.
• FN - 159 non-donor instances were incorrectly identified as donors by the model.
OneR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3566 157 a = 0
454 30 b = 1
• TP – 3566 non-donor instances were correctly identified as non-donors by the model.
• TN- 30 donor instances were correctly identified as donors by the model.
• FP- 454 donor instances were incorrectly identified as non-donors by the model.
• FN – 157 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is
not very high.
• Hence the model does not overfit the data.
5.8 Model-8:
ZeroR 10 Fold Cross Validation Model Statistics
Algorithm: ZeroR algorithm was used.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
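ZeroR simply predicts the majority class for every instance, which is why the donor column of the matrix above is all zeros: with 9,999 non-donors against 1,453 donors, every instance is predicted non-donor. A minimal sketch:

```python
# ZeroR: ignore all attributes and always predict the majority class.
from collections import Counter

def zero_r(labels):
    return Counter(labels).most_common(1)[0][0]

# class distribution from the training data above
labels = [0] * 9999 + [1] * 1453
majority = zero_r(labels)
print(majority)  # every prediction is this class
```

ZeroR is useful only as a baseline: any real model should beat its accuracy of 9,999 / 11,452.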
5.9 Model-9
ZeroR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: ZeroR algorithm was used to generate model.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
ZeroR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: ZeroR algorithm was used to generate the model.
Test Options:
• ZeroR algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, the majority class in the dataset.
• The model failed to identify any donors: there are no True Negatives, and every donor
instance became a False Positive.
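The behaviour above is exactly what a majority-class baseline does. A minimal sketch (illustrative only; Weka's ZeroR works internally, and these helper names are hypothetical):

```python
# Sketch of the majority-class baseline that ZeroR implements: it ignores
# every attribute and always predicts the most frequent class in the data.
from collections import Counter

def zero_r_fit(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_predict(majority, n):
    """Predict the majority class for every one of n instances."""
    return [majority] * n

# With 9,999 non-donors (0) and 1,453 donors (1), the majority class is 0,
# so every donor is misclassified and donor-class recall is 0.
train = [0] * 9999 + [1] * 1453
majority = zero_r_fit(train)
preds = zero_r_predict(majority, len(train))
print(majority)                     # 0
print(sum(p == 1 for p in preds))   # 0 donors predicted
```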
6.0 Different Numbers of Attributes with the Same Number of Records
using the NaiveBayes Model
Note: In all the models below, TARGET_B has been used as the class attribute.
6.1 5 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1 were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9795 204 a = 0
1415 38 b = 1
• TP – 9795 non-donor instances were correctly identified as non-donors by the model.
• TN- 38 donor instances were correctly identified as donors by the model.
• FP- 1415 donor instances were incorrectly identified as non-donors by the model.
• FN – 204 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3649 74 a = 0
473 11 b = 1
• TP – 3649 non-donor instances were correctly identified as non-donors by the model.
• TN- 11 donor instances were correctly identified as donors by the model.
• FP- 473 donor instances were incorrectly identified as non-donors by the model.
• FN – 74 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3651 72 a = 0
471 13 b = 1
• TP – 3651 non-donor instances were correctly identified as non-donors by the model.
• TN- 13 donor instances were correctly identified as donors by the model.
• FP- 471 donor instances were incorrectly identified as non-donors by the model.
• FN – 72 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
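All of these runs use the same 10-fold cross-validation protocol: the data is split into ten folds, and each fold serves once as the held-out test set while the other nine are used for training. Weka performs this internally; a minimal sketch of the fold construction (the function name is ours, purely illustrative):

```python
# Minimal sketch of 10-fold cross-validation index construction.
# Weka's evaluator builds and evaluates the folds itself; this only
# illustrates the splitting scheme.
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; each instance is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 9,999 non-donors + 1,453 donors = 11,452 instances, as in the matrices above.
folds = list(k_fold_indices(11452, k=10))
print(len(folds))                           # 10
print(sum(len(test) for _, test in folds))  # 11452
```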
6.2 10 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1 were the attributes used to create
the model.
NaiveBayes 10 Fold Cross Validation Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9320 679 a = 0
1329 124 b = 1
• TP – 9320 non-donor instances were correctly identified as non-donors by the model.
• TN- 124 donor instances were correctly identified as donors by the model.
• FP- 1329 donor instances were incorrectly identified as non-donors by the model.
• FN – 679 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3472 251 a = 0
438 46 b = 1
• TP - 3472 non-donor instances were correctly identified as non-donors by the model.
• TN- 46 donor instances were correctly identified as donors by the model.
• FP- 438 donor instances were incorrectly identified as non-donors by the model.
• FN – 251 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3465 258 a = 0
439 45 b = 1
• TP - 3465 non-donor instances were correctly identified as non-donors by the model.
• TN- 45 donor instances were correctly identified as donors by the model.
• FP- 439 donor instances were incorrectly identified as non-donors by the model.
• FN – 258 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.3 15 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9493 506 a = 0
1351 102 b = 1
• TP - 9493 non-donor instances were correctly identified as non-donors by the model.
• TN- 102 donor instances were correctly identified as donors by the model.
• FP- 1351 donor instances were incorrectly identified as non-donors by the model.
• FN - 506 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3525 198 a = 0
442 42 b = 1
• TP - 3525 non-donor instances were correctly identified as non-donors by the model.
• TN- 42 donor instances were correctly identified as donors by the model.
• FP- 442 donor instances were incorrectly identified as non-donors by the model.
• FN - 198 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3514 209 a = 0
445 39 b = 1
• TP - 3514 non-donor instances were correctly identified as non-donors by the model.
• TN - 39 donor instances were correctly identified as donors by the model.
• FP - 445 donor instances were incorrectly identified as non-donors by the model.
• FN - 209 non-donor instances were incorrectly identified as donors by model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.4 20 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT were the attributes used
to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9405 594 a = 0
1309 144 b = 1
• TP - 9405 non-donor instances were correctly identified as non-donors by the model.
• TN- 144 donor instances were correctly identified as donors by the model.
• FP- 1309 donor instances were incorrectly identified as non-donors by the model.
• FN - 594 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3477 246 a = 0
431 53 b = 1
• TP - 3477 non-donor instances were correctly identified as non-donors by the model.
• TN- 53 donor instances were correctly identified as donors by the model.
• FP- 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 246 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3484 239 a = 0
431 53 b = 1
• TP - 3484 non-donor instances were correctly identified as non-donors by the model.
• TN - 53 donor instances were correctly identified as donors by the model.
• FP - 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 239 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.5 25 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9066 933 a = 0
1239 214 b = 1
• TP - 9066 non-donor instances were correctly identified as non-donors by the model.
• TN- 214 donor instances were correctly identified as donors by the model.
• FP- 1239 donor instances were incorrectly identified as non-donors by the model.
• FN - 933 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3350 373 a = 0
408 76 b = 1
• TP - 3350 non-donor instances were correctly identified as non-donors by the model.
• TN- 76 donor instances were correctly identified as donors by the model.
• FP- 408 donor instances were incorrectly identified as non-donors by the model.
• FN - 373 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3380 343 a = 0
386 98 b = 1
• TP - 3380 non-donor instances were correctly identified as non-donors by the model.
• TN - 98 donor instances were correctly identified as donors by the model.
• FP - 386 donor instances were incorrectly identified as non-donors by the model.
• FN - 343 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.6 30 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS were the attributes
used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8970 1029 a = 0
1213 240 b = 1
• TP - 8970 non-donor instances were correctly identified as non-donors by the model.
• TN- 240 donor instances were correctly identified as donors by the model.
• FP- 1213 donor instances were incorrectly identified as non-donors by the model.
• FN - 1029 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3305 418 a = 0
391 93 b = 1
• TP - 3305 non-donor instances were correctly identified as non-donors by the model.
• TN- 93 donor instances were correctly identified as donors by the model.
• FP- 391 donor instances were incorrectly identified as non-donors by the model.
• FN - 418 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3332 391 a = 0
377 107 b = 1
• TP - 3332 non-donor instances were correctly identified as non-donors by the model.
• TN - 107 donor instances were correctly identified as donors by the model.
• FP - 377 donor instances were incorrectly identified as non-donors by the model.
• FN - 391 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.7 35 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8863 1136 a = 0
1187 266 b = 1
• TP - 8863 non-donor instances were correctly identified as non-donors by the model.
• TN- 266 donor instances were correctly identified as donors by the model.
• FP- 1187 donor instances were incorrectly identified as non-donors by the model.
• FN - 1136 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3274 449 a = 0
380 104 b = 1
• TP - 3274 non-donor instances were correctly identified as non-donors by the model.
• TN- 104 donor instances were correctly identified as donors by the model.
• FP- 380 donor instances were incorrectly identified as non-donors by the model.
• FN - 449 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3296 427 a = 0
372 112 b = 1
• TP - 3296 non-donor instances were correctly identified as non-donors by the model.
• TN - 112 donor instances were correctly identified as donors by the model.
• FP - 372 donor instances were incorrectly identified as non-donors by the model.
• FN - 427 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.8 40 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT, LASTGIFT, AFC1, AFC2, IC23, IC14 were the attributes used
to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8835 1164 a = 0
1185 268 b = 1
• TP - 8835 non-donor instances were correctly identified as non-donors by the model.
• TN- 268 donor instances were correctly identified as donors by the model.
• FP- 1185 donor instances were incorrectly identified as non-donors by the model.
• FN - 1164 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3259 464 a = 0
374 110 b = 1
• TP - 3259 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP- 374 donor instances were incorrectly identified as non-donors by the model.
• FN - 464 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3290 433 a = 0
368 116 b = 1
• TP - 3290 non-donor instances were correctly identified as non-donors by the model.
• TN - 116 donor instances were correctly identified as donors by the model.
• FP - 368 donor instances were incorrectly identified as non-donors by the model.
• FN - 433 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
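Collecting the cross-validation confusion matrices of this section, the number of donors correctly identified (TN in this report's labelling, out of 1,453 donors) grows with the number of attributes. An illustrative summary script, with the counts taken from the matrices above:

```python
# Donors correctly identified in the 10-fold cross-validation runs, keyed by
# attribute count (values copied from the confusion matrices in this section).
donors_found = {5: 38, 10: 124, 15: 102, 20: 144, 25: 214, 30: 240, 35: 266, 40: 268}

TOTAL_DONORS = 1453
for n_attrs, found in donors_found.items():
    print(f"{n_attrs:>2} attributes: {found:>3}/{TOTAL_DONORS} donors found "
          f"({found / TOTAL_DONORS:.1%})")
# Adding attributes generally raises donor detection (2.6% with 5 attributes,
# 18.4% with 40), at the cost of more non-donors being flagged as donors.
```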
7.0 Performance Metrics
7.1 Calculations for Each Model – Precision and Sensitivity
The confusion matrices and the resulting precision and recall values for Models 1-7 are
summarized below. As in the rest of this report, class 0 (non-donor) is treated as the
positive class, and each confusion matrix is written as "a = 0 row / b = 1 row".

Precision (PPV) = TP / (TP + FP)
Sensitivity (Recall) = TP / (TP + FN)

Model                                  Confusion matrix       PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.1 Model-1: NaiveBayes 10-fold CV     8843 1156 / 1178 275   0.882   0.192   0.884      0.189
5.2 Model-2: NaiveBayes CV, test set   3265 458 / 374 110     0.897   0.194   0.877      0.227
    NaiveBayes CV, evaluation set      3289 434 / 367 117     0.900   0.212   0.883      0.242
5.3-5.5 Models 3-5 (Decision Stump and J48Graft): all predicted only the majority class
    test/evaluation-set runs (x2)      3723 0 / 484 0         0.885   n/a     1.000      0.000
    full-dataset runs (x3)             9999 0 / 1453 0        0.873   n/a     1.000      0.000
5.6 Model-6: OneR CV                   9571 428 / 1406 47     0.872   0.099   0.957      0.032
5.7 Model-7: OneR CV, test set         3564 159 / 455 29      0.887   0.154   0.957      0.060
    OneR CV, evaluation set            3566 157 / 454 30      0.887   0.160   0.958      0.062

(n/a: PPV is undefined when no instance is predicted as class 1, since TP + FP = 0.)
The same calculations for Models 8-15 (confusion matrices written as "a = 0 row / b = 1 row"):

Model                                      Confusion matrix      PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.8 Model-8: ZeroR CV                      9999 0 / 1453 0       0.873   n/a     1.000      0.000
5.9 Model-9: ZeroR CV, test set            3723 0 / 484 0        0.885   n/a     1.000      0.000
    ZeroR CV, evaluation set               3723 0 / 484 0        0.885   n/a     1.000      0.000
5.10 Model-10: NaiveBayes CV, 5 attrs      9795 204 / 1415 38    0.874   0.157   0.980      0.026
5.11 Model-11: NB CV, test set, 5 attrs    3649 74 / 473 11      0.885   0.129   0.980      0.023
    NB CV, evaluation set, 5 attrs         3651 72 / 471 13      0.886   0.153   0.981      0.027
5.12 Model-12: NaiveBayes CV, 10 attrs     9320 679 / 1329 124   0.875   0.154   0.932      0.085
5.13 Model-13: NB CV, test set, 10 attrs   3472 251 / 438 46     0.888   0.155   0.933      0.095
    NB CV, evaluation set, 10 attrs        3465 258 / 439 46     0.888   0.151   0.931      0.095
5.14 Model-14: NaiveBayes CV, 15 attrs     9493 506 / 1351 102   0.875   0.168   0.949      0.070
5.15 Model-15: NB CV, test set, 15 attrs   3525 198 / 442 42     0.889   0.175   0.947      0.087
    NB CV, evaluation set, 15 attrs        3514 209 / 445 39     0.888   0.157   0.944      0.081

(n/a: PPV is undefined when no instance is predicted as class 1, since TP + FP = 0.)
The same calculations for Models 16-25 (confusion matrices written as "a = 0 row / b = 1 row"):

Model                                      Confusion matrix       PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.16 Model-16: NaiveBayes CV, 20 attrs     9405 594 / 1309 144    0.878   0.195   0.941      0.099
5.17 Model-17: NB CV, test set, 20 attrs   3477 246 / 431 53      0.890   0.177   0.934      0.110
    NB CV, evaluation set, 20 attrs        3484 239 / 431 53      0.890   0.182   0.936      0.110
5.18 Model-18: NaiveBayes CV, 25 attrs     9066 933 / 1239 214    0.880   0.187   0.907      0.147
5.19 Model-19: NB CV, test set, 25 attrs   3350 373 / 408 76      0.891   0.169   0.900      0.157
    NB CV, evaluation set, 25 attrs        3380 343 / 386 98      0.898   0.222   0.908      0.202
5.20 Model-20: NaiveBayes CV, 30 attrs     8970 1029 / 1213 240   0.881   0.189   0.897      0.165
5.21 Model-21: NB CV, test set, 30 attrs   3305 418 / 391 93      0.894   0.182   0.888      0.192
    NB CV, evaluation set, 30 attrs        3332 391 / 377 107     0.898   0.215   0.895      0.221
5.22 Model-22: NaiveBayes CV, 35 attrs     8863 1136 / 1187 266   0.882   0.190   0.886      0.183
5.23 Model-23: NB CV, test set, 35 attrs   3274 449 / 380 104     0.896   0.188   0.879      0.215
    NB CV, evaluation set, 35 attrs        3296 427 / 372 112     0.899   0.208   0.885      0.231
5.24 Model-24: NaiveBayes CV, 40 attrs     8835 1164 / 1185 268   0.882   0.187   0.884      0.184
5.25 Model-25: NB CV, test set, 40 attrs   3259 464 / 374 110     0.897   0.192   0.875      0.227
    NB CV, evaluation set, 40 attrs        3290 433 / 368 116     0.899   0.211   0.884      0.240
7.2 Calculations for Each Model – Specificity and NPV
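The two metrics tabulated in this section are Specificity = TN / (TN + FP) and NPV = TN / (TN + FN), again with class 0 (non-donor) as the positive class. An illustrative check against the Model-1 and Model-6 values (the function is ours, not Weka output):

```python
# Specificity and Negative Predictive Value for a 2x2 confusion matrix
# laid out as [[TP, FN], [FP, TN]] with class 0 (non-donor) as positive.
def spec_npv(matrix):
    tp, fn = matrix[0]
    fp, tn = matrix[1]
    specificity = tn / (tn + fp)                       # TN / (TN + FP)
    # NPV is undefined (#DIV/0! in the tables) when TN + FN = 0:
    npv = tn / (tn + fn) if tn + fn else float("nan")  # TN / (TN + FN)
    return round(specificity, 3), round(npv, 3)

# Model-1 (NaiveBayes 10-fold cross-validation):
print(spec_npv([[8843, 1156], [1178, 275]]))  # (0.189, 0.192)
# Model-6 (OneR cross-validation):
print(spec_npv([[9571, 428], [1406, 47]]))    # (0.032, 0.099)
```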
Each entry below gives the model number and description, its confusion matrix (rows = actual class, columns = classified as), and the per-class TN, FP and FN counts with the resulting Specificity and NPV. Where a model classifies every record into class 0, TN and FN are both 0 for class 0 and its NPV is undefined (0/0).

5.1 Model-1: NaïveBayes Ten Fold Cross Validation Statistics
    a      b    ← classified as
  8,843  1,156    a = 0
  1,178    275    b = 1
  Class 0: TN 275, FP 1,178, FN 1,156 → Specificity 0.189, NPV 0.192
  Class 1: TN 8,843, FP 1,156, FN 1,178 → Specificity 0.884, NPV 0.882

5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Test set statistics
    a      b    ← classified as
  3,265    458    a = 0
    374    110    b = 1
  Class 0: TN 110, FP 374, FN 458 → Specificity 0.227, NPV 0.194
  Class 1: TN 3,265, FP 458, FN 374 → Specificity 0.877, NPV 0.897

5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Evaluation set statistics
    a      b    ← classified as
  3,289    434    a = 0
    367    117    b = 1
  Class 0: TN 117, FP 367, FN 434 → Specificity 0.242, NPV 0.212
  Class 1: TN 3,289, FP 434, FN 367 → Specificity 0.883, NPV 0.900

5.3 Model-3: J48Graft Training Set Model with Test set statistics
    a      b    ← classified as
  3,723      0    a = 0
    484      0    b = 1
  Class 0: TN 0, FP 484, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 3,723, FP 0, FN 484 → Specificity 1.000, NPV 0.885

5.3 Model-3: J48Graft Training Set Model with Evaluation set statistics
    a      b    ← classified as
  3,723      0    a = 0
    484      0    b = 1
  Class 0: TN 0, FP 484, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 3,723, FP 0, FN 484 → Specificity 1.000, NPV 0.885

5.4 Model-4: Decision Stump Cross Validation Model Statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.5 Model-5: Decision Stump Cross Validation Model with Test set statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.5 Model-5: Decision Stump Cross Validation Model with Evaluation set statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.6 Model-6: OneR Cross Validation Model Statistics
    a      b    ← classified as
  9,571    428    a = 0
  1,406     47    b = 1
  Class 0: TN 47, FP 1,406, FN 428 → Specificity 0.032, NPV 0.099
  Class 1: TN 9,571, FP 428, FN 1,406 → Specificity 0.957, NPV 0.872

5.7 Model-7: OneR Cross Validation Model with Test Set Statistics
    a      b    ← classified as
  3,564    159    a = 0
    455     29    b = 1
  Class 0: TN 29, FP 455, FN 159 → Specificity 0.060, NPV 0.154
  Class 1: TN 3,564, FP 159, FN 455 → Specificity 0.957, NPV 0.887

5.7 Model-7: OneR Cross Validation Model with Evaluation Set Statistics
    a      b    ← classified as
  3,566    157    a = 0
    454     30    b = 1
  Class 0: TN 30, FP 454, FN 157 → Specificity 0.062, NPV 0.160
  Class 1: TN 3,566, FP 157, FN 454 → Specificity 0.958, NPV 0.887
NPV = TN / (TN + FN)
Specificity = TN / (TN + FP)
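The Specificity and NPV columns follow from these two formulas in the same way. The sketch below (illustrative, not part of the original analysis) also handles the degenerate case seen with the J48Graft and Decision Stump models: when every record is classified into one class, TN + FN is 0 for the other class and NPV is undefined.

```python
def specificity_npv(cm):
    """Per-class (specificity, NPV) for a 2x2 confusion matrix.

    cm[actual][predicted]; an undefined 0/0 ratio is returned as None.
    """
    results = []
    for c in (0, 1):
        tn = cm[1 - c][1 - c]  # other class correctly classified
        fp = cm[1 - c][c]      # other class wrongly classified as c
        fn = cm[c][1 - c]      # class c wrongly classified as the other
        spec = tn / (tn + fp) if (tn + fp) else None
        npv = tn / (tn + fn) if (tn + fn) else None
        results.append((spec, npv))
    return results

# J48Graft test-set matrix: every record classified as a=0
cm = [[3723, 0], [484, 0]]
fmt = lambda v: "undefined" if v is None else f"{v:.3f}"
for c, (spec, npv) in enumerate(specificity_npv(cm)):
    print(f"class {c}: Specificity={fmt(spec)} NPV={fmt(npv)}")
# class 0: Specificity=0.000 NPV=undefined
# class 1: Specificity=1.000 NPV=0.885
```

This reproduces the table row for that model: Specificity 0.000 / 1.000, with class 0's NPV undefined and class 1's NPV 0.885.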
Page 77 | 100