Donor Datamining
Course 4957 – Special Topic Datamining
Jalaj Nautiyal
7-9-2015
Donor Datamining | Jalaj Nautiyal
Table of Contents
1.0 Executive Summary
2.0 Data Set Description
2.1 Attributes
2.1.1 Location
2.1.2 Income Level
2.1.3 Education
2.1.4 Median Home Value
2.1.5 Number of Donors
2.1.6 Dollar Amount of Gift
2.1.7 Average Dollar Amounts of gifts
2.1.8 Military Association
2.1.9 Type of Donor and RFA
2.1.10 Dollar Gift in 97NK
2.1.11 Per Capita
2.1.12 Correlation Matrix
2.1.13 Regression Coefficients
2.2 Attribute Data-Type
3.0 Missing Data
4.0 Attribute Selection
4.1 Methodology
Step-1: Logic Based Selection
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods
5.0 Models
5.1 Model-1
5.2 Model-2
5.3 Model-3
5.4 Model-4
5.5 Model-5
5.6 Model-6
5.8 Model-8
5.9 Model-9
6.0 Different number of attributes but same number of records using NaiveBayes Model
6.1 5 Attributes
6.2 10 Attributes
6.3 15 Attributes
6.4 20 Attributes
6.5 25 Attributes
6.5 30 Attributes
6.6 35 Attributes
6.7 40 Attributes
7.0 Performance Metrics
7.1 Calculations for Each Model – Precision and Sensitivity
7.2 Calculations for Each Model – Specificity and NPV
7.3 Calculations for Each Model – Accuracy and F-Measure
7.4 Comparison of performance of different models based on different algorithms and settings
7.5 Comparison of performance of NaiveBayes Algorithm with different number of attributes
8.0 Error Analysis
8.1 NaiveBayes
8.2 OneR Model
1.0 Executive Summary
For this project we were provided with a veterans' dataset. The dataset consists of many attributes collected from sources such as the census and mailing lists. The objective of the project is to use this dataset to identify probable donors.
Various data mining concepts taught in the course were used to analyze the dataset, build models, and identify probable donors. The dataset was first analyzed for data types and missing data. Weka was used to preprocess, analyze, and interpret the data.
Some of the important attributes were selected from the target dataset and analyzed in depth to understand the correlations among them and how much of the variation in the selected attributes explains the probability of a future donation.
To select attributes that can predict a probable donor, a two-pronged approach was used. First, a logic-based selection of attributes from the target dataset was conducted. In addition, Weka's ChiSquaredAttributeEval, GainRatioAttributeEval, and InfoGainAttributeEval were used to identify the top-ranked attributes. I then compared the results of the two approaches and included or excluded a few attributes to arrive at the final list.
The final list of attributes was run through the NaiveBayes, J48Graft, DecisionStump, OneR, and ZeroR algorithms to check whether the attributes selected by the above methodology are useful for predicting probable donors. The methodology and results are described in detail in the paper.
After the models were generated with these algorithms, they were applied to the final dataset, and various statistics were run in Weka to see how accurately each model fits the dataset. A list of control numbers was selected from this final dataset and submitted.
2.0 Data Set Description
The dataset has 47,705 records, each with 441 columns. Selected attributes and their associated statistics are shown below.
2.1 Attributes
2.1.1 Location
The geographic location of a person is important for identifying probable future donors. Of the many possible location attributes in the dataset, I analyzed states together with income levels to obtain the distribution of family and household incomes. A few of the states analyzed are shown below.
Figure 1 Income Distribution - California
Figure 2 Income Distribution - Colorado
Figure 3 Income Distribution - Florida
2.1.2 Income Level
Expendable income is an important consideration for estimating future donors. Hence the number of people below the poverty line, analyzed for each state, also indicates which states are likely to have probable donors.
Figure 4 Income Level
2.1.3 Education
Education is an important parameter in my analysis, and the number of magazine subscriptions serves as a proxy for the level of education. The higher the education, the higher the probability that the person becomes a donor.
Figure 5 Magazine Subscription
2.1.4 Median Home Value
Income is an important parameter for identifying probable donors, and the following chart shows the distribution of HV1 (median home value) across 50 states. The higher the value of HV1, the more likely the state is to yield probable donors.
Figure 6 Median Home Value
2.1.5 Number of Donors
The current number of donors in each state, obtained from the target dataset, is important information for identifying future probable donors. The following chart provides this information.
Figure 7 Number of Donors
2.1.6 Dollar Amount of Gift
The dollar amount of lifetime gifts to date by current donors is an important attribute for estimating the probability of a future donation. RAMNTALL is the dollar amount of lifetime gifts to date, and the following chart shows the average value of RAMNTALL in different states.
Figure 8 Life Time Dollar Amount
LASTGIFT is the dollar amount of the most recent gift. It gives an idea of how much a current donor gave most recently. The following chart shows the average LASTGIFT for different states.
Figure 9 Average Gift Amount of most recent gift
2.1.7 Average Dollar Amounts of gifts
The average dollar amount of gifts to date is also an important attribute for estimating future donors. The following chart shows, for each state, the average of the per-donor average gift amount to date (AVGGIFT).
Figure 10 Avg Dollar Amt of gifts to date
2.1.8 Military Association
Past or current association with the military is an important attribute for estimating probable donors. WWIIVETS is the attribute that gives the percentage of WWII veterans. The following chart shows the average percentage of WWII veterans for different states.
Figure 11 Avg World War II Vets
2.1.9 Type of Donor and RFA
The following chart shows the number of Super donors who have participated in RFA from RFA_3 to RFA_23 in different states. RFA is an important attribute for our objective of identifying probable donors.
Figure 12 Number of Super Donors for RFA_3 to RFA_23
The following chart shows the same information for Active donors who have participated in RFA from RFA_3 to RFA_23 in different states.
Figure 13 Number of Active Donors for RFA_3 to RFA_23
2.1.10 Dollar Gift in 97NK
TARGET_D is the dollar amount associated with the response to the 97NK mailing. The chart below shows the sum of TARGET_D for different states. This information is important because it gives an idea of how much the current donors give.
Figure 14 Sum of Dollar Gift Amt to 97 Mailing list
2.1.11 Per Capita
Per capita income of the donors (IC5) is an important attribute for identifying future donors. The following chart shows the average per capita income of donors across different states.
Figure 15 Per Capita Income
2.1.12 Correlation Matrix
The following table shows the correlation matrix for some of the attributes, to understand how different attributes relate to a person being a probable donor. This helped me pick out attributes worth considering for the final model selection.
Figure 16 Correlation Matrix
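As an illustration of how such a matrix can be computed, the following is a minimal sketch of a Pearson correlation matrix in plain Python. The column values are made-up toy numbers, not figures from the dataset, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values only (assumption): three attributes for four records.
sample = {
    "IC5": [100, 200, 300, 400],
    "HV1": [500, 900, 1500, 1900],
    "TARGET_B": [0, 0, 1, 1],
}
matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Attributes whose correlation with the class column is close to zero are weak candidates on this measure alone, although correlation only captures linear relationships.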
2.1.13 Regression Coefficients
Based on the correlation coefficients, I ran a regression in Excel to determine how well these attributes represent TARGET_B and which variables have explanatory power. This served as the starting point for filtering the number of attributes.
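The regression itself was run in Excel. As a rough sketch of the underlying idea, the following fits a least-squares line for a single predictor and reports R squared as a measure of explanatory power. It is a simplification (one predictor instead of several), the function name `simple_ols` is hypothetical, and the data are toy values.

```python
def simple_ols(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variation in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Toy values only (assumption): a perfectly linear relationship.
slope, intercept, r_squared = simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, r_squared)
```

A predictor with a low R squared explains little of the variation in the target on its own and is a candidate for removal.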
2.2 Attribute Data-Type
Change of attribute datatype
• I changed the datatype of attributes IC6 to IC23 from varchar to int, because I wanted to perform calculations such as sum and average on their values.
• I changed the datatype of attribute HHAS4 from varchar to int, to perform calculations such as sum and average on its values.
• Table1 below lists a few more attributes whose datatype I changed from varchar to int, so that I could perform calculations such as sum and average on their values.
Table1
Attribute | Original Datatype | Changed Datatype
MBGARDEN | varchar | int
MBCRAFT | varchar | int
MBBOOKS | varchar | int
MBCOLECT | varchar | int
MAGFAML | varchar | int
MAGFEM | varchar | int
MAGMALE | varchar | int
PUBGARDN | varchar | int
PUBHLTH | varchar | int
PUBDOITY | varchar | int
PUBNEWFN | varchar | int
PUBPHOTO | varchar | int
PUBOPP | varchar | int
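The conversion above was done on the stored data. As a sketch of the same idea outside the database, the following casts selected text columns to integers so that sums and averages can be computed. Treating blank or non-numeric text as missing (None) is an assumption of this sketch, and the helper names and toy rows are hypothetical.

```python
def to_int(value):
    """Convert a varchar-style value to int; blank or non-numeric text becomes None."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None

def convert_columns(rows, columns):
    """Return copies of the rows with the named columns cast to int."""
    return [
        {key: (to_int(val) if key in columns else val) for key, val in row.items()}
        for row in rows
    ]

# Toy rows only (assumption): values arrive as text.
rows = [
    {"STATE": "CA", "MBGARDEN": "2", "MBCRAFT": " "},
    {"STATE": "FL", "MBGARDEN": "0", "MBCRAFT": "1"},
]
converted = convert_columns(rows, {"MBGARDEN", "MBCRAFT"})
total_garden = sum(r["MBGARDEN"] for r in converted if r["MBGARDEN"] is not None)
print(converted, total_garden)
```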
3.0 Missing Data
While analyzing the target dataset, I found the following missing data. Each item is listed with an explanation of the processing done so that the data works with the different algorithms.
3.1 Gender
I used the TCODE value to identify gender for records with a missing gender value.
3.2 Zip
I corrected the values of the zip attribute. Some zip values had a '-' at the end, and these were corrected by removing the trailing '-'.
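A minimal sketch of this cleanup, assuming zip values are held as strings (the function name `clean_zip` is hypothetical):

```python
def clean_zip(zip_code):
    """Strip surrounding whitespace and any trailing '-' from a zip value."""
    return zip_code.strip().rstrip("-")

print(clean_zip("94523-"))
print(clean_zip("94523"))
```

Values without a trailing '-' pass through unchanged, so the function can be applied to the whole column.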
3.3 Age and DOB
There were many null values for the age attribute. The dataset also has a DOB (DateOfBirth) attribute, so I tried to derive the missing ages from it. However, for every record with a missing age, the corresponding DOB value was also missing.
I then considered replacing the missing ages using statistics, for example the average age of current donors in each state. Because the number of records with a null age was very large, I decided not to replace the missing values, as doing so might bias the dataset by a large margin.
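The two quantities behind this decision can be sketched as follows: the share of records with a missing age, and the average age per state computed from the records that do have one. The helper names and toy rows are hypothetical, and the sketch does not state any cutoff for what counts as too much missing data.

```python
from collections import defaultdict

def missing_fraction(rows, field):
    """Share of rows whose `field` is missing (None)."""
    return sum(1 for row in rows if row[field] is None) / len(rows)

def mean_by_group(rows, group_field, value_field):
    """Mean of `value_field` per `group_field`, ignoring missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        value = row[value_field]
        if value is not None:
            totals[row[group_field]][0] += value
            totals[row[group_field]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Toy rows only (assumption): two of five ages are missing.
rows = [
    {"STATE": "CA", "AGE": 60},
    {"STATE": "CA", "AGE": None},
    {"STATE": "CA", "AGE": 70},
    {"STATE": "FL", "AGE": None},
    {"STATE": "FL", "AGE": 50},
]
share_missing = missing_fraction(rows, "AGE")
state_means = mean_by_group(rows, "STATE", "AGE")
print(share_missing, state_means)
```

When the missing share is large, filling every gap with a state mean would make many records look identical on age, which is the bias concern described above.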
4.0 Attribute Selection
4.1 Methodology
Step-1: Logic Based Selection
For attribute selection, I analyzed the target dataset with the objective of identifying possible donors. The assumptions I used when selecting attributes from the list of possible attributes are listed below. These became the evaluation conditions for my logic-based attribute selection:
1. Higher education implies a higher probability that the person will be a future donor.
2. Higher income (per capita income, income level, number of vehicles, median income, number of employed persons, etc.) implies a higher probability that the person will be a future donor.
3. A house in a good locality implies a higher probability that the person will be a future donor.
4. A bigger house implies a higher probability that the person will be a future donor.
5. A person who rents and pays high rent has a higher probability of being a future donor.
6. A person on active duty, or who served in the military in the past, has a higher probability of being a future donor.
7. The dollar amount, frequency, and recency of gifts all give a good idea of the probability of future donations.
8. The number, type, and recency of promotions, and the person's responsiveness to them, give a good idea of the probability of future donations.
9. I also identified some attributes that are negatively correlated with the possibility of future donations, which makes them equally important for estimating future donors:
a. Persons living on social security will probably not be future donors.
b. Persons working in professions with little expendable income will probably not be future donors.
c. Persons living in rural areas are less likely to donate.
In total, I identified 122 attributes out of the 448 possible attributes.
After selecting these 122 attributes, I applied Weka's Chi Squared Attribute Evaluator to cross-reference my logic-based selection and apply what was learned in class. The evaluator's output was compared with my subset of 122 attributes. The 95 attributes that appear in both my logic-based selection and the Chi Squared Attribute Evaluator output are listed below.
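The cross-referencing step can be sketched as follows. This is not Weka's implementation: it computes the chi-squared statistic of a nominal attribute against the class from a contingency table, ranks attributes by that score, and intersects the top of the ranking with a logic-based list. It handles nominal values only, and the attribute values, the cutoff of two attributes, and the helper names are all toy assumptions.

```python
from collections import Counter

def chi_squared(attribute_values, class_values):
    """Chi-squared statistic of a nominal attribute with respect to the class."""
    n = len(attribute_values)
    observed = Counter(zip(attribute_values, class_values))
    attr_totals = Counter(attribute_values)
    class_totals = Counter(class_values)
    stat = 0.0
    for a, a_total in attr_totals.items():
        for c, c_total in class_totals.items():
            expected = a_total * c_total / n  # cell count expected under independence
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat

def rank_by_chi_squared(columns, class_values):
    """Attribute names sorted from highest to lowest chi-squared score."""
    scores = {name: chi_squared(vals, class_values) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data only (assumption): class 1 = donor, 0 = non-donor.
target = [1, 1, 0, 0]
columns = {
    "RFA_2F": ["hi", "hi", "lo", "lo"],    # perfectly aligned with the class
    "HV1_BAND": ["hi", "lo", "lo", "lo"],  # partially aligned
    "NOISE": ["a", "b", "a", "b"],         # independent of the class
}
ranked = rank_by_chi_squared(columns, target)
logic_based = {"RFA_2F", "NOISE"}          # hypothetical logic-based picks
matched = logic_based & set(ranked[:2])    # keep attributes both approaches agree on
print(ranked, matched)
```

Attributes chosen by logic but ranked low by the evaluator (here `NOISE`) are the ones to reconsider before building the final list.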
4.2 Included Attributes
Attribute Name | Attribute Description | Reason
TARGET_D | Donation Amount (in $) associated with the Response to 97NK Mailing | The donation amount indicates how much a donor can give, based on the 1997 donation history.
IC5 | Per Capita Income | Per capita income indicates the wealth of the donor base. The higher the per capita income, the higher the likely probability of donation.
ZIP | Zipcode | Zip code groups donors who are likely to share a similar income range and probability of donation.
POP901 | Number of Persons in donor’s neighborhood, as collected from the 1990 US Census. | The number of persons in the donor’s neighborhood is indicative of income range, as richer neighborhoods tend to have fewer persons.
AVGGIFT | Average dollar amount of gifts to date | The average gift amount to date is a good indicator of the probability of donation.
HV1 | Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Median home value is an indication of income; the higher the median home value, the higher the probability of donation.
HV2 | Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Average home value is an indication of income; the higher the average home value, the higher the probability of donation.
RAMNTALL | Dollar amount of lifetime gifts to date | The total dollar amount of gifts to date is a good indicator of the probability of donation.
IC3 | Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Average household income is an indication of income; the higher the average household income, the higher the probability of donation.
IC2 | Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Median family income is an indication of income; the higher the median family income, the higher the probability of donation.
IC4 | Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Average family income is an indication of income; the higher the average family income, the higher the probability of donation.
IC1 | Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. | Median household income is an indication of income; the higher the median household income, the higher the probability of donation.
OSOURCE | Code indicating which mailing list the donor was originally acquired from | This attribute indicates the chances of donation based on the marketing approach used to reach the donor.
LASTGIFT | Dollar amount of most recent gift from giving history file | The dollar amount of the most recent gift is a good indicator of the probability of donation, as a higher recent gift amount makes the donor more likely to repeat the donation.
MAXRAMNT | Dollar amount of largest gift to date from giving history file | The dollar amount of the largest gift is a good indicator of the probability of donation, as a larger gift amount makes the donor more likely to repeat the donation.
RFA_3 | Donor's RFA status as of 96NK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_4 | Donor's RFA status as of 96TK promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_6 | Donor's RFA status as of 96LL promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_8 | Donor's RFA status as of 96GK promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2 | Donor's RFA status as of 97NK promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
FISTDATE | Date of first gift from giving history file | The date of the first gift shows how long the donor has been giving, which makes it a good attribute for estimating future donations.
RFA_12 | Donor's RFA status as of 96XK promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
MINRAMNT | Dollar amount of smallest gift to date from giving history file | The dollar amount of the smallest gift is negatively correlated with future donation and hence is a good attribute for estimating future donation.
RFA_11 | Donor's RFA status as of 96X1 promotion date from promotion history File | RFA is a good measure of the probability that the donor will repeat the donation.
NGIFTALL | Number of lifetime gifts to date from promotion history File | The number of lifetime gifts to date is a good measure for estimating future donation; the higher the number of lifetime gifts, the higher the probability of future donations.
RFA_2F | Frequency code for RFA_2 | The frequency code of the RFA measure shows how frequently the donor has given in the past and hence is a good estimator of future donations.
RFA_7 | Donor's RFA status as of 96G1 promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2A | Donation Amount code for RFA_2 | The amount code of the RFA measure shows how much the donor has given in the past and hence is a good estimator of future donations.
RFA_9 | Donor's RFA status as of 96CC promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
MAXRDATE | Date associated with the largest gift to date | The date of the largest gift shows how long ago the donor gave the largest gift and is a good estimator of future donation.
NUMPROM | Lifetime number of promotions received to date | The lifetime number of promotions is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donation.
CARDGIFT | Number of lifetime gifts to card promotions to date | The number of lifetime gifts to card promotions is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donation.
RFA_16 | Donor's RFA status as of 95LL promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
ODATEDW | Date of donor's first gift | The date of the first gift shows how long the donor has been giving, which makes it a good attribute for estimating future donations.
NEXTDATE | Date of second gift | The date of the second gift shows how long the donor has been giving, which makes it a good attribute for estimating future donations.
RFA_14 | Donor's RFA status as of 95NK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
HVP1 | Percent Home Value >= $200,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
RFA_18 | Donor's RFA status as of 95GK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_5 | Donor's RFA status as of 96SK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
MINRDATE | Date associated with the smallest gift to date | The date of the smallest gift is negatively correlated with future donation and hence is a good attribute for estimating future donation.
HVP2 | Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
HVP6 | Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
RFA_19 | Donor's RFA status as of 95CC promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_10 | Donor's RFA status as of 96WL promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RP1 | Percent Renters Paying >= $500 per Month | Home rent is an indication of income; the higher the rent, the higher the probability of donation.
CARDPROM | Lifetime number of card promotions received to date. | The lifetime number of card promotions received is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donation.
RFA_17 | Donor's RFA status as of 95G1 promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
HHAS3 | Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990 US Census. | Interest, rental, or dividend income in a household is an indication of income; the higher this percentage, the higher the probability of donation.
HVP3 | Percent Home Value >= $100,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
RFA_13 | Donor's RFA status as of 95FS promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
SEC5 | Percent Persons in College in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of persons in college is an indication of education; the better the education, the higher the probability of donation.
LFC3 | Percent Females in Labor Force in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of females in the labor force is an indication of income; the higher the percentage in the labor force, the higher the probability of donation.
HVP5 | Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
NUMPRM12 | Number of promotions received in the last 12 months | The number of promotions received in the last 12 months is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donation.
EC4 | Percent Adults 25+ Completed High School or Equivalency in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adults who completed high school or equivalent education is an indication of education; the better the education, the higher the probability of donation.
HU5 | Percent Seasonal/Recreational Vacant Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of seasonal/recreational vacant units is an indication of income; the higher this percentage, the higher the probability of donation.
HUR2 | Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of housing units with 6 or more rooms is an indication of income; the higher this percentage, the higher the probability of donation.
HVP4 | Percent Home Value >= $75,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income; the higher the home value, the higher the probability of donation.
LFC5 | Percent Adult Females Employed in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of employed adult females is an indication of income; the higher this percentage, the higher the probability of donation.
DW1 | Percent Single Unit Structure in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of single unit structures is negatively correlated with future donation and hence is a good attribute for estimating future donation.
HU1 | Percent Owner Occupied Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of owner occupied housing is negatively correlated with future donation and hence is a good attribute for estimating future donation.
AFC1 | Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adults in active military service is an indication of association with the military; the higher this percentage, the higher the probability of donation.
VC3 | Percent WW2 Veterans Age 16+ in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of WW2 veterans is an indication of association with the military; the higher the percentage of WW2 veterans, the higher the probability of donation.
RP3
Percent Renters Paying >= $300 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Rent of home is indication of income
and higher the home rent higher
probability of donation
DW2
Percent Detached Single Unit Structure in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent detached single unit structure
is negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
RFA_22 Donor's RFA status as of 95XK promotion date
RFA is good measure of the
probability of the donor to repeat the
donation.
WWIIVETS % WWII Vets
Percent WW2 veterans is indication
of association with military and
higher the percent of WW2 veterans
higher will be probability of donation
HHN2
Percent 2 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 person household is
indication of no future liability on
part of donor and hence the person is
more likely to be future donor
VOC2
Percent Households w/ 2+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent households with 2+ vehicles is
an indication of income; the higher the
number of vehicles, the higher the
probability of donation.
HU2
Percent Renter Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied housing is
positively correlated with future
donation and hence is a good attribute
for estimating future donation, provided
the rent paid is high.
AGE Overlay Age
The higher the age, the higher the
probability of future donations.
HC4
Percent Owner Occupied Structures Built Since
1985 in donor’s neighborhood, as collected from the
1990 US Census.
Percent owner occupied housing is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as the
owner has only one asset.
LFC7
Percent 2 Parent Earner Families in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 parent earner families is an
indication of income; the higher the
percent, the higher the probability of donation.
RP2
Percent Renters Paying >= $400 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Home rent is an indication of income;
the higher the rent, the higher the
probability of donation.
AFC2
Percent Males in Active Military Service in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males in active military
service indicates association with the
military; the higher the percent, the
higher the probability of
donation.
IC23
Percent Families w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent families with income >= $150k
is an indication of income; the higher
the percent, the higher the probability
of donation.
OCC9
Percent Farmers in donor’s neighborhood, as
collected from the 1990 US Census.
Percent farmers is negatively
correlated with future donation and
hence is a good attribute for estimating
future donation, as farmers generally
do not have an expendable surplus.
IC14
Percent Households w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent households with income
>= $150k is an indication of income;
the higher the percent, the higher the
probability of donation.
EC1
Median Years of School Completed by Adults 25+
in donor’s neighborhood, as collected from the 1990
US Census.
Median years of school completed is an
indication of education; the better the
education, the higher the probability
of donation.
LASTDATE Date associated with the most recent gift
The date of the most recent gift gives
an idea of how long the donor has been
involved in donation and is a good
attribute for estimating future donations.
LFC6
Percent Mothers Employed Married and Single in
donor’s neighborhood, as collected from the 1990
US Census.
Percent mothers employed, married
and single, gives an idea about the
income and liabilities of the
neighborhood and is a good indication
of future donations.
HUPA6
Percent Renter Occupied, 5+ Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied 5+ units is an
indication of income; the higher the
percent, the higher the probability of
donation.
RFA_24 Donor's RFA status as of 94NK promotion date
RFA is a good measure of the
probability that the donor will repeat
the donation.
HU4
Percent Vacant Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent vacant housing units in the
donor’s neighborhood is a good
indication of future donations, as a
higher number of vacant units implies
a higher number of second homes.
HHAS1
Percent Households on Social Security in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent households on social security
is negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as people
on social security seldom donate.
HC19
Percent Housing Units w/ Public Sewer Source in
donor’s neighborhood, as collected from the 1990
US Census.
Percent housing with public sewer is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as
housing with public sewer is generally
low income housing.
VOC1
Percent Households w/ 1+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent households with 1+ vehicles is
an indication of income; the higher the
number of vehicles, the higher the
probability of donation.
HC7
Percent Owner Occupied Structures Built Since
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent owner occupied housing is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as the
owner has only one asset.
POP90C2
Percent Population Outside Urbanized Area in
donor’s neighborhood, as collected from the 1990
US Census.
Percent outside urbanized area is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as people
outside urbanized areas seldom donate.
EIC1
Percent Employed in Agriculture in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent employed in agriculture is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as people
employed in agriculture seldom have
expendable income.
HC8
Percent Owner Occupied Structures Built Prior to
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent owner occupied housing is
negatively correlated with future
donation and hence is a good attribute
for estimating future donation, as the
owner has only one asset.
MALEMILI
% Males active in the Military in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males active in the military
indicates association with the military;
the higher the percent, the higher
the probability of donation.
LFC2
Percent Adult Males in Labor Force in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent adult males in labor force is an
indication of income; the higher the
percent, the higher the probability of
donation.
OEDC5
Percent Private Profit Wage or Salaried Worker in
donor’s neighborhood, as collected from the 1990
US Census.
Percent private profit wage or salaried
workers is an indication of income; the
higher the percent, the higher the
probability of donation, as private
profit wages are generally high and the
person has expendable income.
In analyzing the output of the Chi Squared attribute evaluator, I also eliminated some of
the highly ranked attributes from the ranking generated by Weka. The rationale for each elimination
was either irrelevance to the objective of identifying future donors or
multicollinearity among the attributes (more than one attribute conveying the same information). The
purpose is to select an appropriate number of attributes that help predict future donors. The following
is the list of attributes that were eliminated from my final attribute selection, with the reason for
each elimination.
4.3 Omitted Attributes
Attribute Name Attribute Description Reason
CONTROLN Control number (unique record identifier)
This is a unique record identifier and
adds no value in identifying future donors.
POP903
Number of Households in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood,
POP901) and thus adds no additional
value in identifying future donors.
POP902
Number of Families in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood,
POP901) and thus adds no additional
value in identifying future donors.
DOB Date of birth of Donor
This attribute is multicollinear with an
attribute in our final attribute list (AGE)
and thus adds no additional value in
identifying future donors.
HHP2
Average Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood,
POP901) and thus adds no additional
value in identifying future donors.
HHP1
Median Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood,
POP901) and thus adds no additional
value in identifying future donors.
MSA MSA Code
This is a geographic area code (Metropolitan Statistical Area)
and adds no value in identifying future donors.
ADI ADI Code
This is a media market code (Area of Dominant Influence)
and adds no value in identifying future donors.
DMA DMA Code
This is a media market code (Designated Market Area)
and adds no value in identifying future donors.
TPE13 Percent Traveling 15 - 59 Minutes to Work
This attribute adds no value in identifying
future donors, as travel time doesn’t
decide whether a person will be a future donor.
ETHC3
Percent White Age 60+ in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donors, as a person’s race doesn’t
decide whether that person will be a future donor.
TCODE Donor title code
This attribute adds no value in identifying
future donors, as a person’s title doesn’t
decide whether that person will be a future donor.
DW7
Percent Group Quarters in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donors, as living in group quarters doesn’t
decide whether a person will be a future donor.
POBC2
Percent Born in State of Residence in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donors, as a person’s affinity to a
place doesn’t decide whether that person will be
a future donor.
MARR4
Percent Never Married in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donors, as marital status doesn’t decide
whether a person will be a future donor.
VC4
Percent Veterans Serving After May 1975 Only
in donor’s neighborhood, as collected from the
1990 US Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(number of military service persons,
active military etc.) and thus adds no
additional value in identifying future
donors.
DW9
Non-Institutional Group Quarters in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donors, as living in group quarters doesn’t
decide whether a person will be a future donor.
HU3
Percent Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(number of vacant houses, house type,
number of houses etc.) and thus adds no
additional value in identifying future
donors.
ETH1
Percent White in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donors, as a person’s race doesn’t
decide whether that person will be a future donor.
MARR1
Percent Married in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donors, as marital status doesn’t decide
whether a person will be a future donor.
HHN1
Percent 1 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donors, as living alone doesn’t
decide whether a person will be a future donor.
HHD3
Percent Married Couple Families in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donors, as marital status doesn’t decide
whether a person will be a future donor.
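The multicollinearity argument used for several of the omitted attributes can also be checked numerically. The following is a minimal Python sketch, not the procedure used in this report (the eliminations here were made by inspecting the Weka ranking): it computes the Pearson correlation between two numeric columns and flags the pair as redundant above a threshold. The function names, the 0.9 threshold and the sample values are assumptions made for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def is_redundant(xs, ys, threshold=0.9):
    """Flag two attributes as conveying (nearly) the same information."""
    return abs(pearson(xs, ys)) >= threshold

# Made-up values standing in for POP901 (persons) and POP903 (households).
persons = [1200, 3400, 560, 8000, 2500]
households = [480, 1300, 230, 3100, 990]
print(is_redundant(persons, households))
```

When two attributes are flagged this way, keeping only one of them (as done above with POP901) avoids feeding the model the same information twice.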
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.
Because the selected set of 95 attributes produced unsatisfactory statistics on the trained model, I
ran several attribute selection methods available in Weka to gain a better understanding of the
impact of attributes on model accuracy.
I ran ChiSquaredAttributeEval, GainRatioAttributeEval and InfoGainAttributeEval; the results
of these three evaluators from Weka follow.
Chi Squared Gain Ratio Information Gain
Ranked attributes: Ranked attributes: Ranked attributes:
472 TARGET_D 472 TARGET_D 73 IC5
470 CONTROLN 362 ADATE_2 74 ZIP
203 IC5 363 ADATE_3 76 POP901
5 ZIP 470 CONTROLN 79 HV2
76 POP901 203 IC5 78 HV1
469 AVGGIFT 5 ZIP 77 AVGGIFT
146 HV1 14 MDMAUD 83 IC4
147 HV2 76 POP901 82 IC2
78 POP903 469 AVGGIFT 81 IC3
77 POP902 366 ADATE_6 84 IC1
457 RAMNTALL 147 HV2 80 RAMNTALL
201 IC3 146 HV1 85 OSOURCE
200 IC2 78 POP903 93 FISTDATE
202 IC4 77 POP902 90 RFA_6
199 IC1 364 ADATE_4 8 MAXRDATE
8 DOB 95 ETH12 47 VOC2
2 OSOURCE 8 DOB 9 NUMPROM
464 LASTGIFT 41 PUBPHOTO 91 RFA_8
462 MAXRAMNT 457 RAMNTALL 94 RFA_12
386 RFA_3 475 RFA_2F 15 HVP1
136 HHP2 476 RFA_2A 86 LASTGIFT
135 HHP1 380 ADATE_20 88 RFA_3
387 RFA_4 202 IC4 2 RFA_11
389 RFA_6 201 IC3 45 WWIIVETS
391 RFA_8 479 MDMAUD_A 42 RP3
196 MSA 200 IC2 18 MINRDATE
385 RFA_2 199 IC1 67 POP90C2
466 FISTDATE 464 LASTGIFT 89 RFA_4
395 RFA_12 98 ETH15 30 LFC3
460 MINRAMNT 2 OSOURCE 5 RFA_7
394 RFA_11 385 RFA_2 87 MAXRAMNT
458 NGIFTALL 462 MAXRAMNT 13 NEXTDATE
475 RFA_2F 145 DW9 7 RFA_9
390 RFA_7 367 ADATE_7 34 HU5
476 RFA_2A 386 RFA_3 64 HC19
392 RFA_9 387 RFA_4 60 HUPA6
463 MAXRDATE 409 MAXADATE 66 HC7
197 ADI 460 MINRAMNT 69 HC8
410 NUMPROM 80 POP90C2 38 DW1
198 DMA 305 AFC3 26 HHAS3
459 CARDGIFT 389 RFA_6 41 VC3
399 RFA_16 323 ANC11 63 HHAS1
1 ODATEDW 303 AFC1 43 DW2
467 NEXTDATE 478 MDMAUD_F 35 HUR2
397 RFA_14 391 RFA_8 20 HVP6
173 HVP1 90 ETH7 3 NGIFTALL
401 RFA_18 44 MALEMILI 19 HVP2
388 RFA_5 196 MSA 16 RFA_18
461 MINRDATE 234 TPE6 71 LFC2
174 HVP2 304 AFC2 37 LFC5
178 HVP6 178 HVP6 14 RFA_14
243 TPE13 9 NOEXCH 31 HVP5
402 RFA_19 135 HHP1 51 LFC7
393 RFA_10 29 MBCRAFT 27 HVP3
192 RP1 379 ADATE_19 50 HC4
408 CARDPROM 143 DW7 39 HU1
400 RFA_17 458 NGIFTALL 11 RFA_16
224 HHAS3 395 RFA_12 48 HU2
175 HVP3 136 HHP2 23 RP1
396 RFA_13 79 POP90C1 92 RFA_2
302 SEC5 459 CARDGIFT 44 RFA_22
246 LFC3 392 RFA_9 59 LFC6
177 HVP5 390 RFA_7 21 RFA_19
412 NUMPRM12 394 RFA_11 25 RFA_17
293 EC4 252 LFC9 1 MINRAMNT
154 HU5 144 DW8 36 HVP4
180 HUR2 233 TPE5 46 HHN2
176 HVP4 376 ADATE_16 32 NUMPRM12
248 LFC5 466 FISTDATE 33 EC4
137 DW1 373 ADATE_13 65 VOC1
150 HU1 172 ETHC6 49 AGE
303 AFC1 173 HVP1 52 RP2
311 VC3 1 ODATEDW 62 HU4
194 RP3 463 MAXRDATE 72 OEDC5
138 DW2 197 ADI 10 CARDGIFT
405 RFA_22 86 ETH3 57 EC1
47 WWIIVETS 93 ETH10 4 RFA_2F
169 ETHC3 268 EIC2 22 RFA_10
126 HHN2 198 DMA 24 CARDPROM
335 VOC2 92 ETH9 61 RFA_24
3 TCODE 388 RFA_5 17 RFA_5
143 DW7 212 IC14 53 AFC2
151 HU2 313 ANC1 68 EIC1
17 AGE 345 HC9 58 LASTDATE
340 HC4 85 ETH2 29 SEC5
250 LFC7 411 CARDPM12 75 STATE
329 POBC2 81 POP90C3 28 RFA_13
134 MARR4 410 NUMPROM 56 IC14
193 RP2 171 ETHC5 6 RFA_2A
304 AFC2 221 IC23 70 MALEMILI
221 IC23 412 NUMPRM12 55 OCC9
312 VC4 187 HUPA3 12 ODATEDW
262 OCC9 3 TCODE 54 IC23
212 IC14 399 RFA_16 40 AFC1
145 DW9 332 LSC3
152 HU3 397 RFA_14
290 EC1 290 EC1
465 LASTDATE 91 ETH8
84 ETH1 334 VOC1
249 LFC6 356 HC20
190 HUPA6 235 TPE7
407 RFA_24 174 HVP2
153 HU4 465 LASTDATE
222 HHAS1 231 TPE3
355 HC19 339 HC3
334 VOC1 368 ADATE_8
131 MARR1 401 RFA_18
343 HC7 327 ANC15
125 HHN1 87 ETH4
80 POP90C2 35 MAGMALE
267 EIC1 331 LSC2
344 HC8 28 HIT
44 MALEMILI 238 PEC1
245 LFC2 154 HU5
157 HHD3 190 HUPA6
287 OEDC5 177 HVP5
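The three rankings above come from Weka's evaluators. As a rough illustration of what each score measures, the sketch below computes information gain, gain ratio and the chi-squared statistic for a single nominal attribute against the class. It is a simplified Python sketch and not Weka's implementation: it assumes nominal values only (numeric attributes would first have to be discretized), and the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split
    return info_gain(values, labels) / split_info

def chi_squared(values, labels):
    """Chi-squared statistic of the attribute-by-class contingency table."""
    total = len(labels)
    observed = Counter(zip(values, labels))
    value_counts = Counter(values)
    label_counts = Counter(labels)
    stat = 0.0
    for v, nv in value_counts.items():
        for y, ny in label_counts.items():
            expected = nv * ny / total
            stat += (observed[(v, y)] - expected) ** 2 / expected
    return stat

# Toy data: a hypothetical nominal attribute against TARGET_B-style labels.
gave_before = ["yes", "yes", "no", "no"]
target = [1, 1, 0, 0]
print(info_gain(gave_before, target),
      gain_ratio(gave_before, target),
      chi_squared(gave_before, target))
```

Ranking attributes then amounts to sorting them by one of these scores in descending order, which is why the three columns above can disagree on the order.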
I then carried out an overlap (intersection) analysis of the attributes across the three evaluator
rankings above and selected the attributes that ranked highly across these evaluators. The final
45 attributes are listed below.
Final Attributes Attribute Description
IC5 Per Capita Income
ZIP Zipcode
POP901 Number of Persons in donor’s neighborhood, as collected from the 1990 US Census.
AVGGIFT Average dollar amount of gifts to date
HV1 Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
HV2 Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC4 Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC3 Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC2 Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC1 Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
RAMNTALL Dollar amount of lifetime gifts to date
DOB Date of birth of Donor
OSOURCE Code indicating which mailing list the donor was originally acquired from
RFA_4 Donor's RFA status as of 96TK promotion date from promotion history File
RFA_6 Donor's RFA status as of 96LL promotion date from promotion history File
RFA_8 Donor's RFA status as of 96GK promotion date from promotion history File
RFA_3 Donor's RFA status as of 96NK promotion date
FISTDATE Date of first gift from giving history file
RFA_12 Donor's RFA status as of 96XK promotion date from promotion history File
MAXRAMNT Dollar amount of largest gift to date from giving history file
RFA_2 Donor's RFA status as of 97NK promotion date from promotion history File
RFA_9 Donor's RFA status as of 96CC promotion date
RFA_7 Donor's RFA status as of 96G1 promotion date
RFA_11 Donor's RFA status as of 96X1 promotion date from promotion history File
RFA_2A Donation Amount code for RFA_2
RFA_2F Frequency code for RFA_2
HVP2 Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
HVP6 Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census.
RFA_18 Donor's RFA status as of 95GK promotion date
WWIIVETS % WWII Vets
HHAS3 Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990 US Census.
HUR2 Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census.
NGIFTALL Number of lifetime gifts to date from promotion history File
HVP5 Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census.
CARDGIFT Number of lifetime gifts to card promotions to date
LASTGIFT Dollar amount of most recent gift from giving history file
AFC1 Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
AFC2 Percent Males in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
IC23 Percent Families w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
IC14 Percent Households w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
EC1 Median Years of School Completed by Adults 25+ in donor’s neighborhood, as collected from the 1990 US Census.
LASTDATE Date associated with the most recent gift
VOC1 Percent Households w/ 1+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census.
POP90C2 Percent Population Outside Urbanized Area in donor’s neighborhood, as collected from the 1990 US Census.
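The overlap analysis that produced this list can be sketched as a set intersection over the top-ranked entries of each evaluator. The Python sketch below is only one possible reading of that step: it keeps the attributes that appear in the top k of every ranking, whereas the selection in this report also involved manual judgment. The helper name `overlap` and the cut-off k = 8 are assumptions; the attribute names are the first eight entries of each ranking shown earlier.

```python
def overlap(rankings, k):
    """Attributes found in the top k of every ranking, in the first ranking's order."""
    common = set.intersection(*(set(r[:k]) for r in rankings))
    return [name for name in rankings[0][:k] if name in common]

# First eight entries of each Weka ranking listed in this report.
chi_squared_rank = ["TARGET_D", "CONTROLN", "IC5", "ZIP",
                    "POP901", "AVGGIFT", "HV1", "HV2"]
gain_ratio_rank = ["TARGET_D", "ADATE_2", "ADATE_3", "CONTROLN",
                   "IC5", "ZIP", "MDMAUD", "POP901"]
info_gain_rank = ["IC5", "ZIP", "POP901", "HV2",
                  "HV1", "AVGGIFT", "IC4", "IC2"]

selected = overlap([chi_squared_rank, gain_ratio_rank, info_gain_rank], k=8)
print(selected)
```

A larger k widens the candidate pool; attributes such as TARGET_D and CONTROLN drop out here because they are absent from the information gain ranking.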
5.0 Models
5.1 Model-1
NaiveBayes 10 Fold Cross Validation Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
8,843 1,156 a = 0
1,178 275 b = 1
In the terms below, the non-donor class (a = 0) is treated as the positive class.
• TP - 8,843 non-donor instances were correctly identified as non-donors by the model.
• TN - 275 donor instances were correctly identified as donors by the model.
• FP - 1,178 donor instances were incorrectly identified as non-donors by the model.
• FN - 1,156 non-donor instances were incorrectly identified as donors by the model.
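The confusion matrix above can be turned into summary metrics. The Python sketch below assumes the usual Weka layout (rows are the actual class, columns the predicted class), which matches the bullet descriptions, and lets the caller choose which class counts as positive. The function name `metrics` and the dictionary keys are hypothetical.

```python
def metrics(matrix, positive=1):
    """Accuracy, precision and recall from a 2x2 matrix indexed [actual][predicted]."""
    negative = 1 - positive
    tp = matrix[positive][positive]
    fn = matrix[positive][negative]
    fp = matrix[negative][positive]
    tn = matrix[negative][negative]
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Model-1 confusion matrix: class 0 = non-donor, class 1 = donor.
model_1 = [[8843, 1156],
           [1178, 275]]

donor_view = metrics(model_1, positive=1)      # donors as the positive class
non_donor_view = metrics(model_1, positive=0)  # convention used in the bullets
print(donor_view)
print(non_donor_view)
```

With donors taken as the positive class, Model-1 reaches roughly 79.6% accuracy but recovers only about 18.9% of the actual donors (275 of 1,453), which is the figure that matters most when the goal is to identify future donors.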
5.2 Model-2
NaiveBayes 10 Fold Cross Validation with Test Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3265 458 a = 0
374 110 b = 1
• TP – 3,265 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP- 374 donor instances were incorrectly identified as non-donors by the model.
• FN – 458 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, the evaluation dataset was supplied as the test set for running the model.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3289 434 a = 0
367 117 b = 1
• TP – 3,289 non-donor instances were correctly identified as non-donors by the model.
• TN- 117 donor instances were correctly identified as donors by the model.
• FP- 367 donor instances were incorrectly identified as non-donors by the model.
• FN – 434 non-donor instances were incorrectly identified as donors by the model.
Note:
• The accuracy on the test dataset (about 80.2%) and on the evaluation dataset (about 81.0%)
differs by less than one percentage point.
• Hence the model does not appear to overfit the data.
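The comparison in the note can be reproduced from the confusion matrices reported for NaiveBayes. The Python sketch below computes accuracy for the cross-validation, test and evaluation matrices and the gap between the two held-out sets. The function name `accuracy` is hypothetical. A small gap shows that performance is stable across held-out data; comparing against accuracy on the training data itself would be the more direct check for overfitting.

```python
def accuracy(matrix):
    """Share of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# NaiveBayes confusion matrices as reported in Model-1 and Model-2.
cv_matrix = [[8843, 1156], [1178, 275]]   # 10 fold cross validation
test_matrix = [[3265, 458], [374, 110]]   # test dataset
eval_matrix = [[3289, 434], [367, 117]]   # evaluation dataset

gap = abs(accuracy(test_matrix) - accuracy(eval_matrix))
print(round(accuracy(cv_matrix), 4),
      round(accuracy(test_matrix), 4),
      round(accuracy(eval_matrix), 4),
      round(gap, 4))
```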
Other Models
5.3 Model-3
J48Graft Training Set Model with Test Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
J48Graft Training Set Model with Evaluation Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, the evaluation dataset was supplied as the test set for running the model.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Snapshot of Tree
The training set for this algorithm contained 11,452 instances: 9,999 non-donors and 1,453
donors. When the J48graft algorithm was run, it used the non-donor class as the root
node and did not create any further splits beyond what is shown in the figure below.
Conclusion:
The model classified every instance as a non-donor, so the confusion matrix simply reproduces the class counts in the dataset.
True Negative and False Negative are both 0: the model did not identify any donors.
5.4 Model-4
Decision Stump 10 Fold Cross Validation Model Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
5.5 Model-5
Decision Stump 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP - 3723 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
Decision Stump Cross Validation Model with Evaluation Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, the evaluation dataset was supplied as the test set for running the model.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, so the confusion matrix simply reproduces the class counts in the dataset.
• True Negative and False Negative are both 0: the model did not identify any donors.
5.6 Model-6
OneR 10 Fold Cross Validation Model Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9571 428 a = 0
1406 47 b = 1
• TP – 9,571 non-donor instances were correctly identified as non-donors by the model.
• TN- 47 donor instances were correctly identified as donors by the model.
• FP- 1406 donor instances were incorrectly identified as non-donors by the model.
• FN - 428 non-donor instances were incorrectly identified as donors by the model.
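For context on why OneR behaves this way, the rule it learns is based on a single attribute: for each attribute it predicts the majority class per attribute value and keeps the attribute whose rule makes the fewest training errors. The Python sketch below illustrates that idea for nominal attributes only. It is not Weka's implementation (which also handles numeric attributes by bucketing), and the function name, attribute names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (attribute, rule, errors) for the best single-attribute rule.

    rows   -- list of dicts mapping attribute name to a nominal value
    labels -- class label for each row
    """
    best = None
    for attr in rows[0]:
        # Count class labels for every value of this attribute.
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[attr]][y] += 1
        # Predict the majority class for each value.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Every instance outside the majority class of its value is an error.
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Toy data with two made-up nominal attributes.
rows = [
    {"gave_before": "yes", "region": "N"},
    {"gave_before": "yes", "region": "S"},
    {"gave_before": "no", "region": "N"},
    {"gave_before": "no", "region": "S"},
]
labels = [1, 1, 0, 0]
attribute, rule, errors = one_r(rows, labels)
print(attribute, rule, errors)
```

On a dataset where donors are a small minority, the majority class for most attribute values is non-donor, which is consistent with the low number of donors identified in the matrix above.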
5.7 Model-7
OneR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3564 159 a = 0
455 29 b = 1
• TP – 3564 non-donor instances were correctly identified as non-donors by the model.
• TN- 29 donor instances were correctly identified as donors by the model.
• FP- 455 donor instances were incorrectly identified as non-donors by the model.
• FN - 159 non-donor instances were incorrectly identified as donors by the model.
OneR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, the evaluation dataset was supplied as the test set for running the model.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3566 157 a = 0
454 30 b = 1
• TP – 3566 non-donor instances were correctly identified as non-donors by the model.
• TN- 30 donor instances were correctly identified as donors by the model.
• FP- 454 donor instances were incorrectly identified as non-donors by the model.
• FN – 157 non-donor instances were incorrectly identified as donors by the model.
Note:
• The accuracy on the test dataset (about 85.4%) and on the evaluation dataset (about 85.5%)
differs by about 0.1 percentage points.
• Hence the model does not appear to overfit the data.
5.8 Model-8
ZeroR 10 Fold Cross Validation Model Statistics
Algorithm: ZeroR algorithm was used.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
5.9 Model-9
ZeroR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: ZeroR algorithm was used to generate model.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
ZeroR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: ZeroR algorithm was used to generate model.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
• In addition, the evaluation dataset was supplied as the test set for running the model.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, so the confusion matrix simply reproduces the class counts in the dataset.
• True Negative and False Negative are both 0: the model did not identify any donors.
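The ZeroR result follows directly from how the algorithm works: it ignores every attribute and always predicts the majority class of the training data. The Python sketch below reproduces the confusion matrix reported for the test set from the class counts alone. The function name `zero_r` is hypothetical, and the label lists are synthetic stand-ins built from the counts given in this report.

```python
from collections import Counter

def zero_r(train_labels):
    """Return a predictor that always outputs the majority training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance=None: majority

# Class counts taken from the report: 0 = non-donor, 1 = donor.
train_labels = [0] * 9999 + [1] * 1453
test_labels = [0] * 3723 + [1] * 484

predict = zero_r(train_labels)

# Build the confusion matrix indexed [actual][predicted].
matrix = [[0, 0], [0, 0]]
for actual in test_labels:
    matrix[actual][predict()] += 1

baseline_accuracy = (matrix[0][0] + matrix[1][1]) / len(test_labels)
print(matrix, round(baseline_accuracy, 4))
```

This baseline reaches about 88.5% accuracy without identifying a single donor, so accuracy alone is a misleading measure on this imbalanced dataset; a model such as NaiveBayes scores lower on accuracy (about 80%) yet does identify some donors.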
6.0 Different number of attributes but same number of records using NaiveBayes Model
Note: The TARGET_B attribute is used in all of the models below.
6.1 5 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1 were the attributes used to create model.
NaiveBayes 10 Fold Cross Validation Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9795 204 a = 0
1415 38 b = 1
• TP – 9795 non-donor instances were correctly identified as non-donors by the model.
• TN- 38 donor instances were correctly identified as donors by the model.
• FP- 1415 donor instances were incorrectly identified as non-donors by the model.
• FN – 204 non-donor instances were incorrectly identified as donors by the model.
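The four bullets above follow the convention used throughout this report, in which class 0 (non-donor) is treated as the positive class. A minimal sketch of that reading is given below; `label_cells` is a hypothetical helper, and the matrix layout (rows are actual classes, columns are predicted classes) is assumed to match the Weka output shown above.

```python
def label_cells(matrix):
    """Label the cells of a 2x2 confusion matrix with class 0 as positive.

    matrix[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    (tp, fn), (fp, tn) = matrix
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# 5-attribute NaiveBayes model, 10-fold cross-validation (values from the report).
cells = label_cells([[9795, 204], [1415, 38]])
accuracy = (cells["TP"] + cells["TN"]) / sum(cells.values())

print(cells, round(accuracy, 3))
```

If donors were treated as the positive class instead, the TP and TN labels would swap, and so would FP and FN.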
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3649 74 a = 0
473 11 b = 1
• TP –3649 non-donor instances were correctly identified as non-donors by the model.
• TN- 11 donor instances were correctly identified as donors by the model.
• FP- 473 donor instances were incorrectly identified as non-donors by the model.
• FN – 74 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3651 72 a = 0
471 13 b = 1
• TP –3651 non-donor instances were correctly identified as non-donors by the model.
• TN- 13 donor instances were correctly identified as donors by the model.
• FP- 471 donor instances were incorrectly identified as non-donors by the model.
• FN – 72 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 87.0% on the test dataset (3660 of 4207 instances correct) and 87.1% on the evaluation dataset (3664 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
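The comparison behind this note can be checked with a short sketch that computes accuracy from the two 5-attribute confusion matrices reported above. The function name `accuracy` is a hypothetical helper; the report does not state a threshold for what counts as overfitting, so the sketch only prints the gap.

```python
def accuracy(matrix):
    """Share of instances on the diagonal of a 2x2 confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# 5-attribute NaiveBayes model (values from the report).
test_matrix = [[3649, 74], [473, 11]]
eval_matrix = [[3651, 72], [471, 13]]

test_acc = accuracy(test_matrix)
eval_acc = accuracy(eval_matrix)
gap = abs(test_acc - eval_acc)

print(f"test={test_acc:.3f} evaluation={eval_acc:.3f} gap={gap:.4f}")
```

The same function can be applied to the matrices of the other attribute subsets to repeat the check.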
6.2 10 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1 were the attributes used to create
model.
NaiveBayes 10 Fold Cross Validation Statistics 10 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9320 679 a = 0
1329 124 b = 1
• TP – 9320 non-donor instances were correctly identified as non-donors by the model.
• TN- 124 donor instances were correctly identified as donors by the model.
• FP- 1329 donor instances were incorrectly identified as non-donors by the model.
• FN – 679 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3472 251 a = 0
438 46 b = 1
• TP - 3472 non-donor instances were correctly identified as non-donors by the model.
• TN- 46 donor instances were correctly identified as donors by the model.
• FP- 438 donor instances were incorrectly identified as non-donors by the model.
• FN – 251 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3465 258 a = 0
439 45 b = 1
• TP - 3465 non-donor instances were correctly identified as non-donors by the model.
• TN- 45 donor instances were correctly identified as donors by the model.
• FP- 439 donor instances were incorrectly identified as non-donors by the model.
• FN – 258 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 83.6% on the test dataset (3518 of 4207 instances correct) and 83.4% on the evaluation dataset (3510 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.3 15 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 were the attributes used to create model.
NaiveBayes 10 Fold Cross Validation Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9493 506 a = 0
1351 102 b = 1
• TP - 9493 non-donor instances were correctly identified as non-donors by the model.
• TN - 102 donor instances were correctly identified as donors by the model.
• FP - 1351 donor instances were incorrectly identified as non-donors by the model.
• FN - 506 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3525 198 a = 0
442 42 b = 1
• TP - 3525 non-donor instances were correctly identified as non-donors by the model.
• TN- 42 donor instances were correctly identified as donors by the model.
• FP- 442 donor instances were incorrectly identified as non-donors by the model.
• FN - 198 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3514 209 a = 0
445 39 b = 1
• TP - 3514 non-donor instances were correctly identified as non-donors by the model.
• TN - 39 donor instances were correctly identified as donors by the model.
• FP - 445 donor instances were incorrectly identified as non-donors by the model.
• FN - 209 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 84.8% on the test dataset (3567 of 4207 instances correct) and 84.5% on the evaluation dataset (3553 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.4 20 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 , RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT were the attributes used
to create model.
NaiveBayes 10 Fold Cross Validation for 20 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9405 594 a = 0
1309 144 b = 1
• TP - 9405 non-donor instances were correctly identified as non-donors by the model.
• TN- 144 donor instances were correctly identified as donors by the model.
• FP- 1309 donor instances were incorrectly identified as non-donors by the model.
• FN - 594 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3477 246 a = 0
431 53 b = 1
• TP - 3477 non-donor instances were correctly identified as non-donors by the model.
• TN- 53 donor instances were correctly identified as donors by the model.
• FP- 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 246 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3484 239 a = 0
431 53 b = 1
• TP - 3484 non-donor instances were correctly identified as non-donors by the model.
• TN - 53 donor instances were correctly identified as donors by the model.
• FP - 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 239 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 83.9% on the test dataset (3530 of 4207 instances correct) and 84.1% on the evaluation dataset (3537 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.5 25 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 , RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A were the attributes used to create model.
NaiveBayes 10 Fold Cross Validation Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9066 933 a = 0
1239 214 b = 1
• TP - 9066 non-donor instances were correctly identified as non-donors by the model.
• TN- 214 donor instances were correctly identified as donors by the model.
• FP- 1239 donor instances were incorrectly identified as non-donors by the model.
• FN - 933 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3350 373 a = 0
408 76 b = 1
• TP - 3350 non-donor instances were correctly identified as non-donors by the model.
• TN- 76 donor instances were correctly identified as donors by the model.
• FP- 408 donor instances were incorrectly identified as non-donors by the model.
• FN - 373 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3380 343 a = 0
386 98 b = 1
• TP - 3380 non-donor instances were correctly identified as non-donors by the model.
• TN - 98 donor instances were correctly identified as donors by the model.
• FP - 386 donor instances were incorrectly identified as non-donors by the model.
• FN - 343 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 81.4% on the test dataset (3426 of 4207 instances correct) and 82.7% on the evaluation dataset (3478 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.5 30 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 , RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS were the attributes
used to create model.
NaiveBayes 10 Fold Cross Validation 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8970 1029 a = 0
1213 240 b = 1
• TP - 8970 non-donor instances were correctly identified as non-donors by the model.
• TN- 240 donor instances were correctly identified as donors by the model.
• FP- 1213 donor instances were incorrectly identified as non-donors by the model.
• FN - 1029 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3305 418 a = 0
391 93 b = 1
• TP - 3305 non-donor instances were correctly identified as non-donors by the model.
• TN- 93 donor instances were correctly identified as donors by the model.
• FP - 391 donor instances were incorrectly identified as non-donors by the model.
• FN - 418 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3332 391 a = 0
377 107 b = 1
• TP - 3332 non-donor instances were correctly identified as non-donors by the model.
• TN - 107 donor instances were correctly identified as donors by the model.
• FP - 377 donor instances were incorrectly identified as non-donors by the model.
• FN - 391 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 80.8% on the test dataset (3398 of 4207 instances correct) and 81.7% on the evaluation dataset (3439 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.6 35 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 , RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT were the attributes used to create model.
NaiveBayes 10 Fold Cross Validation with 35 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8863 1136 a = 0
1187 266 b = 1
• TP - 8863 non-donor instances were correctly identified as non-donors by the model.
• TN- 266 donor instances were correctly identified as donors by the model.
• FP- 1187 donor instances were incorrectly identified as non-donors by the model.
• FN - 1136 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics with 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3274 449 a = 0
380 104 b = 1
• TP - 3274 non-donor instances were correctly identified as non-donors by the model.
• TN- 104 donor instances were correctly identified as donors by the model.
• FP - 380 donor instances were incorrectly identified as non-donors by the model.
• FN - 449 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics with 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3296 427 a = 0
372 112 b = 1
• TP - 3296 non-donor instances were correctly identified as non-donors by the model.
• TN - 112 donor instances were correctly identified as donors by the model.
• FP - 372 donor instances were incorrectly identified as non-donors by the model.
• FN - 427 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 80.3% on the test dataset (3378 of 4207 instances correct) and 81.0% on the evaluation dataset (3408 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
6.7 40 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 , RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT, LASTGIFT, AFC1, AFC2, IC23, IC14 were the attributes used
to create model.
NaiveBayes 10 Fold Cross Validation Statistics with 40 attributes
Algorithm: NaiveBayes algorithm was used to generate model.
Test Options: NaiveBayes algorithm with 10 crossfold settings.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8835 1164 a = 0
1185 268 b = 1
• TP - 8835 non-donor instances were correctly identified as non-donors by the model.
• TN- 268 donor instances were correctly identified as donors by the model.
• FP- 1185 donor instances were incorrectly identified as non-donors by the model.
• FN - 1164 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics with 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3259 464 a = 0
374 110 b = 1
• TP - 3259 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP - 374 donor instances were incorrectly identified as non-donors by the model.
• FN - 464 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics with 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 crossfold settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3290 433 a = 0
368 116 b = 1
• TP - 3290 non-donor instances were correctly identified as non-donors by the model.
• TN - 116 donor instances were correctly identified as donors by the model.
• FP - 368 donor instances were incorrectly identified as non-donors by the model.
• FN - 433 non-donor instances were incorrectly identified as donors by the model.
Note:
• Accuracy is 80.1% on the test dataset (3369 of 4207 instances correct) and 81.0% on the evaluation dataset (3406 of 4207), so the variation between the two is small.
• Hence the model does not appear to overfit the data.
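To compare the attribute subsets side by side, the evaluation-set confusion matrices reported in section 6 can be summarised in one pass. The sketch below is only an illustration: `accuracy` and `donor_recall` are hypothetical helpers, and the matrices are copied from the evaluation-set results above (rows are actual classes, columns are predicted classes).

```python
# Evaluation-set confusion matrices from section 6, keyed by attribute count.
# Layout: [[actual 0 -> pred 0, actual 0 -> pred 1],
#          [actual 1 -> pred 0, actual 1 -> pred 1]]
EVAL_MATRICES = {
    5: [[3651, 72], [471, 13]],
    10: [[3465, 258], [439, 45]],
    15: [[3514, 209], [445, 39]],
    20: [[3484, 239], [431, 53]],
    25: [[3380, 343], [386, 98]],
    30: [[3332, 391], [377, 107]],
    35: [[3296, 427], [372, 112]],
    40: [[3290, 433], [368, 116]],
}


def accuracy(m):
    """Share of all instances that were classified correctly."""
    return (m[0][0] + m[1][1]) / sum(map(sum, m))


def donor_recall(m):
    """Share of actual donors (class 1) that were predicted as donors."""
    return m[1][1] / (m[1][0] + m[1][1])


summary = {
    n: (round(accuracy(m), 3), round(donor_recall(m), 3))
    for n, m in EVAL_MATRICES.items()
}

for n, (acc, rec) in summary.items():
    print(f"{n:>2} attributes: accuracy={acc:.3f}  donor recall={rec:.3f}")
```

On these figures, overall accuracy falls as attributes are added while the share of donors that the model finds rises, so the two measures pull in opposite directions.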
7.0 Performance Metrics
7.1 Calculations for Each Model – Precision and Sensitivity
$$\text{Precision (PPV)} = \frac{TP}{TP + FP}$$
$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}$$

Class 0 is non-donor and class 1 is donor. For each class, TP is the number of instances of that class predicted correctly, FP is the number of instances of the other class predicted as this class, and FN is the number of instances of this class predicted as the other class. "undefined" replaces a spreadsheet #DIV/0! error, which occurs when no instance was predicted as that class (TP + FP = 0).

Model | Description | Actual 0, predicted 0 | Actual 0, predicted 1 | Actual 1, predicted 0 | Actual 1, predicted 1 | PPV (class 0) | PPV (class 1) | Recall (class 0) | Recall (class 1)
5.1 Model-1 | NaiveBayes 10 Fold Cross Validation Statistics | 8843 | 1156 | 1178 | 275 | 0.882 | 0.192 | 0.884 | 0.189
5.2 Model-2 | NaiveBayes 10 Fold Cross Validation with Test Set Statistics | 3265 | 458 | 374 | 110 | 0.897 | 0.194 | 0.877 | 0.227
5.2 Model-2 | NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics | 3289 | 434 | 367 | 117 | 0.900 | 0.212 | 0.883 | 0.242
5.3 Model-3 | J48Graft Training Set Model with Test Set Statistics | 3723 | 0 | 484 | 0 | 0.885 | undefined | 1.000 | 0.000
5.3 Model-3 | J48Graft Training Set Model with Evaluation Set Statistics | 3723 | 0 | 484 | 0 | 0.885 | undefined | 1.000 | 0.000
5.4 Model-4 | Decision Stump Cross Validation Model Statistics | 9999 | 0 | 1453 | 0 | 0.873 | undefined | 1.000 | 0.000
5.5 Model-5 | Decision Stump Cross Validation Model with Test Set Statistics | 9999 | 0 | 1453 | 0 | 0.873 | undefined | 1.000 | 0.000
5.5 Model-5 | Decision Stump Cross Validation Model with Evaluation Set Statistics | 9999 | 0 | 1453 | 0 | 0.873 | undefined | 1.000 | 0.000
5.6 Model-6 | OneR Cross Validation Model Statistics | 9571 | 428 | 1406 | 47 | 0.872 | 0.099 | 0.957 | 0.032
5.7 Model-7 | OneR Cross Validation Model with Test Set Statistics | 3564 | 159 | 455 | 29 | 0.887 | 0.154 | 0.957 | 0.060
5.7 Model-7 | OneR Cross Validation Model with Evaluation Set Statistics | 3566 | 157 | 454 | 30 | 0.887 | 0.160 | 0.958 | 0.062
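The per-class figures in this section can be recomputed from a confusion matrix with a minimal sketch such as the one below. The function name `precision_recall` is a hypothetical helper, and returning `None` for a zero denominator is an assumption made here to mirror the #DIV/0! cells of the original spreadsheet.

```python
def precision_recall(matrix, positive):
    """Precision (PPV) and recall for one class of a 2x2 confusion matrix.

    matrix[i][j] is the number of instances of actual class i predicted
    as class j; positive is the class (0 or 1) treated as positive.
    Returns None for a metric whose denominator is zero.
    """
    negative = 1 - positive
    tp = matrix[positive][positive]
    fp = matrix[negative][positive]  # other class predicted as positive
    fn = matrix[positive][negative]  # positive class predicted as other
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


model_1 = [[8843, 1156], [1178, 275]]  # Model-1, NaiveBayes cross-validation
zero_r_eval = [[3723, 0], [484, 0]]    # ZeroR on the evaluation set

p0, r0 = precision_recall(model_1, positive=0)
p1, r1 = precision_recall(model_1, positive=1)
print(round(p0, 3), round(r0, 3), round(p1, 3), round(r1, 3))
print(precision_recall(zero_r_eval, positive=1))
```

For the ZeroR matrix, no instance is predicted as a donor, so donor precision is undefined while donor recall is zero.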
Model | Description | Actual 0, predicted 0 | Actual 0, predicted 1 | Actual 1, predicted 0 | Actual 1, predicted 1 | PPV (class 0) | PPV (class 1) | Recall (class 0) | Recall (class 1)
5.8 Model-8 | ZeroR Cross Validation Model Statistics | 9999 | 0 | 1453 | 0 | 0.873 | undefined | 1.000 | 0.000
5.9 Model-9 | ZeroR Cross Validation Model with Test Set Statistics | 3723 | 0 | 484 | 0 | 0.885 | undefined | 1.000 | 0.000
5.9 Model-9 | ZeroR Cross Validation Model with Evaluation Set Statistics | 3723 | 0 | 484 | 0 | 0.885 | undefined | 1.000 | 0.000
5.10 Model-10 | NaiveBayes Cross Validation Model with 5 Attributes | 9795 | 204 | 1415 | 38 | 0.874 | 0.157 | 0.980 | 0.026
5.11 Model-11 | NaiveBayes Cross Validation Model with Test Set Statistics with 5 Attributes | 3649 | 74 | 473 | 11 | 0.885 | 0.129 | 0.980 | 0.023
5.11 Model-11 | NaiveBayes Cross Validation Model with Evaluation Set Statistics with 5 Attributes | 3651 | 72 | 471 | 13 | 0.886 | 0.153 | 0.981 | 0.027
5.12 Model-12 | NaiveBayes Cross Validation Model with 10 Attributes | 9320 | 679 | 1329 | 124 | 0.875 | 0.154 | 0.932 | 0.085
5.13 Model-13 | NaiveBayes Cross Validation Model with Test Set with 10 Attributes | 3472 | 251 | 438 | 46 | 0.888 | 0.155 | 0.933 | 0.095
5.13 Model-13 | NaiveBayes Cross Validation Model with Evaluation Set with 10 Attributes | 3465 | 258 | 439 | 45 | 0.888 | 0.149 | 0.931 | 0.093
5.14 Model-14 | NaiveBayes Cross Validation Model with 15 Attributes | 9493 | 506 | 1351 | 102 | 0.875 | 0.168 | 0.949 | 0.070
5.15 Model-15 | NaiveBayes Cross Validation Model with Test Set with 15 Attributes | 3525 | 198 | 442 | 42 | 0.889 | 0.175 | 0.947 | 0.087
5.15 Model-15 | NaiveBayes Cross Validation Model with Evaluation Set with 15 Attributes | 3514 | 209 | 445 | 39 | 0.888 | 0.157 | 0.944 | 0.081

Note: "undefined" marks a division by zero (TP + FP = 0), shown as #DIV/0! in the original spreadsheet. The evaluation-set row for 10 attributes uses 45 correctly identified donors, as in the confusion matrix reported in section 6.2 (439 + 45 = 484 donors).
Model | Description | Actual 0, predicted 0 | Actual 0, predicted 1 | Actual 1, predicted 0 | Actual 1, predicted 1 | PPV (class 0) | PPV (class 1) | Recall (class 0) | Recall (class 1)
5.16 Model-16 | NaiveBayes Cross Validation Model with 20 Attributes | 9405 | 594 | 1309 | 144 | 0.878 | 0.195 | 0.941 | 0.099
5.17 Model-17 | NaiveBayes Cross Validation Model with Test Set with 20 Attributes | 3477 | 246 | 431 | 53 | 0.890 | 0.177 | 0.934 | 0.110
5.17 Model-17 | NaiveBayes Cross Validation Model with Evaluation Set with 20 Attributes | 3484 | 239 | 431 | 53 | 0.890 | 0.182 | 0.936 | 0.110
5.18 Model-18 | NaiveBayes Cross Validation Model with 25 Attributes | 9066 | 933 | 1239 | 214 | 0.880 | 0.187 | 0.907 | 0.147
5.19 Model-19 | NaiveBayes Cross Validation Model with Test Set with 25 Attributes | 3350 | 373 | 408 | 76 | 0.891 | 0.169 | 0.900 | 0.157
5.19 Model-19 | NaiveBayes Cross Validation Model with Evaluation Set with 25 Attributes | 3380 | 343 | 386 | 98 | 0.898 | 0.222 | 0.908 | 0.202
5.20 Model-20 | NaiveBayes Cross Validation Model with 30 Attributes | 8970 | 1029 | 1213 | 240 | 0.881 | 0.189 | 0.897 | 0.165
5.21 Model-21 | NaiveBayes Cross Validation Model with Test Set with 30 Attributes | 3305 | 418 | 391 | 93 | 0.894 | 0.182 | 0.888 | 0.192
5.21 Model-21 | NaiveBayes Cross Validation Model with Evaluation Set with 30 Attributes | 3332 | 391 | 377 | 107 | 0.898 | 0.215 | 0.895 | 0.221
5.22 Model-22 | NaiveBayes Cross Validation Model with 35 Attributes | 8863 | 1136 | 1187 | 266 | 0.882 | 0.190 | 0.886 | 0.183
5.23 Model-23 | NaiveBayes Cross Validation Model with Test Set with 35 Attributes | 3274 | 449 | 380 | 104 | 0.896 | 0.188 | 0.879 | 0.215
5.23 Model-23 | NaiveBayes Cross Validation Model with Evaluation Set with 35 Attributes | 3296 | 427 | 372 | 112 | 0.899 | 0.208 | 0.885 | 0.231
5.24 Model-24 | NaiveBayes Cross Validation Model with 40 Attributes | 8835 | 1164 | 1185 | 268 | 0.882 | 0.187 | 0.884 | 0.184
5.25 Model-25 | NaiveBayes Cross Validation Model with Test Set with 40 Attributes | 3259 | 464 | 374 | 110 | 0.897 | 0.192 | 0.875 | 0.227
5.25 Model-25 | NaiveBayes Cross Validation Model with Evaluation Set with 40 Attributes | 3290 | 433 | 368 | 116 | 0.899 | 0.211 | 0.884 | 0.240
7.2 Calculations for Each Model – Specificity and NPV
Model -
Number
Model Description
0 1 0 1
a b Classified as TN 275 8,843 TN 275 8,843
8,843 1,156 a = 0 FP 1,178 1,156 FN 1,156 1,178
1,178 275 b = 1 Specificity 0.189 0.884 NPV 0.192 0.882
0 1 0 1
a b Classified as TN 110 3,265 TN 110 3,265
3265 458 a = 0 FP 374 458 FN 458 374
374 110 b = 1 Specificity 0.227 0.877 NPV 0.194 0.897
0 1 0 1
a b Classified as TN 117 3,289 TN 117 3,289
3289 434 a = 0 FP 367 434 FN 434 367
367 117 b = 1 Specificity 0.242 0.883 NPV 0.212 0.900
0 1 0 1
a b Classified as TN 0 3,723 TN 0 3,723
3723 0 a = 0 FP 484 0 FN 0 484
484 0 b = 1 Specificity 0.000 1.000 NPV #DIV/0! 0.885
0 1 0 1
a b Classified as TN 0 3,723 TN 0 3,723
3723 0 a = 0 FP 484 0 FN 0 484
484 0 b = 1 Specificity 0.000 1.000 NPV #DIV/0! 0.885
0 1 0 1
a b Classified as TN 0 9,999 TN 0 9,999
9999 0 a = 0 FP 1,453 0 FN 0 1,453
1453 0 b = 1 Specificity 0.000 1.000 NPV #DIV/0! 0.873
0 1 0 1
a b Classified as TN 0 9,999 TN 0 9,999
9999 0 a = 0 FP 1,453 0 FN 0 1,453
1453 0 b = 1 Specificity 0.000 1.000 NPV #DIV/0! 0.873
0 1 0 1
a b Classified as TN 0 9,999 TN 0 9,999
9999 0 a = 0 FP 1,453 0 FN 0 1,453
1453 0 b = 1 Specificity 0.000 1.000 NPV #DIV/0! 0.873
0 1 0 1
a b Classified as TN 47 9,571 TN 47 9,571
9571 428 a = 0 FP 1,406 428 FN 428 1,406
1406 47 b = 1 Specificity 0.032 0.957 NPV 0.099 0.872
0 1 0 1
a b Classified as TN 29 3,564 TN 29 3,564
3564 159 a = 0 FP 455 159 FN 159 455
455 29 b = 1 Specificity 0.060 0.957 NPV 0.154 0.887
0 1 0 1
a b Classified as TN 30 3,566 TN 30 3,566
3566 157 a = 0 FP 454 157 FN 157 454
454 30 b = 1 Specificity 0.062 0.958 NPV 0.160 0.887
Models and evaluation settings (one confusion matrix each):
5.1 Model-1: NaiveBayes Ten Fold Cross Validation Statistics
5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Test Set Statistics; NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics
5.3 Model-3: Decision Stump Cross Validation Model Statistics
5.4 Model-4: J48Graft Training Set Model with Test Set Statistics; J48Graft Training Set Model with Evaluate Set Statistics
5.5 Model-5: Decision Stump Cross Validation Model with Test Set Statistics; Decision Stump Cross Validation Model with Evaluate Set Statistics
5.6 Model-6: OneR Cross Validation Model Statistics
5.7 Model-7: OneR Cross Validation Model with Test Set Statistics; OneR Cross Validation Model with Evaluation Set Statistics
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
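The two formulas can be checked against the confusion matrices with a short Python sketch. It is only an illustration, not the spreadsheet used in the report: the function names and the `matrix[actual][predicted]` layout are assumptions, and a zero denominator (shown as #DIV/0! in the original tables) is returned as `None`.

```python
from typing import List, Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def specificity(tn: int, fp: int) -> Optional[float]:
    """Specificity = TN / (TN + FP)."""
    return safe_ratio(tn, tn + fp)


def npv(tn: int, fn: int) -> Optional[float]:
    """NPV = TN / (TN + FN)."""
    return safe_ratio(tn, tn + fn)


def class_stats(matrix: List[List[int]], positive: int) -> dict:
    """Compute TN, FP, FN, Specificity and NPV for a 2x2 matrix.

    matrix[actual][predicted] holds counts (assumed layout);
    positive is the class (0 or 1) treated as the positive class.
    """
    negative = 1 - positive
    tn = matrix[negative][negative]  # actual negative, predicted negative
    fp = matrix[negative][positive]  # actual negative, predicted positive
    fn = matrix[positive][negative]  # actual positive, predicted negative
    return {
        "TN": tn,
        "FP": fp,
        "FN": fn,
        "specificity": specificity(tn, fp),
        "npv": npv(tn, fn),
    }


# Counts taken from the first confusion matrix in this section.
first_matrix = [[8843, 1156], [1178, 275]]
stats_class0 = class_stats(first_matrix, positive=0)
stats_class1 = class_stats(first_matrix, positive=1)

# A matrix in which every record is classified as class 0.
degenerate = [[3723, 0], [484, 0]]
stats_degenerate = class_stats(degenerate, positive=0)
```

For the first matrix this gives the same rounded values as the table (0.189 and 0.192 for class 0, 0.884 and 0.882 for class 1), and the degenerate matrix yields an undefined NPV for class 0.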
Veterans_DataMining

  • 1. Donor Datamining Course 4957 – Special Topic Datamining Jalaj Nautiyal 7-9-2015
  • 4. 1.0 Executive Summary For this project we were provided with a veterans dataset. The dataset consists of many attributes collected from sources such as the census and mailing lists. The objective of the project is to use this dataset to identify probable donors. Data mining concepts taught in the course were used to analyze the dataset, build models, and identify probable donors. The dataset was first analyzed for data types and missing data. Weka was used to preprocess, analyze, and interpret the data. Some of the important attributes were selected from the target dataset and analyzed in depth to understand the correlations among them and how much of the variation in the selected attributes explains the probability of a future donation. To select attributes that can predict a probable donor, a two-pronged approach was used. First, a logic-based selection of attributes from the target dataset was conducted. In addition, Weka’s ChiSquared, GainRatioAttributeEval, and InfoGainAttributeEval evaluators were used to identify top-ranked attributes. I then compared the results of the two approaches and included or excluded a few attributes to arrive at the final list. The final list of attributes was run through the NaiveBayes, J48Graft, DecisionStump, OneR, and ZeroR algorithms to check whether the attributes selected by this methodology are useful for predicting probable donors. The methodology and results are described in detail in the paper. The generated models were then applied to the final dataset, and various statistics were computed in Weka to assess how accurately each model fits the data. A list of control numbers was selected from this final dataset and submitted.
  • 5. 2.0 Data Set Description The dataset has 47,705 records, each with 441 columns. The attributes and their associated statistics are described below. 2.1 Attributes 2.1.1 Location The geographic location of a person is important for identifying probable future donors. Of the many possible attributes in the dataset, I analyzed state together with income levels to obtain the distribution of family and household incomes. A few of the states analyzed are shown below. Figure 1 Income Distribution - California Figure 2 Income Distribution - Colorado
  • 6. Figure 3 Income Distribution - Florida 2.1.2 Income Level Expendable income is an important consideration when estimating future donors, so the number of people below the poverty line in each state also indicates which states are likely to yield probable donors. Figure 4 Income Level
  • 7. 2.1.3 Education Education is an important parameter in my analysis, and the number of magazine subscriptions serves as a proxy for the level of education. The higher the education level, the higher the probability that a person is a probable donor. Figure 5 Magazine Subscription 2.1.4 Median Home Value Income is an important parameter for identifying probable donors. The following chart shows the distribution of HV1 (median home value) across 50 states. The higher a state’s HV1, the more likely that state is to yield probable donors. Figure 6 Median Home Value
  • 8. 2.1.5 Number of Donors The current number of donors in each state of the target dataset is important information for identifying probable future donors, and the following chart provides it. Figure 7 Number of Donors 2.1.6 Dollar Amount of Gift The dollar amount of lifetime gifts to date by current donors is an important attribute for estimating the probability of a future donation. RAMNTALL is the dollar amount of lifetime gifts to date, and the following chart shows the average value of RAMNTALL in different states. Figure 8 Life Time Dollar Amount
  • 9. LASTGIFT is the dollar amount of the most recent gift. This attribute gives an idea of current donors’ most recent giving. The following chart shows the average LASTGIFT for different states. Figure 9 Average Gift Amount of most recent gift 2.1.7 Average Dollar Amounts of gifts The average dollar amount of gifts to date is also an important attribute for estimating future donors. The following chart shows, for different states, the average of AVGGIFT (average dollar amount of gifts to date). Figure 10 Avg Dollar Amt of gifts to date
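The per-state averages behind charts such as these can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the report: the `STATE` key and the sample records are assumptions made up for the example, and records with a missing value are simply skipped.

```python
from collections import defaultdict


def average_by_state(records, attribute, state_key="STATE"):
    """Average a numeric attribute per state, skipping missing values.

    records is a list of dicts; state_key is an assumed column name.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        value = record.get(attribute)
        if value is None:
            continue  # missing values do not contribute to the average
        totals[record[state_key]] += value
        counts[record[state_key]] += 1
    return {state: totals[state] / counts[state] for state in totals}


# Illustrative records only; these are not values from the dataset.
sample_records = [
    {"STATE": "CA", "RAMNTALL": 100.0, "AVGGIFT": 10.0},
    {"STATE": "CA", "RAMNTALL": 50.0, "AVGGIFT": 20.0},
    {"STATE": "FL", "RAMNTALL": 30.0, "AVGGIFT": None},
]

avg_ramntall = average_by_state(sample_records, "RAMNTALL")
avg_avggift = average_by_state(sample_records, "AVGGIFT")
```

The same helper applies to any of the numeric attributes discussed in this section, for example LASTGIFT or WWIIVETS.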
  • 10. 2.1.8 Military Association Past or current association with the military is an important attribute for estimating probable donors. WWIIVETS gives the percentage of WWII veterans. The following chart shows the average percentage of WWII veterans for different states. Figure 11 Avg World War II Vets 2.1.9 Type of Donor and RFA RFA is an important attribute for identifying probable donors. The following chart shows, for different states, the Super donors who participated in RFA_3 through RFA_23. Figure 12 Number of Super Donors for RFA_3 to RFA_23 The following chart provides the same information for Active donors who participated in RFA_3 through RFA_23 in different states.
  • 11. Figure 13 Number of Active Donors for RFA_3 to RFA_23 2.1.10 Dollar Gift in 97NK TARGET_D is the dollar amount associated with the response to the 97NK mailing. The chart below shows the sum of TARGET_D for different states. This information is important because it indicates how much current donors give. Figure 14 Sum of Dollar Gift Amt to 97 Mailing list
  • 12. 2.1.11 Per Capita Per capita income of the donors (IC5) is an important attribute for identifying future donors. The following chart shows the average per capita income of donors across different states. Figure 15 Per Capita Income 2.1.12 Correlation Matrix The following table shows the correlation matrix for some of the attributes, to help understand how different attributes relate to a person being a probable donor. This helped me select attributes worth considering for the final model. Figure 16 Correlation Matrix
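A correlation matrix of this kind can be reproduced with a small Python sketch. It assumes the Pearson correlation coefficient, which the report does not name explicitly, and the function names and sample columns are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns None when either sequence has zero variance.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations from {name: values}."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Illustrative columns only; these are not values from the dataset.
sample_columns = {
    "IC5": [1.0, 2.0, 3.0],
    "HV1": [2.0, 4.0, 6.0],
    "POP901": [3.0, 2.0, 1.0],
}
matrix = correlation_matrix(sample_columns)
```

Attributes whose correlation with the target is close to zero are natural candidates for exclusion, while strongly correlated pairs of predictors may be redundant.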
  • 13. 2.1.13 Regression Coefficients Based on the correlation coefficients, I ran a regression in Excel to determine how well these attributes represent TARGET_B and which variables have explanatory power, as a starting point for reducing the number of attributes.
  • 14. 2.2 Attribute Data-Type Change of attribute datatype
• I changed the datatype of attributes IC6 to IC23 from varchar to int so that I could perform calculations such as sum and average on their values.
• I changed the datatype of attribute HHSA4 from varchar to int for the same reason.
• Table 1 below lists further attributes whose datatype I changed from varchar to int so that I could perform calculations such as sum and average on their values.
Table 1
Attribute | Original Datatype | Changed Datatype
MBGARDEN | varchar | int
MBCRAFT | varchar | int
MBBOOKS | varchar | int
MBCOLECT | varchar | int
MAGFAML | varchar | int
MAGFEM | varchar | int
MAGMALE | varchar | int
PUBGARDN | varchar | int
PUBHLTH | varchar | int
PUBDOITY | varchar | int
PUBNEWFN | varchar | int
PUBPHOTO | varchar | int
PUBOPP | varchar | int
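The effect of the varchar-to-int change can be illustrated in Python. This is a sketch under the assumption that blank or non-numeric text should be treated as missing; the report does not say how such values were handled, and the helper names and sample rows are hypothetical.

```python
def to_int(value):
    """Convert a varchar-style value to int; blank or non-numeric text becomes None."""
    if value is None:
        return None
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def convert_columns(rows, columns):
    """Return copies of the rows with the named columns converted to int."""
    return [
        {key: (to_int(val) if key in columns else val) for key, val in row.items()}
        for row in rows
    ]


# Illustrative rows only; these are not values from the dataset.
raw_rows = [
    {"CONTROLN": "A1", "MBBOOKS": "2", "MAGFEM": " "},
    {"CONTROLN": "A2", "MBBOOKS": "5", "MAGFEM": "1"},
]
typed_rows = convert_columns(raw_rows, {"MBBOOKS", "MAGFEM"})

# Sums and averages now work on the converted values.
books = [row["MBBOOKS"] for row in typed_rows if row["MBBOOKS"] is not None]
total_books = sum(books)
average_books = total_books / len(books)
```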
  • 15. 3.0 Missing Data While analyzing the target dataset, I found the following missing or malformed data. Each item is listed with the processing applied so that the data could be used with the different algorithms. 3.1 Gender I used the TCODE value to identify gender for records with a missing gender value. 3.2 Zip I corrected the values of the zip attribute. Some zip values had a ‘-’ at the end, which I removed. 3.3 Age and DOB The age attribute had many null values. The dataset also has a DOB (date of birth) attribute, which I checked in order to fill in the missing ages. However, for every record with a missing age, the corresponding DOB value was also missing. I then considered replacing the missing ages with a statistic, such as the average age of current donors in each state. Because the number of records with a null age was very large, I decided against this, as imputing so many values could bias the dataset considerably.
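The zip correction and the check on how much of the age attribute is missing can be sketched as follows. The function names and sample values are assumptions for illustration; the report does not describe the exact tooling used for these steps.

```python
def clean_zip(zip_code):
    """Remove surrounding whitespace and any trailing '-' from a zip value."""
    return zip_code.strip().rstrip("-")


def missing_fraction(values):
    """Fraction of values that are None or blank strings."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


# Illustrative values only; these are not records from the dataset.
cleaned = [clean_zip(z) for z in ["61801-", "90210", " 33101- "]]
age_missing = missing_fraction([60, None, "", 45])
```

A high missing fraction, as found for age, is the kind of signal that argues against imputing a state-level average.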
  • 16. 4.0 Attribute Selection 4.1 Methodology Step-1: Logic Based Selection For attribute selection, I analyzed the target dataset with the objective of identifying possible donors. The assumptions I used when selecting attributes from the list of possible attributes are listed below. These became the evaluation conditions for my logic-based attribute selection:
1. Higher education implies a higher probability that a person will be a future donor.
2. Higher income (per capita income, income level, number of vehicles, median income, number of employed persons, etc.) implies a higher probability that a person will be a future donor.
3. A house in a good locality implies a higher probability that a person will be a future donor.
4. A bigger house implies a higher probability that a person will be a future donor.
5. A person who rents and pays a high rent has a higher probability of being a future donor.
6. A person on active duty, or who served in the military in the past, has a higher probability of being a future donor.
7. The dollar amount, frequency, and recency of gifts all give a good idea of the probability of future donations.
8. The number, type, and recency of promotions, and a person's responsiveness to them, give a good idea of the probability of future donations.
9. I also identified some attributes that are negatively correlated with the possibility of future donations, which makes them equally important for estimating future donors:
a. Persons living on social security will probably not be future donors.
b. Persons working in professions with little expendable income will probably not be future donors.
c. Persons living in rural areas are less likely to donate.
I identified a total of 122 attributes out of 448 possible attributes. After selecting these 122 attributes, I applied the Chi Squared Attribute Evaluator in Weka to cross-reference my logic-based selection and apply what I learned in class.
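To make the cross-referencing step concrete, the sketch below ranks nominal attributes by their chi-squared statistic against the class. It is a plain-Python illustration of the statistic, not Weka's evaluator: it assumes nominal attribute values, and the function names and sample columns are hypothetical.

```python
from collections import Counter


def chi_squared(attribute_values, class_labels):
    """Chi-squared statistic of a nominal attribute against the class labels."""
    n = len(attribute_values)
    joint = Counter(zip(attribute_values, class_labels))
    value_totals = Counter(attribute_values)
    class_totals = Counter(class_labels)
    statistic = 0.0
    for value, value_count in value_totals.items():
        for label, class_count in class_totals.items():
            # Expected count if the attribute and the class were independent.
            expected = value_count * class_count / n
            observed = joint.get((value, label), 0)
            statistic += (observed - expected) ** 2 / expected
    return statistic


def rank_attributes(columns, class_labels):
    """Return (attribute name, score) pairs sorted from highest to lowest score."""
    scores = {
        name: chi_squared(values, class_labels) for name, values in columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative data only; the attribute names are placeholders.
labels = [1, 1, 0, 0]
sample_columns = {
    "weak_attribute": ["a", "b", "a", "b"],
    "strong_attribute": ["a", "a", "b", "b"],
}
ranking = rank_attributes(sample_columns, labels)
```

Attributes that rank highly here and also satisfy the logic-based conditions are the ones that would survive both selection steps.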
The output from Chi Squared Attribute evaluator was analyzed and compared with my subset of 122 attributes and following are list of 95 attributes which matched my logic based selection and Chi Squared Attribute evaluator output. 4.2 Included Attributes Attribute Name Attribute Description Reason TARGET_D Donation Amount (in $) associated with the Response to 97NK Mailing Donor amount gives indication of the amount of donation a donor can provide based on the 1997 donation history P a g e 15 | 100
  • 17. Donor Datamining | Jalaj Nautiyal Attribute Name Attribute Description Reason IC5 Per Capita Income Per capita income gives indication of the wealth of the donor base. Higher the per capita income likely will be the donation probability ZIP Zipcode Zip code provides the likelihood of donor’s with similar income range and probability of donation POP901 Number of Persons in donor’s neighborhood, as collected from the 1990 US Census. Number of person’s in donor neighborhood is indicative of income range as richer neighborhood tends to have lower number of persons. AVGGIFT Average dollar amount of gifts to date Average amount of gift to date is an attribute which provides good indication of probability of donation. HV1 Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census. Median home value is indication of income and higher the median home value higher probability of donation HV2 Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census. Average home value is indication of income and higher the median home value higher probability of donation RAMNTALL Dollar amount of lifetime gifts to date Total Dollar amount of gift to date is an attribute which provides good indication of probability of donation. IC3 Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. Average household income is indication of income and higher the average household income higher probability of donation IC2 Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. Median Family income is indication of income and higher the median family income higher probability of donation IC4 Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. 
Average Family income is indication of income and higher the average family income higher probability of donation IC1 Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census. Median household income is indication of income and higher the median household income higher probability of donation OSOURCE Code indicating which mailing list the donor was originally acquired from This attribute indicates the chances of donation based on the marketing approach to the donor. LASTGIFT Dollar amount of most recent gift from giving history file Dollar amount of recent gift is an attribute which provides good indication of probability of donation as higher dollar amount of recent gift P a g e 16 | 100
  • 18. Donor Datamining | Jalaj Nautiyal Attribute Name Attribute Description Reason gives higher likelihood of donor to repeat donation MAXRAMNT Dollar amount of largest gift to date from giving history file Dollar amount of largest gift is an attribute which provides good indication of probability of donation as larger the gift amount higher will be the likelihood of donor to repeat donation RFA_3 Donor's RFA status as of 96NK promotion date RFA is good measure of the probability of the donor to repeat the donation. RFA_4 Donor's RFA status as of 96TK promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. RFA_6 Donor's RFA status as of 96LL promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. RFA_8 Donor's RFA status as of 96GK promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. RFA_2 Donor's RFA status as of 97NK promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. FISTDATE Date of first gift from giving history file The date of first gift gives an idea of how long the donor has been involved in donation is good attribute for estimating future donations. RFA_12 Donor's RFA status as of 96XK promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. MINRAMNT Dollar amount of smallest gift to date from giving history file Dollar amount of the smallest gift is negatively correlated to the future donation and hence is a good attribute to estimate future donation RFA_11 Donor's RFA status as of 96X1 promotion date from promotion history File RFA is good measure of the probability of the donor to repeat the donation. 
Attribute Name | Attribute Description | Reason
NGIFTALL | Number of lifetime gifts to date from promotion history file | The number of lifetime gifts to date is a good measure for estimating future donations: the higher the number of lifetime gifts, the higher the probability of future donations.
RFA_2F | Frequency code for RFA_2 | The frequency component of the RFA measure indicates how frequently the donor has donated in the past and hence is a good estimator of future donations.
RFA_7 | Donor's RFA status as of 96G1 promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2A | Donation Amount code for RFA_2 | The amount component of the RFA measure indicates how much the donor has donated in the past and hence is a good estimator of future donations.
RFA_9 | Donor's RFA status as of 96CC promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
MAXRDATE | Date associated with the largest gift to date | The date of the largest gift shows how long ago the donor made the largest gift and is a good estimator of future donations.
NUMPROM | Lifetime number of promotions received to date | The lifetime number of promotions is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donations.
CARDGIFT | Number of lifetime gifts to card promotions to date | The number of lifetime gifts to card promotions is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donations.
RFA_16 | Donor's RFA status as of 95LL promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
ODATEDW | Date of donor's first gift | The date of the first gift indicates how long the donor has been involved in donating and is a good attribute for estimating future donations.
NEXTDATE | Date of second gift | The date of the second gift indicates how long the donor has been involved in donating and is a good attribute for estimating future donations.
RFA_14 | Donor's RFA status as of 95NK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
HVP1 | Percent Home Value >= $200,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
RFA_18 | Donor's RFA status as of 95GK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
Attribute Name | Attribute Description | Reason
RFA_5 | Donor's RFA status as of 96SK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
MINRDATE | Date associated with the smallest gift to date | The date of the smallest gift is negatively correlated with future donation and hence is a good attribute for estimating future donations.
HVP2 | Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
HVP6 | Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
RFA_19 | Donor's RFA status as of 95CC promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RFA_10 | Donor's RFA status as of 96WL promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
RP1 | Percent Renters Paying >= $500 per Month | Home rent is an indication of income: the higher the rent, the higher the probability of donation.
CARDPROM | Lifetime number of card promotions received to date. | The lifetime number of card promotions received is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donations.
RFA_17 | Donor's RFA status as of 95G1 promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
HHAS3 | Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990 US Census. | Households with interest, rental or dividend income are an indication of income: the higher the percentage, the higher the probability of donation.
HVP3 | Percent Home Value >= $100,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
RFA_13 | Donor's RFA status as of 95FS promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
SEC5 | Percent Persons in College in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of persons in college is an indication of education: the better the education, the higher the probability of donation.
LFC3 | Percent Females in Labor Force in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of females in the labor force is an indication of income: the higher the percentage, the higher the probability of donation.
HVP5 | Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
NUMPRM12 | Number of promotions received in the last 12 months | The number of promotions received in the last 12 months is a good estimator of future donations, as it shows how responsive the donor is to the marketing effort for donations.
EC4 | Percent Adults 25+ Completed High School or Equivalency in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adults who completed high school or equivalent is an indication of education: the better the education, the higher the probability of donation.
HU5 | Percent Seasonal/Recreational Vacant Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of seasonal/recreational vacant units is an indication of income: the higher the percentage, the higher the probability of donation.
HUR2 | Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of houses with 6+ rooms is an indication of income: the higher the percentage, the higher the probability of donation.
HVP4 | Percent Home Value >= $75,000 in donor’s neighborhood, as collected from the 1990 US Census. | Home value is an indication of income: the higher the home value, the higher the probability of donation.
LFC5 | Percent Adult Females Employed in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adult females employed is an indication of income: the higher the percentage, the higher the probability of donation.
DW1 | Percent Single Unit Structure in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of single unit structures is negatively correlated with future donation and hence is a good attribute for estimating future donations.
HU1 | Percent Owner Occupied Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of owner-occupied housing is negatively correlated with future donation and hence is a good attribute for estimating future donations.
AFC1 | Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adults in active military service is an indication of association with the military: the higher the percentage, the higher the probability of donation.
VC3 | Percent WW2 Veterans Age 16+ in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of WW2 veterans is an indication of association with the military: the higher the percentage, the higher the probability of donation.
Attribute Name | Attribute Description | Reason
RP3 | Percent Renters Paying >= $300 per Month in donor’s neighborhood, as collected from the 1990 US Census. | Home rent is an indication of income: the higher the rent, the higher the probability of donation.
DW2 | Percent Detached Single Unit Structure in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of detached single unit structures is negatively correlated with future donation and hence is a good attribute for estimating future donations.
RFA_22 | Donor's RFA status as of 95XK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
WWIIVETS | % WWII Vets | The percentage of WW2 veterans is an indication of association with the military: the higher the percentage, the higher the probability of donation.
HHN2 | Percent 2 Person Households in donor’s neighborhood, as collected from the 1990 US Census. | A 2 person household indicates no future liability on the part of the donor, and hence the person is more likely to be a future donor.
VOC2 | Percent Households w/ 2+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of households with 2+ vehicles is an indication of income: the higher the number of vehicles, the higher the probability of donation.
HU2 | Percent Renter Occupied Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of renter-occupied housing is positively correlated with future donation, provided the rent paid is high, and hence is a good attribute for estimating future donations.
AGE | Overlay Age | The higher the age, the higher the probability of future donations.
HC4 | Percent Owner Occupied Structures Built Since 1985 in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of owner-occupied housing is negatively correlated with future donation, as the owner has only one asset, and hence is a good attribute for estimating future donations.
LFC7 | Percent 2 Parent Earner Families in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of 2 parent earner families is an indication of income: the higher the percentage, the higher the probability of donation.
RP2 | Percent Renters Paying >= $400 per Month in donor’s neighborhood, as collected from the 1990 US Census. | Home rent is an indication of income: the higher the rent, the higher the probability of donation.
AFC2 | Percent Males in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of males in active military service is an indication of association with the military: the higher the percentage, the higher the probability of donation.
Attribute Name | Attribute Description | Reason
IC23 | Percent Families w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of families with income >= $150,000 is an indication of income: the higher the percentage, the higher the probability of donation.
OCC9 | Percent Farmers in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of farmers is negatively correlated with future donation, as farmers generally do not have an expendable surplus, and hence is a good attribute for estimating future donations.
IC14 | Percent Households w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of households with income >= $150,000 is an indication of income: the higher the percentage, the higher the probability of donation.
EC1 | Median Years of School Completed by Adults 25+ in donor’s neighborhood, as collected from the 1990 US Census. | The median years of school completed is an indication of education: the better the education, the higher the probability of donation.
LASTDATE | Date associated with the most recent gift | The date of the most recent gift indicates how long the donor has been involved in donating and is a good attribute for estimating future donations.
LFC6 | Percent Mothers Employed Married and Single in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of employed mothers, married and single, gives an idea of the income and liabilities of the neighborhood and is a good indication of future donations.
HUPA6 | Percent Renter Occupied, 5+ Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of renter-occupied 5+ unit structures is an indication of income: the higher the percentage, the higher the probability of donation.
RFA_24 | Donor's RFA status as of 94NK promotion date | RFA is a good measure of the probability that the donor will repeat the donation.
HU4 | Percent Vacant Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of vacant housing units in the donor's neighborhood is a good indication of future donations, as a higher number of vacant units implies a higher number of second homes.
HHAS1 | Percent Households on Social Security in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of households on social security is negatively correlated with future donation, as persons on social security seldom donate, and hence is a good attribute for estimating future donations.
HC19 | Percent Housing Units w/ Public Sewer Source in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of housing with public sewer is negatively correlated with future donation, as housing with public sewer is generally low income housing, and hence is a good attribute for estimating future donations.
VOC1 | Percent Households w/ 1+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of households with 1+ vehicles is an indication of income: the higher the number of vehicles, the higher the probability of donation.
HC7 | Percent Owner Occupied Structures Built Since 1960 in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of owner-occupied housing is negatively correlated with future donation, as the owner has only one asset, and hence is a good attribute for estimating future donations.
POP90C2 | Percent Population Outside Urbanized Area in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of population outside urbanized areas is negatively correlated with future donation, as people outside urbanized areas seldom donate, and hence is a good attribute for estimating future donations.
EIC1 | Percent Employed in Agriculture in donor’s neighborhood, as collected from the 1990 US Census. | The percentage employed in agriculture is negatively correlated with future donation, as people employed in agriculture seldom have expendable income, and hence is a good attribute for estimating future donations.
HC8 | Percent Owner Occupied Structures Built Prior to 1960 in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of owner-occupied housing is negatively correlated with future donation, as the owner has only one asset, and hence is a good attribute for estimating future donations.
MALEMILI | % Males active in the Military in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of males active in the military is an indication of association with the military: the higher the percentage, the higher the probability of donation.
LFC2 | Percent Adult Males in Labor Force in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of adult males in the labor force is an indication of income: the higher the percentage, the higher the probability of donation.
OEDC5 | Percent Private Profit Wage or Salaried Worker in donor’s neighborhood, as collected from the 1990 US Census. | The percentage of private profit wage or salaried workers is an indication of income: the higher the percentage, the higher the probability of donation, as private profit wages are generally high and the person has expendable income.

In analyzing the output of the Chi Squared attribute evaluator, I also eliminated some of the highly ranked attributes from the output generated by Weka. An attribute was eliminated either because it was irrelevant to the objective of identifying future donors or because of multicollinearity among the attributes (more than one attribute conveying the same information). The purpose is to select an appropriate number of attributes that help predict future donors. The following attributes were eliminated from my final attribute selection, each with the reason for its elimination.

4.3 Omitted Attributes
Attribute Name | Attribute Description | Reason
CONTROLN | Control number (unique record identifier) | This is a unique identifier and adds no value in identifying future donors.
POP903 | Number of Households in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with an attribute in our final attribute list (number of persons in neighborhood, POP901) and thus adds no additional value in identifying future donors.
POP902 | Number of Families in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with an attribute in our final attribute list (number of persons in neighborhood, POP901) and thus adds no additional value in identifying future donors.
DOB | Date of birth of Donor | This attribute is multicollinear with an attribute in our final attribute list (AGE) and thus adds no additional value in identifying future donors.
HHP2 | Average Person Per Household in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with an attribute in our final attribute list (number of persons in neighborhood, POP901) and thus adds no additional value in identifying future donors.
HHP1 | Median Person Per Household in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with an attribute in our final attribute list (number of persons in neighborhood, POP901) and thus adds no additional value in identifying future donors.
MSA | MSA Code | This is a code and adds no value in identifying future donors.
ADI | ADI Code | This is a code and adds no value in identifying future donors.
DMA | DMA Code | This is a code and adds no value in identifying future donors.
TPE13 | Percent Traveling 15 - 59 Minutes to Work | This attribute adds no value in identifying future donors, as travel time does not decide whether a person will be a future donor.
ETHC3 | Percent White Age 60+ in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as a person's race does not decide whether a person will be a future donor.
TCODE | Donor title code | This attribute adds no value in identifying future donors, as a person's title does not decide whether a person will be a future donor.
DW7 | Percent Group Quarters in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as living in group quarters does not decide whether a person will be a future donor.
POBC2 | Percent Born in State of Residence in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as a person's affinity to a place does not decide whether a person will be a future donor.
MARR4 | Percent Never Married in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as marital status does not decide whether a person will be a future donor.
VC4 | Percent Veterans Serving After May 1975 Only in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with more than one attribute in our final attribute list (number of military service persons, active military, etc.) and thus adds no additional value in identifying future donors.
DW9 | Non-Institutional Group Quarters in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as living in group quarters does not decide whether a person will be a future donor.
HU3 | Percent Occupied Housing Units in donor’s neighborhood, as collected from the 1990 US Census. | This attribute is multicollinear with more than one attribute in our final attribute list (number of vacant houses, house type, number of houses, etc.) and thus adds no additional value in identifying future donors.
ETH1 | Percent White in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as a person's race does not decide whether a person will be a future donor.
MARR1 | Percent Married in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as marital status does not decide whether a person will be a future donor.
HHN1 | Percent 1 Person Households in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as living alone does not decide whether a person will be a future donor.
HHD3 | Percent Married Couple Families in donor’s neighborhood, as collected from the 1990 US Census. | This attribute adds no value in identifying future donors, as marital status does not decide whether a person will be a future donor.
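The multicollinearity argument used for the omitted attributes can be checked numerically. The following is a minimal sketch, not the procedure used in this report: it computes a Pearson correlation between two neighborhood attributes and flags the pair as redundant above a cut-off. The sample values, the 0.9 threshold and the function name `pearson` are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for five neighborhoods (NOT taken from the donor data set).
persons = [1200, 3400, 560, 8900, 2300]      # stands in for POP901
households = [450, 1300, 210, 3300, 900]     # stands in for POP903

THRESHOLD = 0.9  # assumed cut-off for treating two attributes as redundant

r = pearson(persons, households)
redundant = abs(r) > THRESHOLD
print(f"correlation = {r:.3f}, redundant = {redundant}")
```

With real data, the two lists would be replaced by the corresponding columns of the training file.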
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.
Because the number of attributes (95 attributes) resulted in unsatisfactory statistics on the trained model, I ran the attribute selection methods available in Weka to gain a better understanding of the impact of the attributes on model accuracy. I ran Chi Squared, GainRatioAttributeEval and InfoGainAttributeEval; the ranked attributes (attribute index and name) from these three evaluators are listed below.

Chi Squared | Gain Ratio | Information Gain
472 TARGET_D | 472 TARGET_D | 73 IC5
470 CONTROLN | 362 ADATE_2 | 74 ZIP
203 IC5 | 363 ADATE_3 | 76 POP901
5 ZIP | 470 CONTROLN | 79 HV2
76 POP901 | 203 IC5 | 78 HV1
469 AVGGIFT | 5 ZIP | 77 AVGGIFT
146 HV1 | 14 MDMAUD | 83 IC4
147 HV2 | 76 POP901 | 82 IC2
78 POP903 | 469 AVGGIFT | 81 IC3
77 POP902 | 366 ADATE_6 | 84 IC1
457 RAMNTALL | 147 HV2 | 80 RAMNTALL
201 IC3 | 146 HV1 | 85 OSOURCE
200 IC2 | 78 POP903 | 93 FISTDATE
202 IC4 | 77 POP902 | 90 RFA_6
199 IC1 | 364 ADATE_4 | 8 MAXRDATE
8 DOB | 95 ETH12 | 47 VOC2
2 OSOURCE | 8 DOB | 9 NUMPROM
464 LASTGIFT | 41 PUBPHOTO | 91 RFA_8
462 MAXRAMNT | 457 RAMNTALL | 94 RFA_12
386 RFA_3 | 475 RFA_2F | 15 HVP1
136 HHP2 | 476 RFA_2A | 86 LASTGIFT
135 HHP1 | 380 ADATE_20 | 88 RFA_3
387 RFA_4 | 202 IC4 | 2 RFA_11
389 RFA_6 | 201 IC3 | 45 WWIIVETS
391 RFA_8 | 479 MDMAUD_A | 42 RP3
196 MSA | 200 IC2 | 18 MINRDATE
385 RFA_2 | 199 IC1 | 67 POP90C2
466 FISTDATE | 464 LASTGIFT | 89 RFA_4
395 RFA_12 | 98 ETH15 | 30 LFC3
460 MINRAMNT | 2 OSOURCE | 5 RFA_7
394 RFA_11 | 385 RFA_2 | 87 MAXRAMNT
458 NGIFTALL | 462 MAXRAMNT | 13 NEXTDATE
475 RFA_2F | 145 DW9 | 7 RFA_9
390 RFA_7 | 367 ADATE_7 | 34 HU5
476 RFA_2A | 386 RFA_3 | 64 HC19
392 RFA_9 | 387 RFA_4 | 60 HUPA6
463 MAXRDATE | 409 MAXADATE | 66 HC7
197 ADI | 460 MINRAMNT | 69 HC8
410 NUMPROM | 80 POP90C2 | 38 DW1
198 DMA | 305 AFC3 | 26 HHAS3
459 CARDGIFT | 389 RFA_6 | 41 VC3
399 RFA_16 | 323 ANC11 | 63 HHAS1
1 ODATEDW | 303 AFC1 | 43 DW2
467 NEXTDATE | 478 MDMAUD_F | 35 HUR2
397 RFA_14 | 391 RFA_8 | 20 HVP6
173 HVP1 | 90 ETH7 | 3 NGIFTALL
401 RFA_18 | 44 MALEMILI | 19 HVP2
388 RFA_5 | 196 MSA | 16 RFA_18
461 MINRDATE | 234 TPE6 | 71 LFC2
174 HVP2 | 304 AFC2 | 37 LFC5
178 HVP6 | 178 HVP6 | 14 RFA_14
243 TPE13 | 9 NOEXCH | 31 HVP5
402 RFA_19 | 135 HHP1 | 51 LFC7
393 RFA_10 | 29 MBCRAFT | 27 HVP3
192 RP1 | 379 ADATE_19 | 50 HC4
408 CARDPROM | 143 DW7 | 39 HU1
400 RFA_17 | 458 NGIFTALL | 11 RFA_16
224 HHAS3 | 395 RFA_12 | 48 HU2
175 HVP3 | 136 HHP2 | 23 RP1
396 RFA_13 | 79 POP90C1 | 92 RFA_2
302 SEC5 | 459 CARDGIFT | 44 RFA_22
246 LFC3 | 392 RFA_9 | 59 LFC6
177 HVP5 | 390 RFA_7 | 21 RFA_19
412 NUMPRM12 | 394 RFA_11 | 25 RFA_17
293 EC4 | 252 LFC9 | 1 MINRAMNT
154 HU5 | 144 DW8 | 36 HVP4
180 HUR2 | 233 TPE5 | 46 HHN2
176 HVP4 | 376 ADATE_16 | 32 NUMPRM12
248 LFC5 | 466 FISTDATE | 33 EC4
137 DW1 | 373 ADATE_13 | 65 VOC1
150 HU1 | 172 ETHC6 | 49 AGE
303 AFC1 | 173 HVP1 | 52 RP2
311 VC3 | 1 ODATEDW | 62 HU4
194 RP3 | 463 MAXRDATE | 72 OEDC5
138 DW2 | 197 ADI | 10 CARDGIFT
405 RFA_22 | 86 ETH3 | 57 EC1
47 WWIIVETS | 93 ETH10 | 4 RFA_2F
169 ETHC3 | 268 EIC2 | 22 RFA_10
126 HHN2 | 198 DMA | 24 CARDPROM
335 VOC2 | 92 ETH9 | 61 RFA_24
3 TCODE | 388 RFA_5 | 17 RFA_5
143 DW7 | 212 IC14 | 53 AFC2
151 HU2 | 313 ANC1 | 68 EIC1
17 AGE | 345 HC9 | 58 LASTDATE
340 HC4 | 85 ETH2 | 29 SEC5
250 LFC7 | 411 CARDPM12 | 75 STATE
329 POBC2 | 81 POP90C3 | 28 RFA_13
134 MARR4 | 410 NUMPROM | 56 IC14
193 RP2 | 171 ETHC5 | 6 RFA_2A
304 AFC2 | 221 IC23 | 70 MALEMILI
221 IC23 | 412 NUMPRM12 | 55 OCC9
312 VC4 | 187 HUPA3 | 12 ODATEDW
262 OCC9 | 3 TCODE | 54 IC23
212 IC14 | 399 RFA_16 | 40 AFC1
145 DW9 | 332 LSC3 |
152 HU3 | 397 RFA_14 |
290 EC1 | 290 EC1 |
465 LASTDATE | 91 ETH8 |
84 ETH1 | 334 VOC1 |
249 LFC6 | 356 HC20 |
190 HUPA6 | 235 TPE7 |
407 RFA_24 | 174 HVP2 |
153 HU4 | 465 LASTDATE |
222 HHAS1 | 231 TPE3 |
355 HC19 | 339 HC3 |
334 VOC1 | 368 ADATE_8 |
131 MARR1 | 401 RFA_18 |
343 HC7 | 327 ANC15 |
125 HHN1 | 87 ETH4 |
80 POP90C2 | 35 MAGMALE |
267 EIC1 | 331 LSC2 |
344 HC8 | 28 HIT |
44 MALEMILI | 238 PEC1 |
245 LFC2 | 154 HU5 |
157 HHD3 | 190 HUPA6 |
287 OEDC5 | 177 HVP5 |

I then conducted an overlap (intersection) analysis of the attributes across the above evaluator methods and selected the highly ranked attributes across these evaluator results to form the final set (45 attributes) listed below.

Final Attributes | Attribute Description
IC5 | Per Capita Income
ZIP | Zipcode
POP901 | Number of Persons in donor’s neighborhood, as collected from the 1990 US Census.
AVGGIFT | Average dollar amount of gifts to date
HV1 | Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
HV2 | Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC4 | Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC3 | Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC2 | Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC1 | Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
RAMNTALL | Dollar amount of lifetime gifts to date
DOB | Date of birth of Donor
OSOURCE | Code indicating which mailing list the donor was originally acquired from
RFA_4 | Donor's RFA status as of 96TK promotion date from promotion history file
RFA_6 | Donor's RFA status as of 96LL promotion date from promotion history file
RFA_8 | Donor's RFA status as of 96GK promotion date from promotion history file
RFA_3 | Donor's RFA status as of 96NK promotion date
FISTDATE | Date of first gift from giving history file
RFA_12 | Donor's RFA status as of 96XK promotion date from promotion history file
MAXRAMNT | Dollar amount of largest gift to date from giving history file
RFA_2 | Donor's RFA status as of 97NK promotion date from promotion history file
RFA_9 | Donor's RFA status as of 96CC promotion date
RFA_7 | Donor's RFA status as of 96G1 promotion date
RFA_11 | Donor's RFA status as of 96X1 promotion date from promotion history file
RFA_2A | Donation Amount code for RFA_2
RFA_2F | Frequency code for RFA_2
HVP2 | Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
HVP6 | Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census.
RFA_18 | Donor's RFA status as of 95GK promotion date
WWIIVETS | % WWII Vets
HHAS3 | Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990 US Census.
HUR2 | Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census.
NGIFTALL | Number of lifetime gifts to date from promotion history file
HVP5 | Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census.
CARDGIFT | Number of lifetime gifts to card promotions to date
LASTGIFT | Dollar amount of most recent gift from giving history file
AFC1 | Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
AFC2 | Percent Males in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
IC23 | Percent Families w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
IC14 | Percent Households w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
EC1 | Median Years of School Completed by Adults 25+ in donor’s neighborhood, as collected from the 1990 US Census.
LASTDATE | Date associated with the most recent gift
VOC1 | Percent Households w/ 1+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census.
POP90C2 | Percent Population Outside Urbanized Area in donor’s neighborhood, as collected from the 1990 US Census.
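The overlap (intersection) analysis across the three evaluators can be sketched as follows. This is an illustrative sketch only: it uses just the first ten names from each ranking reported in Step-2, and the helper `overlap` and its `min_votes` parameter are hypothetical, since the exact cut-off used for the final selection is not stated in this report.

```python
from collections import Counter


def overlap(rankings, min_votes):
    """Return attributes that appear in at least `min_votes` of the rankings,
    in order of first appearance."""
    votes = Counter(name for ranking in rankings for name in set(ranking))
    seen, selected = set(), []
    for ranking in rankings:
        for name in ranking:
            if votes[name] >= min_votes and name not in seen:
                seen.add(name)
                selected.append(name)
    return selected


# First ten attributes from each evaluator's ranking in Step-2.
chi_squared = ["TARGET_D", "CONTROLN", "IC5", "ZIP", "POP901",
               "AVGGIFT", "HV1", "HV2", "POP903", "POP902"]
gain_ratio = ["TARGET_D", "ADATE_2", "ADATE_3", "CONTROLN", "IC5",
              "ZIP", "MDMAUD", "POP901", "AVGGIFT", "ADATE_6"]
info_gain = ["IC5", "ZIP", "POP901", "HV2", "HV1",
             "AVGGIFT", "IC4", "IC2", "IC3", "IC1"]

rankings = [chi_squared, gain_ratio, info_gain]
in_all_three = overlap(rankings, min_votes=3)
in_two_or_more = overlap(rankings, min_votes=2)
print("In all three top-10 lists:", in_all_three)
print("In at least two top-10 lists:", in_two_or_more)
```

Requiring agreement from all three evaluators is the strictest choice; lowering `min_votes` to 2 admits attributes such as HV1 and HV2 that one evaluator ranks lower.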
5.0 Models
5.1 Model-1 NaïveBayes 10 Fold Cross Validation
Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
Confusion Matrix:
Following was the observed confusion matrix for the model generated. Throughout this section, non-donor (class a = 0) is treated as the positive class.
a | b | Classified as
8,843 | 1,156 | a = 0
1,178 | 275 | b = 1
• TP - 8,843 non-donor instances were correctly identified as non-donors by the model.
• TN - 275 donor instances were correctly identified as donors by the model.
• FP - 1,178 donor instances were incorrectly identified as non-donors by the model.
• FN - 1,156 non-donor instances were incorrectly identified as donors by the model.
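The summary statistics implied by this confusion matrix can be derived directly from its four cells. The sketch below assumes the Weka layout (rows are actual classes, columns are predicted classes); the function name `confusion_metrics` and the metric names are illustrative choices, not part of Weka.

```python
def confusion_metrics(matrix):
    """Derive summary metrics from a 2x2 confusion matrix laid out as
    [[non_donor_as_non_donor, non_donor_as_donor],
     [donor_as_non_donor,     donor_as_donor]]."""
    (nn, nd), (dn, dd) = matrix
    total = nn + nd + dn + dd
    predicted_donors = nd + dd
    return {
        "accuracy": (nn + dd) / total,
        "non_donor_recall": nn / (nn + nd),
        "donor_recall": dd / (dn + dd),
        # Guard against models that never predict the donor class.
        "donor_precision": dd / predicted_donors if predicted_donors else 0.0,
    }


# Model-1: NaiveBayes with 10 fold cross validation.
model_1 = [[8843, 1156], [1178, 275]]
metrics = confusion_metrics(model_1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting donor recall alongside accuracy matters here because donors are the minority class, so a model can be accurate overall while finding few donors.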
5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Test Set
Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, the supplied test set option was used to run the model on the test dataset.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
3,265 | 458 | a = 0
374 | 110 | b = 1
• TP - 3,265 non-donor instances were correctly identified as non-donors by the model.
• TN - 110 donor instances were correctly identified as donors by the model.
• FP - 374 donor instances were incorrectly identified as non-donors by the model.
• FN - 458 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set
Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, the supplied test set option was used to run the model on the evaluation dataset.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
3,289 | 434 | a = 0
367 | 117 | b = 1
• TP - 3,289 non-donor instances were correctly identified as non-donors by the model.
• TN - 117 donor instances were correctly identified as donors by the model.
• FP - 367 donor instances were incorrectly identified as non-donors by the model.
• FN - 434 non-donor instances were incorrectly identified as donors by the model.
Note:
• The accuracy on the test dataset (3,375 of 4,207 instances correct, about 80.2%) and on the evaluation dataset (3,406 of 4,207 instances correct, about 81.0%) differ only slightly.
• Hence the model does not appear to overfit the data.
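The comparison in the note can be reproduced from the two confusion matrices above. This is a small sketch; the 1 percentage point tolerance used to call the two accuracies "close" is an assumption chosen for illustration, not a threshold taken from this report.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a 2x2 confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# NaiveBayes confusion matrices reported for the two held-out datasets.
test_set = [[3265, 458], [374, 110]]
evaluation_set = [[3289, 434], [367, 117]]

test_accuracy = accuracy(test_set)
evaluation_accuracy = accuracy(evaluation_set)
gap = abs(test_accuracy - evaluation_accuracy)

TOLERANCE = 0.01  # assumed: gaps under 1 percentage point count as "close"

print(f"test accuracy:       {test_accuracy:.4f}")
print(f"evaluation accuracy: {evaluation_accuracy:.4f}")
print(f"gap: {gap:.4f}, close: {gap < TOLERANCE}")
```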
Other Models
5.3 Model-3 J48Graft Training Set Model with Test Set
Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, the supplied test set option was used to run the model on the test dataset.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
3,723 | 0 | a = 0
484 | 0 | b = 1
• TP - 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
J48Graft Training Set Model with Evaluation Set
Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, the supplied test set option was used to run the model on the evaluation dataset.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
3,723 | 0 | a = 0
484 | 0 | b = 1
• TP - 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
Snapshot of Tree
The training set for this algorithm contained 11,452 instances, of which 9,999 were non-donors and 1,453 were donors. When the J48graft algorithm was run, the tree consisted only of a root node predicting the non-donor class, with no further splits beyond what is shown in the figure below.
Conclusion:
The model assigned every instance to the non-donor class, so the confusion matrix simply mirrors the numbers of non-donors and donors in the dataset. The model produced no True Negatives and no False Negatives.
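A model that predicts non-donor for every instance behaves like a majority-class baseline, which explains why its accuracy can look acceptable while it identifies no donors. The sketch below uses the class counts from the test set confusion matrix above; the function name `majority_baseline` is a hypothetical helper, not a Weka feature.

```python
def majority_baseline(non_donors, donors):
    """Accuracy and donor recall of a model that always predicts the larger class."""
    total = non_donors + donors
    predicts_donor = donors > non_donors
    correct = donors if predicts_donor else non_donors
    donor_recall = 1.0 if predicts_donor else 0.0
    return correct / total, donor_recall


# Class counts from the J48Graft test set confusion matrix.
baseline_accuracy, baseline_donor_recall = majority_baseline(3723, 484)
print(f"accuracy: {baseline_accuracy:.3f}, donor recall: {baseline_donor_recall:.1f}")
```

Any candidate model is worth keeping only if it improves on this baseline for the donor class, since identifying donors is the stated objective.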
5.4 Model-4 Decision Stump 10 Fold Cross Validation Model
Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
9,999 | 0 | a = 0
1,453 | 0 | b = 1
• TP - 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 1,453 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
5.5 Model-5 Decision Stump 10 Fold Cross Validation Model with Test Set
Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, the supplied test set option was used to run the model on the test dataset.
Confusion Matrix:
Following was the observed confusion matrix for the model generated.
a | b | Classified as
3,723 | 0 | a = 0
484 | 0 | b = 1
• TP - 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
  • 40.
  Decision Stump Cross Validation Model with Evaluation Set Statistics
  Algorithm: Decision Stump algorithm was used.
  Test Options:
  • Decision Stump algorithm with 10 fold cross validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  3723      0   a = 0
   484      0   b = 1
  • TP – 3723 non-donor instances were correctly identified as non-donors by the model.
  • TN – 0 donor instances were correctly identified as donors by the model.
  • FP – 484 donor instances were incorrectly identified as non-donors by the model.
  • FN – 0 non-donor instances were incorrectly identified as donors by the model.
  • 41.
  Conclusion:
  • The confusion matrix reproduces the non-donor and donor counts of the dataset.
  • Every instance was assigned to the non-donor class, so the model produced no True Negatives and no False Negatives.
  5.6 Model-6 OneR 10 Fold Cross Validation Model Statistics
  Algorithm: OneR algorithm was used.
  Test Options:
  • OneR algorithm with 10 fold cross validation settings.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  9571    428   a = 0
  1406     47   b = 1
  • TP – 9,571 non-donor instances were correctly identified as non-donors by the model.
  • TN – 47 donor instances were correctly identified as donors by the model.
  • FP – 1406 donor instances were incorrectly identified as non-donors by the model.
  • FN – 428 non-donor instances were incorrectly identified as donors by the model.
  • 42.
  5.7 Model-7 OneR 10 Fold Cross Validation Model with Test Set Statistics
  Algorithm: OneR algorithm was used.
  Test Options:
  • OneR algorithm with 10 fold cross validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  3564    159   a = 0
   455     29   b = 1
  • TP – 3564 non-donor instances were correctly identified as non-donors by the model.
  • TN – 29 donor instances were correctly identified as donors by the model.
  • FP – 455 donor instances were incorrectly identified as non-donors by the model.
  • FN – 159 non-donor instances were incorrectly identified as donors by the model.
  • 43.
  OneR 10 Fold Cross Validation Model with Evaluation Set Statistics
  Algorithm: OneR algorithm was used.
  Test Options:
  • OneR algorithm with 10 fold cross validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  3566    157   a = 0
   454     30   b = 1
  • TP – 3566 non-donor instances were correctly identified as non-donors by the model.
  • TN – 30 donor instances were correctly identified as donors by the model.
  • FP – 454 donor instances were incorrectly identified as non-donors by the model.
  • FN – 157 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
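  The comparison in the note above can be checked directly from the two OneR confusion matrices. The following is a minimal sketch, not part of the original Weka workflow; the helper name accuracy is an assumption introduced here for illustration, and accuracy is taken to be the share of instances on the diagonal of the matrix.

```python
def accuracy(matrix):
    """Share of correctly classified instances in a 2x2 confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows are actual classes (a = 0 non-donor, b = 1 donor), columns are predictions.
# Values are copied from the OneR test-set and evaluation-set matrices.
oner_test = [[3564, 159], [455, 29]]
oner_eval = [[3566, 157], [454, 30]]

acc_test = accuracy(oner_test)
acc_eval = accuracy(oner_eval)
gap = abs(acc_test - acc_eval)

print(f"test accuracy:       {acc_test:.4f}")
print(f"evaluation accuracy: {acc_eval:.4f}")
print(f"absolute difference: {gap:.4f}")
```

  The same helper can be applied to any other pair of test and evaluation matrices in this report. A small gap between two held-out sets shows that the estimate is stable; it says less about how the model compares with its training performance.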
  • 44.
  5.8 Model-8: ZeroR 10 Fold Cross Validation Model Statistics
  Algorithm: ZeroR algorithm was used.
  Test Options:
  • ZeroR algorithm with 10 fold cross validation settings.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  9999      0   a = 0
  1453      0   b = 1
  • TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
  • TN – 0 donor instances were correctly identified as donors by the model.
  • FP – 1453 donor instances were incorrectly identified as non-donors by the model.
  • FN – 0 non-donor instances were incorrectly identified as donors by the model.
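  The ZeroR matrix above is what a majority-class baseline produces. The following is a minimal sketch of that idea, assuming ZeroR simply predicts the most frequent class seen in training; it is not Weka's implementation, and the names zero_r and confusion_matrix are hypothetical helpers introduced for illustration. The label counts are taken from the matrix above.

```python
from collections import Counter


def zero_r(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda instance: majority


def confusion_matrix(actual, predicted):
    """2x2 matrix with actual classes as rows and predicted classes as columns."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# 9,999 non-donors (class 0) and 1,453 donors (class 1), as in the matrix above.
labels = [0] * 9999 + [1] * 1453
classify = zero_r(labels)
predictions = [classify(None) for _ in labels]  # attributes are ignored entirely

matrix = confusion_matrix(labels, predictions)
print(matrix)
```

  Because no attribute is consulted, the donor column of the matrix stays empty, which is why such a model is useful only as a baseline for the other algorithms.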
  • 45.
  5.9 Model-9 ZeroR 10 Fold Cross Validation Model with Test Set Statistics
  Algorithm: ZeroR algorithm was used to generate model.
  Test Options:
  • ZeroR algorithm with 10 fold cross validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  3723      0   a = 0
   484      0   b = 1
  • TP – 3723 non-donor instances were correctly identified as non-donors by the model.
  • TN – 0 donor instances were correctly identified as donors by the model.
  • FP – 484 donor instances were incorrectly identified as non-donors by the model.
  • FN – 0 non-donor instances were incorrectly identified as donors by the model.
  • 46.
  ZeroR 10 Fold Cross Validation Model with Evaluation Set Statistics
  Algorithm: ZeroR algorithm was used to generate model.
  Test Options:
  • ZeroR algorithm with 10 fold cross validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
     a      b   Classified as
  3723      0   a = 0
   484      0   b = 1
  • TP – 3723 non-donor instances were correctly identified as non-donors by the model.
  • TN – 0 donor instances were correctly identified as donors by the model.
  • FP – 484 donor instances were incorrectly identified as non-donors by the model.
  • FN – 0 non-donor instances were incorrectly identified as donors by the model.
  • 47.
  Conclusion:
  • The confusion matrix reproduces the non-donor and donor counts of the dataset.
  • Every instance was assigned to the non-donor class, so the model produced no True Negatives and no False Negatives.
  6.0 Different number of attributes but same number of records using NaiveBayes Model
  Note: The TARGET_B attribute has been used in all the models mentioned below.
  6.1 5 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1 were the attributes used to create the model.
  NaiveBayes 10 Fold Cross Validation Statistics for 5 attributes
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
  • 48.
     a      b   Classified as
  9795    204   a = 0
  1415     38   b = 1
  • TP – 9795 non-donor instances were correctly identified as non-donors by the model.
  • TN – 38 donor instances were correctly identified as donors by the model.
  • FP – 1415 donor instances were incorrectly identified as non-donors by the model.
  • FN – 204 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 5 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3649     74   a = 0
   473     11   b = 1
  • 49.
  • TP – 3649 non-donor instances were correctly identified as non-donors by the model.
  • TN – 11 donor instances were correctly identified as donors by the model.
  • FP – 473 donor instances were incorrectly identified as non-donors by the model.
  • FN – 74 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 5 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3651     72   a = 0
   471     13   b = 1
  • 50.
  • TP – 3651 non-donor instances were correctly identified as non-donors by the model.
  • TN – 13 donor instances were correctly identified as donors by the model.
  • FP – 471 donor instances were incorrectly identified as non-donors by the model.
  • FN – 72 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.2 10 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1 were the attributes used to create the model.
  NaiveBayes 10 Fold Cross Validation Statistics for 10 attributes
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  • 51.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  9320    679   a = 0
  1329    124   b = 1
  • TP – 9320 non-donor instances were correctly identified as non-donors by the model.
  • TN – 124 donor instances were correctly identified as donors by the model.
  • FP – 1329 donor instances were incorrectly identified as non-donors by the model.
  • FN – 679 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 10 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  • 52.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3472    251   a = 0
   438     46   b = 1
  • TP – 3472 non-donor instances were correctly identified as non-donors by the model.
  • TN – 46 donor instances were correctly identified as donors by the model.
  • FP – 438 donor instances were incorrectly identified as non-donors by the model.
  • FN – 251 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 10 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
  • 53.
     a      b   Classified as
  3465    258   a = 0
   439     45   b = 1
  • TP – 3465 non-donor instances were correctly identified as non-donors by the model.
  • TN – 45 donor instances were correctly identified as donors by the model.
  • FP – 439 donor instances were incorrectly identified as non-donors by the model.
  • FN – 258 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.3 15 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6 were the attributes used to create the model.
  • 54.
  NaiveBayes 10 Fold Cross Validation Statistics for 15 attributes
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  9493    506   a = 0
  1351    102   b = 1
  • TP – 9493 non-donor instances were correctly identified as non-donors by the model.
  • TN – 102 donor instances were correctly identified as donors by the model.
  • FP – 1351 donor instances were incorrectly identified as non-donors by the model.
  • FN – 506 non-donor instances were incorrectly identified as donors by the model.
  • 55.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 15 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3525    198   a = 0
   442     42   b = 1
  • TP – 3525 non-donor instances were correctly identified as non-donors by the model.
  • TN – 42 donor instances were correctly identified as donors by the model.
  • FP – 442 donor instances were incorrectly identified as non-donors by the model.
  • FN – 198 non-donor instances were incorrectly identified as donors by the model.
  • 56.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 15 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3514    209   a = 0
   445     39   b = 1
  • TP – 3514 non-donor instances were correctly identified as non-donors by the model.
  • TN – 39 donor instances were correctly identified as donors by the model.
  • FP – 445 donor instances were incorrectly identified as non-donors by the model.
  • FN – 209 non-donor instances were incorrectly identified as donors by the model.
  • 57.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.4 20 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT were the attributes used to create the model.
  NaiveBayes 10 Fold Cross Validation Statistics for 20 attributes
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  9405    594   a = 0
  1309    144   b = 1
  • 58.
  • TP – 9405 non-donor instances were correctly identified as non-donors by the model.
  • TN – 144 donor instances were correctly identified as donors by the model.
  • FP – 1309 donor instances were incorrectly identified as non-donors by the model.
  • FN – 594 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 20 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  • 59.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3477    246   a = 0
   431     53   b = 1
  • TP – 3477 non-donor instances were correctly identified as non-donors by the model.
  • TN – 53 donor instances were correctly identified as donors by the model.
  • FP – 431 donor instances were incorrectly identified as non-donors by the model.
  • FN – 246 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 20 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  • 60.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3484    239   a = 0
   431     53   b = 1
  • TP – 3484 non-donor instances were correctly identified as non-donors by the model.
  • TN – 53 donor instances were correctly identified as donors by the model.
  • FP – 431 donor instances were incorrectly identified as non-donors by the model.
  • FN – 239 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.5 25 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9, RFA_7, RFA_11, RFA_2A were the attributes used to create the model.
  • 61.
  NaiveBayes 10 Fold Cross Validation Statistics for 25 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  9066    933   a = 0
  1239    214   b = 1
  • TP – 9066 non-donor instances were correctly identified as non-donors by the model.
  • TN – 214 donor instances were correctly identified as donors by the model.
  • FP – 1239 donor instances were incorrectly identified as non-donors by the model.
  • FN – 933 non-donor instances were incorrectly identified as donors by the model.
  • 62.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 25 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3350    373   a = 0
   408     76   b = 1
  • TP – 3350 non-donor instances were correctly identified as non-donors by the model.
  • TN – 76 donor instances were correctly identified as donors by the model.
  • FP – 408 donor instances were incorrectly identified as non-donors by the model.
  • FN – 373 non-donor instances were incorrectly identified as donors by the model.
  • 63.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 25 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3380    343   a = 0
   386     98   b = 1
  • TP – 3380 non-donor instances were correctly identified as non-donors by the model.
  • TN – 98 donor instances were correctly identified as donors by the model.
  • FP – 386 donor instances were incorrectly identified as non-donors by the model.
  • FN – 343 non-donor instances were incorrectly identified as donors by the model.
  • 64.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.5 30 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9, RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS were the attributes used to create the model.
  NaiveBayes 10 Fold Cross Validation Statistics for 30 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  8970   1029   a = 0
  1213    240   b = 1
  • 65.
  • TP – 8970 non-donor instances were correctly identified as non-donors by the model.
  • TN – 240 donor instances were correctly identified as donors by the model.
  • FP – 1213 donor instances were incorrectly identified as non-donors by the model.
  • FN – 1029 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 30 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3305    418   a = 0
   391     93   b = 1
  • 66.
  • TP – 3305 non-donor instances were correctly identified as non-donors by the model.
  • TN – 93 donor instances were correctly identified as donors by the model.
  • FP – 391 donor instances were incorrectly identified as non-donors by the model.
  • FN – 418 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 30 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3332    391   a = 0
   377    107   b = 1
  • 67.
  • TP – 3332 non-donor instances were correctly identified as non-donors by the model.
  • TN – 107 donor instances were correctly identified as donors by the model.
  • FP – 377 donor instances were incorrectly identified as non-donors by the model.
  • FN – 391 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.6 35 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9, RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2, NGIFTALL, HVP5, CARDGIFT were the attributes used to create the model.
  NaiveBayes 10 Fold Cross Validation Statistics with 35 attributes
  • 68.
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  8863   1136   a = 0
  1187    266   b = 1
  • TP – 8863 non-donor instances were correctly identified as non-donors by the model.
  • TN – 266 donor instances were correctly identified as donors by the model.
  • FP – 1187 donor instances were incorrectly identified as non-donors by the model.
  • FN – 1136 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics with 35 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • 69.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3274    449   a = 0
   380    104   b = 1
  • TP – 3274 non-donor instances were correctly identified as non-donors by the model.
  • TN – 104 donor instances were correctly identified as donors by the model.
  • FP – 380 donor instances were incorrectly identified as non-donors by the model.
  • FN – 449 non-donor instances were incorrectly identified as donors by the model.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics with 35 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  • 70.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3296    427   a = 0
   372    112   b = 1
  • TP – 3296 non-donor instances were correctly identified as non-donors by the model.
  • TN – 112 donor instances were correctly identified as donors by the model.
  • FP – 372 donor instances were incorrectly identified as non-donors by the model.
  • FN – 427 non-donor instances were incorrectly identified as donors by the model.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
  6.7 40 Attributes
  IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE, RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9, RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2, NGIFTALL, HVP5, CARDGIFT, LASTGIFT, AFC1, AFC2, IC23, IC14 were the attributes used to create the model.
  • 71.
  NaiveBayes 10 Fold Cross Validation Statistics with 40 attributes
  Algorithm: NaiveBayes algorithm was used to generate model.
  Test Options: NaiveBayes algorithm with 10-fold cross-validation settings.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  8835   1164   a = 0
  1185    268   b = 1
  • TP – 8835 non-donor instances were correctly identified as non-donors by the model.
  • TN – 268 donor instances were correctly identified as donors by the model.
  • FP – 1185 donor instances were incorrectly identified as non-donors by the model.
  • FN – 1164 non-donor instances were incorrectly identified as donors by the model.
  • 72.
  NaiveBayes 10 Fold Cross Validation with Test Set Statistics with 40 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on test dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3259    464   a = 0
   374    110   b = 1
  • TP – 3259 non-donor instances were correctly identified as non-donors by the model.
  • TN – 110 donor instances were correctly identified as donors by the model.
  • FP – 374 donor instances were incorrectly identified as non-donors by the model.
  • FN – 464 non-donor instances were incorrectly identified as donors by the model.
  • 73.
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics with 40 attributes
  Algorithm: NaiveBayes algorithm was used.
  Test Options:
  • NaiveBayes algorithm with 10-fold cross-validation settings.
  • In addition, supplied test set was used for running the model on evaluation dataset.
  Confusion Matrix: Following was the observed confusion matrix for the model generated.
     a      b   Classified as
  3290    433   a = 0
   368    116   b = 1
  • TP – 3290 non-donor instances were correctly identified as non-donors by the model.
  • TN – 116 donor instances were correctly identified as donors by the model.
  • FP – 368 donor instances were incorrectly identified as non-donors by the model.
  • FN – 433 non-donor instances were incorrectly identified as donors by the model.
  • 74.
  Note:
  • The accuracy varies only slightly between the test dataset and the evaluation dataset.
  • Hence the model does not appear to overfit the data.
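  To see how the attribute count affected the NaiveBayes models of Section 6, the donor row and donor column of each 10-fold cross-validation matrix can be put side by side. The following is a minimal sketch rather than part of the original analysis; the dictionary layout and the names cv_counts, donor_recall and donor_precision are assumptions made for illustration. The counts are copied from the cross-validation matrices above.

```python
# attribute count -> (donors predicted as donors,
#                     donors predicted as non-donors,
#                     non-donors predicted as donors)
cv_counts = {
    5: (38, 1415, 204),
    10: (124, 1329, 679),
    15: (102, 1351, 506),
    20: (144, 1309, 594),
    25: (214, 1239, 933),
    30: (240, 1213, 1029),
    35: (266, 1187, 1136),
    40: (268, 1185, 1164),
}

donor_recall = {}
donor_precision = {}
for n_attrs, (found, missed, false_alarms) in cv_counts.items():
    # Recall: share of actual donors that the model found.
    donor_recall[n_attrs] = found / (found + missed)
    # Precision: share of predicted donors that really are donors.
    donor_precision[n_attrs] = found / (found + false_alarms)

for n_attrs in sorted(cv_counts):
    print(f"{n_attrs:>2} attributes: "
          f"donor recall {donor_recall[n_attrs]:.3f}, "
          f"donor precision {donor_precision[n_attrs]:.3f}")
```

  Under these counts, donor recall generally rises as attributes are added, although not at every step (15 attributes find fewer donors than 10), while the number of non-donors flagged as donors rises as well.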
  • 75.
  7.0 Performance Metrics
  7.1 Calculations for Each Model – Precision and Sensitivity
  \[ \text{Precision (PPV)} = \frac{TP}{TP + FP} \qquad \text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \]
  For each model, the values are given per class (0 = non-donor, 1 = donor), together with the corresponding row of the confusion matrix (a, b).
  Model Number and Description                          Class      TP      FP      FN     PPV       Recall   Matrix row (a  b)
  5.1 Model-1 NaïveBayes Ten Fold Cross Validation Statistics
                                                        0       8,843   1,178   1,156   0.882     0.884    8843  1156
                                                        1         275   1,156   1,178   0.192     0.189    1178   275
  5.2 Model-2 NaiveBayes 10 Fold Cross Validation with Test Set Statistics
                                                        0       3,265     374     458   0.897     0.877    3265   458
                                                        1         110     458     374   0.194     0.227     374   110
  NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics
                                                        0       3,289     367     434   0.900     0.883    3289   434
                                                        1         117     434     367   0.212     0.242     367   117
  5.3 Model-3 J48Graft Training Set Model with Test Set Statistics
                                                        0       3,723     484       0   0.885     1.000    3723     0
                                                        1           0       0     484   #DIV/0!   0.000     484     0
  J48Graft Training Set Model with Evaluation Set Statistics
                                                        0       3,723     484       0   0.885     1.000    3723     0
                                                        1           0       0     484   #DIV/0!   0.000     484     0
  5.4 Model-4 Decision Stump Cross Validation Model Statistics
                                                        0       9,999   1,453       0   0.873     1.000    9999     0
                                                        1           0       0   1,453   #DIV/0!   0.000    1453     0
  5.5 Model-5 Decision Stump Cross Validation Model with Test Set Statistics
                                                        0       9,999   1,453       0   0.873     1.000    9999     0
                                                        1           0       0   1,453   #DIV/0!   0.000    1453     0
  Decision Stump Cross Validation Model with Evaluation Set Statistics
                                                        0       9,999   1,453       0   0.873     1.000    9999     0
                                                        1           0       0   1,453   #DIV/0!   0.000    1453     0
  5.6 Model-6 OneR Cross Validation Model Statistics
                                                        0       9,571   1,406     428   0.872     0.957    9571   428
                                                        1          47     428   1,406   0.099     0.032    1406    47
  5.7 Model-7 OneR Cross Validation Model with Test Set Statistics
                                                        0       3,564     455     159   0.887     0.957    3564   159
                                                        1          29     159     455   0.154     0.060     455    29
  OneR Cross Validation Model with Evaluation Set Statistics
                                                        0       3,566     454     157   0.887     0.958    3566   157
                                                        1          30     157     454   0.160     0.062     454    30
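  The two formulas on this slide can be sketched in a few lines, which also shows where the #DIV/0! entries come from: when a model never predicts a class, TP + FP is zero for that class and precision is undefined. This is a minimal illustration, not the spreadsheet used for the report; the function names ppv and recall and the choice to return None for an undefined value are assumptions made here.

```python
def ppv(tp, fp):
    """Precision (positive predictive value): TP / (TP + FP), or None if undefined."""
    denominator = tp + fp
    return tp / denominator if denominator else None


def recall(tp, fn):
    """Sensitivity (recall): TP / (TP + FN), or None if undefined."""
    denominator = tp + fn
    return tp / denominator if denominator else None


# Per-class counts from the matrix with rows 9999 0 / 1453 0
# (every instance classified as non-donor).
non_donor = {"tp": 9999, "fp": 1453, "fn": 0}
donor = {"tp": 0, "fp": 0, "fn": 1453}

print("class 0 PPV:   ", ppv(non_donor["tp"], non_donor["fp"]))
print("class 0 recall:", recall(non_donor["tp"], non_donor["fn"]))
print("class 1 PPV:   ", ppv(donor["tp"], donor["fp"]))  # None: no donor predictions
print("class 1 recall:", recall(donor["tp"], donor["fn"]))
```

  Returning None keeps the undefined case explicit instead of silently reporting 0, which would suggest that donor predictions were made and were all wrong.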
  • 76.

| Model - Number | Model Description | Class | Confusion Matrix: classified as a | Confusion Matrix: classified as b | TP | FP | FN | PPV | Recall |
|---|---|---|---|---|---|---|---|---|---|
| 5.8 Model-8 | ZeroR Cross Validation Model Statistics | a = 0 | 9,999 | 0 | 9,999 | 1,453 | 0 | 0.873 | 1.000 |
| | | b = 1 | 1,453 | 0 | 0 | 0 | 1,453 | #DIV/0! | 0.000 |
| 5.9 Model-9 | ZeroR Cross Validation Model with Test Set Statistics | a = 0 | 3,723 | 0 | 3,723 | 484 | 0 | 0.885 | 1.000 |
| | | b = 1 | 484 | 0 | 0 | 0 | 484 | #DIV/0! | 0.000 |
| | ZeroR Cross Validation Model with Evaluation Set Statistics | a = 0 | 3,723 | 0 | 3,723 | 484 | 0 | 0.885 | 1.000 |
| | | b = 1 | 484 | 0 | 0 | 0 | 484 | #DIV/0! | 0.000 |
| 5.10 Model-10 | NaiveBayes Cross Validation Model with 5 Attributes | a = 0 | 9,795 | 204 | 9,795 | 1,415 | 204 | 0.874 | 0.980 |
| | | b = 1 | 1,415 | 38 | 38 | 204 | 1,415 | 0.157 | 0.026 |
| 5.11 Model-11 | NaiveBayes Cross Validation Model with Test Set Statistics with 5 Attributes | a = 0 | 3,649 | 74 | 3,649 | 473 | 74 | 0.885 | 0.980 |
| | | b = 1 | 473 | 11 | 11 | 74 | 473 | 0.129 | 0.023 |
| | NaiveBayes Cross Validation Model with Evaluation Set Statistics with 5 Attributes | a = 0 | 3,651 | 72 | 3,651 | 471 | 72 | 0.886 | 0.981 |
| | | b = 1 | 471 | 13 | 13 | 72 | 471 | 0.153 | 0.027 |
| 5.12 Model-12 | NaiveBayes Cross Validation Model with 10 Attributes | a = 0 | 9,320 | 679 | 9,320 | 1,329 | 679 | 0.875 | 0.932 |
| | | b = 1 | 1,329 | 124 | 124 | 679 | 1,329 | 0.154 | 0.085 |
| 5.13 Model-13 | NaiveBayes Cross Validation Model with Test Set with 10 Attributes | a = 0 | 3,472 | 251 | 3,472 | 438 | 251 | 0.888 | 0.933 |
| | | b = 1 | 438 | 46 | 46 | 251 | 438 | 0.155 | 0.095 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 10 Attributes | a = 0 | 3,465 | 258 | 3,465 | 439 | 258 | 0.888 | 0.931 |
| | | b = 1 | 439 | 46 | 46 | 258 | 439 | 0.151 | 0.095 |
| 5.14 Model-14 | NaiveBayes Cross Validation Model with 15 Attributes | a = 0 | 9,493 | 506 | 9,493 | 1,351 | 506 | 0.875 | 0.949 |
| | | b = 1 | 1,351 | 102 | 102 | 506 | 1,351 | 0.168 | 0.070 |
| 5.15 Model-15 | NaiveBayes Cross Validation Model with Test Set with 15 Attributes | a = 0 | 3,525 | 198 | 3,525 | 442 | 198 | 0.889 | 0.947 |
| | | b = 1 | 442 | 42 | 42 | 198 | 442 | 0.175 | 0.087 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 15 Attributes | a = 0 | 3,514 | 209 | 3,514 | 445 | 209 | 0.888 | 0.944 |
| | | b = 1 | 445 | 39 | 39 | 209 | 445 | 0.157 | 0.081 |
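The #DIV/0! entries on this page arise where no instance is classified as class 1 (the first three matrices, which appear to be the ZeroR rows), so TP + FP is zero for that class and the PPV is undefined. A small sketch of one way to handle this case follows; the helper `safe_ratio` and the choice to return `None` are assumptions for illustration, not the spreadsheet logic used in the report.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if denominator == 0:
        return None
    return numerator / denominator


# Confusion matrix with every instance classified as class 0:
# a = 0 -> 9999 0, b = 1 -> 1453 0
matrix = [[9999, 0], [1453, 0]]

tp1 = matrix[1][1]  # actual 1, classified as 1
fp1 = matrix[0][1]  # actual 0, classified as 1
fn1 = matrix[1][0]  # actual 1, classified as 0

ppv_class_1 = safe_ratio(tp1, tp1 + fp1)     # undefined: 0 / 0
recall_class_1 = safe_ratio(tp1, tp1 + fn1)  # 0 / 1453

print(ppv_class_1, recall_class_1)
```

Returning `None` keeps the undefined PPV distinct from a genuine zero such as the class 1 recall, which matters when the metrics are later compared across models.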
  • 77.

| Model - Number | Model Description | Class | Confusion Matrix: classified as a | Confusion Matrix: classified as b | TP | FP | FN | PPV | Recall |
|---|---|---|---|---|---|---|---|---|---|
| 5.16 Model-16 | NaiveBayes Cross Validation Model with 20 Attributes | a = 0 | 9,405 | 594 | 9,405 | 1,309 | 594 | 0.878 | 0.941 |
| | | b = 1 | 1,309 | 144 | 144 | 594 | 1,309 | 0.195 | 0.099 |
| 5.17 Model-17 | NaiveBayes Cross Validation Model with Test Set with 20 Attributes | a = 0 | 3,477 | 246 | 3,477 | 431 | 246 | 0.890 | 0.934 |
| | | b = 1 | 431 | 53 | 53 | 246 | 431 | 0.177 | 0.110 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 20 Attributes | a = 0 | 3,484 | 239 | 3,484 | 431 | 239 | 0.890 | 0.936 |
| | | b = 1 | 431 | 53 | 53 | 239 | 431 | 0.182 | 0.110 |
| 5.18 Model-18 | NaiveBayes Cross Validation Model with 25 Attributes | a = 0 | 9,066 | 933 | 9,066 | 1,239 | 933 | 0.880 | 0.907 |
| | | b = 1 | 1,239 | 214 | 214 | 933 | 1,239 | 0.187 | 0.147 |
| 5.19 Model-19 | NaiveBayes Cross Validation Model with Test Set with 25 Attributes | a = 0 | 3,350 | 373 | 3,350 | 408 | 373 | 0.891 | 0.900 |
| | | b = 1 | 408 | 76 | 76 | 373 | 408 | 0.169 | 0.157 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 25 Attributes | a = 0 | 3,380 | 343 | 3,380 | 386 | 343 | 0.898 | 0.908 |
| | | b = 1 | 386 | 98 | 98 | 343 | 386 | 0.222 | 0.202 |
| 5.20 Model-20 | NaiveBayes Cross Validation Model with 30 Attributes | a = 0 | 8,970 | 1,029 | 8,970 | 1,213 | 1,029 | 0.881 | 0.897 |
| | | b = 1 | 1,213 | 240 | 240 | 1,029 | 1,213 | 0.189 | 0.165 |
| 5.21 Model-21 | NaiveBayes Cross Validation Model with Test Set with 30 Attributes | a = 0 | 3,305 | 418 | 3,305 | 391 | 418 | 0.894 | 0.888 |
| | | b = 1 | 391 | 93 | 93 | 418 | 391 | 0.182 | 0.192 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 30 Attributes | a = 0 | 3,332 | 391 | 3,332 | 377 | 391 | 0.898 | 0.895 |
| | | b = 1 | 377 | 107 | 107 | 391 | 377 | 0.215 | 0.221 |
| 5.22 Model-22 | NaiveBayes Cross Validation Model with 35 Attributes | a = 0 | 8,863 | 1,136 | 8,863 | 1,187 | 1,136 | 0.882 | 0.886 |
| | | b = 1 | 1,187 | 266 | 266 | 1,136 | 1,187 | 0.190 | 0.183 |
| 5.23 Model-23 | NaiveBayes Cross Validation Model with Test Set with 35 Attributes | a = 0 | 3,274 | 449 | 3,274 | 380 | 449 | 0.896 | 0.879 |
| | | b = 1 | 380 | 104 | 104 | 449 | 380 | 0.188 | 0.215 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 35 Attributes | a = 0 | 3,296 | 427 | 3,296 | 372 | 427 | 0.899 | 0.885 |
| | | b = 1 | 372 | 112 | 112 | 427 | 372 | 0.208 | 0.231 |
| 5.24 Model-24 | NaiveBayes Cross Validation Model with 40 Attributes | a = 0 | 8,835 | 1,164 | 8,835 | 1,185 | 1,164 | 0.882 | 0.884 |
| | | b = 1 | 1,185 | 268 | 268 | 1,164 | 1,185 | 0.187 | 0.184 |
| 5.25 Model-25 | NaiveBayes Cross Validation Model with Test Set with 40 Attributes | a = 0 | 3,259 | 464 | 3,259 | 374 | 464 | 0.897 | 0.875 |
| | | b = 1 | 374 | 110 | 110 | 464 | 374 | 0.192 | 0.227 |
| | NaiveBayes Cross Validation Model with Evaluation Set with 40 Attributes | a = 0 | 3,290 | 433 | 3,290 | 368 | 433 | 0.899 | 0.884 |
| | | b = 1 | 368 | 116 | 116 | 433 | 368 | 0.211 | 0.240 |
  • 78. 7.2 Calculations for Each Model – Specificity and NPV

| Model - Number | Model Description | Class | Confusion Matrix: classified as a | Confusion Matrix: classified as b | TN | FP | FN | Specificity | NPV |
|---|---|---|---|---|---|---|---|---|---|
| 5.1 Model-1 | NaiveBayes Ten Fold Cross Validation Statistics | a = 0 | 8,843 | 1,156 | 275 | 1,178 | 1,156 | 0.189 | 0.192 |
| | | b = 1 | 1,178 | 275 | 8,843 | 1,156 | 1,178 | 0.884 | 0.882 |
| 5.2 Model-2 | NaiveBayes 10 Fold Cross Validation with Test Set Statistics | a = 0 | 3,265 | 458 | 110 | 374 | 458 | 0.227 | 0.194 |
| | | b = 1 | 374 | 110 | 3,265 | 458 | 374 | 0.877 | 0.897 |
| | NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics | a = 0 | 3,289 | 434 | 117 | 367 | 434 | 0.242 | 0.212 |
| | | b = 1 | 367 | 117 | 3,289 | 434 | 367 | 0.883 | 0.900 |
| 5.3 Model-3 | J48Graft Training Set Model with Test Set Statistics | a = 0 | 3,723 | 0 | 0 | 484 | 0 | 0.000 | #DIV/0! |
| | | b = 1 | 484 | 0 | 3,723 | 0 | 484 | 1.000 | 0.885 |
| | J48Graft Training Set Model with Evaluation Set Statistics | a = 0 | 3,723 | 0 | 0 | 484 | 0 | 0.000 | #DIV/0! |
| | | b = 1 | 484 | 0 | 3,723 | 0 | 484 | 1.000 | 0.885 |
| 5.4 Model-4 | Decision Stump Cross Validation Model Statistics | a = 0 | 9,999 | 0 | 0 | 1,453 | 0 | 0.000 | #DIV/0! |
| | | b = 1 | 1,453 | 0 | 9,999 | 0 | 1,453 | 1.000 | 0.873 |
| 5.5 Model-5 | Decision Stump Cross Validation Model with Test Set Statistics | a = 0 | 9,999 | 0 | 0 | 1,453 | 0 | 0.000 | #DIV/0! |
| | | b = 1 | 1,453 | 0 | 9,999 | 0 | 1,453 | 1.000 | 0.873 |
| | Decision Stump Cross Validation Model with Evaluation Set Statistics | a = 0 | 9,999 | 0 | 0 | 1,453 | 0 | 0.000 | #DIV/0! |
| | | b = 1 | 1,453 | 0 | 9,999 | 0 | 1,453 | 1.000 | 0.873 |
| 5.6 Model-6 | OneR Cross Validation Model Statistics | a = 0 | 9,571 | 428 | 47 | 1,406 | 428 | 0.032 | 0.099 |
| | | b = 1 | 1,406 | 47 | 9,571 | 428 | 1,406 | 0.957 | 0.872 |
| 5.7 Model-7 | OneR Cross Validation Model with Test Set Statistics | a = 0 | 3,564 | 159 | 29 | 455 | 159 | 0.060 | 0.154 |
| | | b = 1 | 455 | 29 | 3,564 | 159 | 455 | 0.957 | 0.887 |
| | OneR Cross Validation Model with Evaluation Set Statistics | a = 0 | 3,566 | 157 | 30 | 454 | 157 | 0.062 | 0.160 |
| | | b = 1 | 454 | 30 | 3,566 | 157 | 454 | 0.958 | 0.887 |

$$\text{NPV} = \frac{TN}{TN + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
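The specificity and NPV formulas can be checked the same way. The sketch below treats each class in turn as the positive class and reads TN, FP and FN off a 2x2 confusion matrix in the Weka layout; the function name `specificity_and_npv` is a hypothetical helper, and the input is the first matrix on this page (8843 / 1156 / 1178 / 275, by the apparent row order the NaiveBayes cross-validation model).

```python
def specificity_and_npv(matrix):
    """Per-class specificity and NPV from a 2x2 confusion matrix.

    matrix[i][j] is the number of instances of actual class i that were
    classified as class j (hypothetical helper, not from the report).
    """
    results = {}
    for cls in (0, 1):
        other = 1 - cls
        tn = matrix[other][other]  # actual other class, classified as other
        fp = matrix[other][cls]    # actual other class, classified as cls
        fn = matrix[cls][other]    # actual cls, classified as other class
        results[cls] = {
            "specificity": round(tn / (tn + fp), 3),
            "npv": round(tn / (tn + fn), 3),
        }
    return results


# First confusion matrix on this page: a = 0 -> 8843 1156, b = 1 -> 1178 275
first_matrix = [[8843, 1156], [1178, 275]]
spec_npv = specificity_and_npv(first_matrix)
print(spec_npv)
# {0: {'specificity': 0.189, 'npv': 0.192}, 1: {'specificity': 0.884, 'npv': 0.882}}
```

Because the two classes swap roles, the specificity of one class equals the recall of the other, which is why these values mirror the Recall figures reported for the same matrices.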