This document describes the analysis of a dataset containing attributes of donors to identify probable future donors. It discusses the attributes in the dataset and the data preprocessing steps, including the handling of missing data. It then describes the methodology used for attribute selection, which combined a logic-based selection of attributes with validation using attribute selection methods in Weka. Various models were built using algorithms such as Naive Bayes, J48, Decision Stump, OneR, and ZeroR to predict probable donors. The models' performance was evaluated and compared using metrics such as accuracy, precision, and F-measure. Areas of error in the models were also analyzed.
Classification and Clustering Analysis using Weka Ishan Awadhesh
This Term Paper demonstrates classification and clustering analysis on bank data using Weka. Classification analysis is used to determine whether a particular customer would purchase a Personal Equity Plan, while clustering analysis is used to analyze the behavior of various customer segments.
This document demonstrates clustering and regression techniques using the Weka data mining software. It shows how Weka can be used to cluster 600 bank customer records into 6 groups based on attributes like age, income, family status, etc. It also uses Weka to create a linear regression model to predict house prices based on attributes like size, number of bedrooms, lot size, and more. Overall, the document shows how Weka allows easy implementation of common data mining algorithms and visualization of results.
This document provides an overview of Google Cloud Platform (GCP) services. It discusses computing services like App Engine and Compute Engine for hosting applications. It covers storage options like Cloud Storage, Cloud Datastore and Cloud SQL. It also mentions big data services like BigQuery and machine learning services like Prediction API. The document provides brief descriptions of each service and highlights their key features. It includes code samples for using Prediction API to train a model and make predictions on new data.
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...cscpconf
In today’s world, a gigantic amount of data is available in science, industry, business, and many other areas. This data can provide valuable information that management can use to make important decisions, but the problem is how to find that valuable information. The answer is data mining. Data mining is a popular topic among researchers, and much in the field remains to be explored. This paper focuses on a fundamental concept of data mining: classification techniques. The BayesNet, NaiveBayes, NaiveBayes Updateable, Multilayer Perceptron, Voted Perceptron, and J48 classifiers are used to classify a dataset. The performance of these classifiers is analyzed using Mean Absolute Error, Root Mean Squared Error, and the time taken to build each model, and the results are presented both statistically and graphically. The WEKA data mining tool is used for this purpose.
Classification and Prediction Based Data Mining Algorithm in Weka ToolIRJET Journal
The document discusses different classification algorithms in the Weka data mining tool for predicting data, including J48, SMO, Naive Bayes, REPTree, and Multilayer Perceptron. It analyzes their performance on a housing dataset, finding that the Multilayer Perceptron had the highest accuracy at 95.45%. The algorithms are evaluated using metrics like accuracy, precision, and recall calculated from a confusion matrix. The Multilayer Perceptron is identified as the best performing algorithm for this classification task.
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdfNeha Singh
In 2023, aspiring data analysts can expect comprehensive data analytics course curriculums covering essential topics like statistical analysis, data visualization, machine learning, and big data processing. To prepare for the course, brushing up on basic mathematics, programming, and data handling skills would be beneficial.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
This document presents an analysis of a dataset containing 200,000 mortgage loan applications to predict the interest rate spread. Key findings include:
- The most important predictive features were loan amount, loan type, property type, preapproval status, loan purpose, median family income, applicant income, and minority population percentage.
- A boosted decision tree regression model achieved the highest prediction accuracy with an R-squared of 0.77 on test data, outperforming linear regression and random forest models.
- The analysis included data exploration of relationships between numerical features, feature selection, model training, tuning, and validation.
The objective of this investigation is to predict a customer's purchase decision on a car model based on six given features: buying price, maintenance price, number of doors, seating capacity, luggage space, and safety.
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET Journal
This document discusses classification techniques for data mining. It provides an overview of common classification algorithms including decision trees, k-nearest neighbors (kNN), and Naive Bayes. Decision trees use a top-down approach to classify data based on attribute tests at each node. kNN identifies the k nearest training examples to classify new data points. Naive Bayes assumes independence between attributes and uses Bayes' theorem for classification. The document also discusses how these techniques are used for data cleaning, integration, transformation and knowledge representation in the data mining process.
This document describes a data warehouse and business intelligence project for analyzing Starbucks store data. It discusses extracting data from various structured, semi-structured, and unstructured sources, transforming the data using SQL and R, and loading it into a star schema data warehouse with fact and dimension tables. The data warehouse is then used for business queries and analysis in Tableau, with case studies examining city revenue, visitor and beverage sales by city, and city ratings based on food and beverage counts. The analysis finds that New York City generally has the highest revenue, visitor counts, and ratings.
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
This document describes a study that used machine learning to develop an e-healthcare monitoring system for diagnosing heart disease. The researchers used a modified support vector machine (SVM) algorithm to analyze cardiovascular disease data and predict whether patients have heart disease. They evaluated the performance of their modified SVM against other machine learning models like random forest, gradient boosting, and AdaBoost. The modified SVM achieved the highest accuracy of 88.8%, outperforming the other models. The study concludes that machine learning and deep learning methods can help enable early detection, classification, and prediction of cardiovascular disease.
This document discusses using machine learning to predict stock prices based on historical data. Specifically, it uses a random forest regression model to predict stock prices for the NSE nifty 50 index over the next year, month, and five days. It collects stock price data, preprocesses the data by feature selection, scaling, and splitting into train and test sets. It then trains the random forest regressor on the training set and evaluates the model's performance on the test set using various metrics like RMSE, MAE, and R-squared. The model is able to generate predicted stock prices and identify buy, sell, and hold prices for investors over different time periods.
The document provides an overview of data analysis. It discusses the core components of data analysis, including descriptive, diagnostic, predictive, prescriptive, and cognitive analysis, and describes the tasks of a data analyst: preparing data, modeling it, visualizing results, analyzing the visualizations, and managing the information. Descriptive statistics, Excel, and Power BI are highlighted as important tools for data analysts. The document is an introductory lecture on data analysis concepts and the data analyst's job.
The document discusses the six main steps for building machine learning models: 1) data access and collection, 2) data preparation and exploration, 3) model build and train, 4) model evaluation, 5) model deployment, and 6) model monitoring. It describes each step in detail, including exploring and cleaning the data, choosing a model type, training the model, evaluating model performance on test data, deploying the trained model, and monitoring the model after deployment. The process is iterative, with steps like data preparation and model training often repeated to improve the model.
This document is a machine learning class assignment submitted by Trushita Redij to their supervisor Abhishek Kaushik at Dublin Business School. The assignment discusses data preprocessing techniques, decision trees, the Chinese Restaurant algorithm, and building supervised learning models. Specifically, linear regression and KNN classification models are implemented on population data from Ireland to predict total population and classify countries.
The document discusses using k-means clustering on a life insurance customer dataset to predict customer preferences. It first provides background on k-means clustering and its application in data mining. It then describes applying k-means to a dataset of 14,180 customer records with 10 attributes from an Albanian insurance company. This identified 5 clusters characterizing different customer segments based on attributes like gender, age, and preferred insurance product type and amount. The results help the insurance company better understand customer preferences to improve performance.
Normalization (NORM.docx) drennanmicah
Normalization
Charles Williams
CS352 Unit 3 IP
Professor Jeffery Karlberg
1/26/2019
Table of Contents: The Database Models, Languages, and Architecture; Database System Development Life Cycle; Database Management Systems; Advanced SQL; Web and Data Warehousing and Mining in the Business World; References
Database management
It is important to use a formal design methodology because it provides a structured, mathematical approach to building a reliable database that consolidates all the environments that use it. A design methodology allows the entire design and development effort to proceed with minimal errors, and it helps identify the requirements, specifications, and design levels of the database and data warehouse under development. The planning stage of the consolidated database is very important, as it produces the plans that guide the database's development (Mabogunje, 2015). These plans help manage quality, time, risk, and other issues that might affect the design and development of the database and, eventually, the data warehouse.
The three layers of the 3-level ANSI-SPARC architecture are the physical schema, which defines how data is stored; the conceptual schema, which indexes and relates the data; and the external schema, which defines how information is presented. The architecture is designed to guard and guide data change. Because the physical schema's only job is to define storage, it can change without affecting how external applications interact with the stored data (Pokorný, 2018). The conceptual schema provides a consolidated view of the database, while the external schema can expose richer APIs without requiring changes to the underlying storage mechanisms. The 3-level ANSI-SPARC architecture thus promotes data independence, which saves time in the long run through the conceptual schema's emphasis on data mapping.
Data administrator and database administrator
A data administrator is the person who gathers data requirements, analyzes and designs data, and classifies data types. The two primary roles of a data administrator are establishing the data standards to be applied in databases and setting the policies that govern data security, access, usage, flow, and authorization in an organization. Lesser duties include assisting by developing data resources and enabling the sharing of data across applications.
This document describes a proposed smart health guide app that would allow users to scan food product barcodes and receive guidance on whether that product is suitable for their health condition. The app is intended to help people with common diseases like diabetes, cholesterol, and jaundice make informed choices by checking food nutrients. It would use a machine learning decision tree model trained on product data to analyze barcodes scanned by users and provide consumption recommendations based on their registered health details. The proposed system aims to improve on traditional shopping that does not consider nutritional information. It would retrieve product data from a Firebase database and allow authorized admins to add, update or delete product entries as needed.
Data mining is a significant field in today’s data-driven world, and understanding and implementing its concepts can lead to the discovery of useful insights. This paper discusses the main concepts of data mining, focusing on two in particular: Association Rule Mining and Time Series Analysis.
The proposal begins with an information paper, covering the importance of data management. This includes important concepts related to data quality (validity). It’s followed by a strategic plan to create a Data Management Section/Directorate.
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...IRJET Journal
The document presents a proposed method for robust outsourcing of multi-party datasets while preserving privacy. The method utilizes supermodularity and perturbation techniques. It first pre-processes the dataset to remove unnecessary data. It then replaces attribute values with hierarchies using supermodularity to balance data utility and risk. Association rules are generated and sensitive rules are separated and hidden by decreasing their support levels. Patterns are generated from the encrypted datasets of different parties. Experimental results show the proposed method improves over previous works in terms of lower risk, higher utility, fewer rules, and lower space costs.
The document discusses data preprocessing techniques. It covers why preprocessing is important by addressing issues like incomplete, inaccurate, or inconsistent data. It then describes major tasks in preprocessing like data cleaning, integration, reduction, transformation. Data cleaning techniques discussed include handling missing values, removing noise, and resolving inconsistencies. The goal of preprocessing is to improve data quality and prepare it for data mining.
The document describes a data science project conducted on streaming log data from Cloudera Movies, an online streaming video service. The goals of the project were to understand which user accounts are used most by younger viewers, segment user sessions to improve site usability, and build a recommendation engine. Key steps included exploring and cleaning the data, classifying users as children or adults using a SimRank approach, clustering user sessions to identify behavior patterns, and predicting user ratings through user-user and item-item similarity models to build a recommendation system. Accuracy of 99.64% was achieved in classifying users.
Data Mining on SpamBase,Wine Quality and Communities and Crime DatasetsAnkit Ghosalkar
Application of data mining techniques like Linear Discriminant Analysis (LDA), k-means clustering, Multiple Linear Regression, Principal Component Analysis (PCA), and Logistic Regression on datasets.
This document provides an introduction to data mining concepts including definitions, tasks, challenges, and techniques. It discusses data mining definitions, the data mining process including data preprocessing steps like cleaning, integration, transformation and reduction. It also covers common data mining tasks like classification, clustering, association rule mining and the Apriori algorithm. Overall, the document serves as a high-level overview of key data mining concepts and methods.
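The Apriori idea mentioned above (keep itemsets whose support meets a threshold) can be shown with a deliberately naive enumeration. Real Apriori prunes candidates using the frequent sets of the previous size; this toy version skips that pruning, and the baskets are invented:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return itemsets whose support (fraction of transactions containing
    them) is at least min_support. Naive enumeration, for illustration only."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        any_frequent = False
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(1 for t in transactions if itemset <= t) / n
            if support >= min_support:
                frequent[itemset] = support
                any_frequent = True
        if not any_frequent:  # no frequent set of this size, so none larger
            break
    return frequent

baskets = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk", "butter"},
)]
for itemset, support in frequent_itemsets(baskets, 0.5).items():
    print(sorted(itemset), support)
```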
Data Mining on SpamBase,Wine Quality and Communities and Crime DatasetsAnkit Ghosalkar
Application of Data Mining Techniques like Linear Discriminant Analysis(LDA), k-means clustering, Multiple Linear Regression, Principle Component Analysis(PCA) and Logistic Regression on Datasets
This document provides an introduction to data mining concepts including definitions, tasks, challenges, and techniques. It discusses data mining definitions, the data mining process including data preprocessing steps like cleaning, integration, transformation and reduction. It also covers common data mining tasks like classification, clustering, association rule mining and the Apriori algorithm. Overall, the document serves as a high-level overview of key data mining concepts and methods.
Donor Datamining | Jalaj Nautiyal
Table of Contents
1.0 Executive Summary..............................................................................................................................3
2.0 Data Set Description.............................................................................................................................4
2.1 Attributes.............................................................................................................................................4
2.1.1 Location ..................................................................................................................................4
2.1.2 Income Level ..........................................................................................................................5
2.1.3 Education ................................................................................................................................6
2.1.4 Median Home Value...............................................................................................................6
2.1.5 Number of Donors.................................................................................7
2.1.6 Dollar Amount of Gift ............................................................................7
2.1.7 Average Dollar Amounts of Gifts............................................................................8
2.1.8 Military Association................................................................................................................9
2.1.9 Type of Donor and RFA .........................................................................................................9
2.1.10 Dollar Gift in 97NK............................................................................................................10
2.1.11 Per Capita............................................................................................................................11
2.1.12 Correlation Matrix ..............................................................................................................11
2.1.13 Regression Coefficients ......................................................................................................12
2.2 Attribute Data-Type...........................................................................................................................13
3.0 Missing Data.......................................................................................................................................14
4.0 Attribute Selection..............................................................................................................................15
4.1 Methodology .....................................................................................................................................15
Step-1: Logic Based Selection.......................................................................................................15
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.................................26
5.0 Models ................................................................................................................................................31
5.1 Model-1.............................................................................................................................................31
5.2 Model-2 ............................................................................................32
5.3 Model-3 .............................................................................................................................................34
5.4 Model-4 .............................................................................................................................................37
5.5 Model-5 .............................................................................................................................................38
5.6 Model-6 .............................................................................................................................................40
5.8 Model-8 ............................................................................................43
5.9 Model-9 .............................................................................................................................................44
P a g e 1 | 100
6.0 Different number of attributes but same number of records using NaiveBayes Model........................46
6.1 5 Attributes ....................................................................................................................................46
6.2 10 Attributes.....................................................................................................................................49
6.3 15 Attributes......................................................................................................................................52
6.4 20 Attributes......................................................................................................................................56
6.5 25 Attributes......................................................................................................................................59
6.6 30 Attributes......................................................................................63
6.7 35 Attributes......................................................................................66
6.8 40 Attributes......................................................................................69
7.0 Performance Metrics.............................................................................................................................74
7.1 Calculations for Each Model – Precision and Sensitivity..................................................................74
7.2 Calculations for Each Model – Specificity and NPV ........................................................................77
7.3 Calculations for Each Model – Accuracy and F-Measure.................................................................80
7.4 Comparison of performance of different models based on different algorithms and settings...........83
7.5 Comparison of performance of NaiveBayes Algorithm with different number of attributes............90
8.0 Error Analysis.......................................................................................................................................97
8.1 NaiveBayes .......................................................................................................................................97
8.2 OneR Model......................................................................................................................................98
1.0 Executive Summary
As part of this project we were provided with a veterans dataset consisting of many different attributes collected from various sources, such as the census and mailing lists. The objective of the project is to use the dataset to identify probable donors.
Various data mining concepts taught in the course were applied to analyze the dataset, build models, and identify probable donors. The dataset was first examined for data types and missing data; Weka was used to preprocess, analyze, and interpret the data.
Some of the important attributes were selected from the target dataset and analyzed in depth to understand the relationships and cross-correlations among them, and how much of the variation in the selected attributes explains the probability of a future donor.
A two-pronged approach was used to select attributes that can predict probable donors. In the first iteration, a logic-based selection of attributes from the target dataset was conducted. In addition, Weka's ChiSquaredAttributeEval, GainRatioAttributeEval, and InfoGainAttributeEval evaluators were used to identify the top-ranked attributes. I then compared the results of the two approaches and included or excluded a few attributes from the final list. The methodology and results are described in detail in this paper.
The final list of attributes was used to build NaiveBayes, J48Graft, DecisionStump, OneR, and ZeroR models, to confirm that the attributes selected by the above methodology are useful for predicting probable donors.
The generated models were then applied to the final dataset, and various statistics were computed in Weka to assess how well each model fit the data. A list of control numbers was selected from this final dataset and submitted.
2.0 Data Set Description
We have a dataset with 47,705 records, each with 441 columns. The different attributes and their associated statistics are described below.
2.1 Attributes
2.1.1 Location
The geographic location of a person is important for identifying probable future donors. Out of the many location-related attributes in the dataset, I analyzed states and the impact of various income levels to obtain the distribution of family and household incomes; a few of the states analyzed are shown below.
Figure 1 Income Distribution-California
Figure 2 Income Distribution - Colorado
Figure 3 Income Distribution - Florida
2.1.2 Income Level
Expendable income is an important consideration when estimating future donors, so the number of people below the poverty line in each state also gives an idea of which states probable donors may come from.
Figure 4 Income Level
2.1.3 Education
Education is an important parameter in my analysis, and the number of magazine subscriptions serves as a proxy for education level: the higher the education, the higher the probability of the person becoming a donor.
Figure 5 Magazine Subscription
2.1.4 Median Home Value
Income is an important parameter for identifying probable donors. The following chart shows the distribution of HV1 (median home value) across the 50 states; the higher the HV1 value, the more probable donors a state is likely to yield.
Figure 6 Median Home Value
2.1.5 Number of Donors
The current number of donors in each state, derived from the target dataset, is important information for identifying probable future donors; the following chart provides it.
Figure 7 Number of Donors
2.1.6 Dollar Amount of Gift
The dollar amount of lifetime gifts to date by current donors is an important attribute for estimating the probability of a future donor. RAMNTALL is the dollar amount of lifetime gifts to date; the following chart shows its average value in each state.
Figure 8 Life Time Dollar Amount
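The per-state averaging behind these charts can be sketched with a pandas groupby. The frame below is a tiny made-up sample; in the actual dataset each record carries a state code and its RAMNTALL value.

```python
# Hedged sketch of the per-state averaging behind the charts (toy data).
import pandas as pd

df = pd.DataFrame({
    "STATE": ["CA", "CA", "FL", "FL", "CO"],
    "RAMNTALL": [120.0, 80.0, 60.0, 40.0, 90.0],
})
# mean lifetime gift amount per state, as plotted in the charts
avg_by_state = df.groupby("STATE")["RAMNTALL"].mean()
print(avg_by_state)  # CA 100.0, CO 90.0, FL 50.0
```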
LASTGIFT is the dollar amount of the most recent gift. It is important for understanding how recently a current donor gave. The following chart shows the average LASTGIFT for each state.
Figure 9 Average Gift Amount of most recent gift
2.1.7 Average Dollar Amounts of Gifts
The average dollar amount of gifts to date is also an important attribute for estimating future donors. The following chart shows the state-by-state average of AVGGIFT, the average dollar amount of gifts to date.
Figure 10 Avg Dollar Amt of gifts to date
2.1.8 Military Association
Past or current association with the military is an important attribute for estimating probable donors. WWIIVETS gives the percentage of WWII veterans; the following chart shows the average of this percentage for each state.
Figure 11 Avg World War II Vets
2.1.9 Type of Donor and RFA
The following chart provides information on Super donors who participated in RFA_3 through RFA_23 in each state; RFA is an important attribute for our objective of identifying probable donors.
Figure 12 Number of Super Donors for RFA_3 to RFA_23
The following chart provides the same information for Active donors who participated in RFA_3 through RFA_23 in each state.
Figure 13 Number of Active Donors for RFA_3 to RFA_23
2.1.10 Dollar Gift in 97NK
TARGET_D is the dollar amount associated with the response to the 97NK mailing. The chart below shows the sum of TARGET_D for each state. This information is important because it indicates the spending of current donors.
Figure 14 Sum of Dollar Gift Amt to 97 Mailing list
2.1.11 Per Capita
The per capita income of donors (IC5) is an important attribute for identifying future donors. The following chart shows the average per capita income of donors across states.
Figure 15 Per Capita Income
2.1.12 Correlation Matrix
The following table shows the correlation matrix for some of the attributes, showing how different attributes affect whether a person is a probable donor. It guided my choice of attributes worth considering for final model selection.
Figure 16 Correlation Matrix
2.1.13 Regression Coefficients
Based on the correlation coefficients, I ran a regression in Excel to determine how well these attributes explain TARGET_B and which variables have explanatory power, as a starting point for filtering the number of attributes.
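The correlation-then-regression check can be sketched as follows, with numpy least squares in place of the Excel regression. AVGGIFT, LASTGIFT, and TARGET_B are attribute names from the dataset, but the values here are synthetic.

```python
# Hedged sketch: correlation matrix plus an OLS fit of TARGET_B on two
# predictors, using synthetic data in place of the veteran dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AVGGIFT": rng.normal(15.0, 5.0, 200),
    "LASTGIFT": rng.normal(17.0, 6.0, 200),
})
df["TARGET_B"] = (0.03 * df["AVGGIFT"] + 0.02 * df["LASTGIFT"]
                  + rng.normal(0.0, 0.2, 200) > 0.8).astype(int)

print(df.corr())  # pairwise correlation matrix, as in Figure 16

# ordinary least squares: TARGET_B ~ intercept + AVGGIFT + LASTGIFT
X = np.column_stack([np.ones(len(df)), df["AVGGIFT"], df["LASTGIFT"]])
coef, *_ = np.linalg.lstsq(X, df["TARGET_B"].to_numpy(), rcond=None)
print("coefficients (intercept, AVGGIFT, LASTGIFT):", coef)
```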
2.2 Attribute Data-Type
• I changed the datatype of attributes IC6 through IC23 from varchar to int, as I wanted to perform calculations such as sum and average on their values.
• I changed the datatype of attribute HHAS4 from varchar to int for the same reason.
• Table 1 below lists a few more attributes whose datatype I changed from varchar to int so that such calculations could be performed.
Table 1
Attribute   Original Datatype   Changed Datatype
MBGARDEN    varchar             int
MBCRAFT     varchar             int
MBBOOKS     varchar             int
MBCOLECT    varchar             int
MAGFAML     varchar             int
MAGFEM      varchar             int
MAGMALE     varchar             int
PUBGARDN    varchar             int
PUBHLTH     varchar             int
PUBDOITY    varchar             int
PUBNEWFN    varchar             int
PUBPHOTO    varchar             int
PUBOPP      varchar             int
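The varchar-to-int conversions in Table 1 can be sketched in pandas, assuming the data has been loaded into a DataFrame; the sample values below are made up.

```python
# Hedged sketch of the Table 1 datatype conversions (toy values).
import pandas as pd

df = pd.DataFrame({"MBGARDEN": ["0", "1", "2"], "MBCRAFT": ["3", "", "1"]})

to_int = ["MBGARDEN", "MBCRAFT"]  # extend with MBBOOKS, MAGFAML, PUB..., etc.
for col in to_int:
    # empty strings become NaN first, then nullable integers
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

print(df["MBGARDEN"].sum())  # numeric aggregations now work: 3
```

The nullable Int64 dtype keeps blank entries as missing values instead of failing the conversion, so sums and averages simply skip them.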
3.0 Missing Data
Analyzing the target dataset, I uncovered the following missing data, listed together with the preprocessing performed so the data works with the different algorithms.
3.1 Gender
I used the TCODE value to infer gender for records with missing gender values.
3.2 Zip
Some zip values had a trailing '-'. I corrected these values by removing the trailing '-'.
3.3 Age and DOB
There were many null values for the age attribute. The dataset also contains a DOB (date of birth) attribute, which I consulted to fill in the missing ages; however, for every record with a missing age, the DOB value was also missing.
I then considered replacing the missing ages with a statistic, such as the average age of current donors per state. Because the number of records with a null age was very large, I decided that imputing them could bias the dataset significantly, so I left them missing.
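The cleaning steps above can be sketched on a toy frame. The TCODE-to-gender mapping shown here (1 for a male title, 2 for a female title) is an assumption for illustration, not the paper's exact rule; AGE is deliberately left missing, matching the decision above.

```python
# Hedged sketch of the missing-data handling (toy values; assumed TCODE map).
import pandas as pd

df = pd.DataFrame({
    "GENDER": ["M", None, None],
    "TCODE": [1, 2, 1],
    "ZIP": ["61081-", "30738", "94117-"],
    "AGE": [54.0, None, None],
})

tcode_to_gender = {1: "M", 2: "F"}                      # assumed mapping
df["GENDER"] = df["GENDER"].fillna(df["TCODE"].map(tcode_to_gender))
df["ZIP"] = df["ZIP"].str.rstrip("-")                   # drop trailing '-'
# AGE stays NaN: with this many missing values, imputation would bias the data

print(df["GENDER"].tolist())  # ['M', 'F', 'M']
print(df["ZIP"].tolist())     # ['61081', '30738', '94117']
```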
4.0 Attribute Selection
4.1 Methodology
Step-1: Logic Based Selection
For attribute selection, I analyzed the target dataset with the objective of identifying possible donors. The assumptions I used when selecting attributes from the list of candidates are listed below; these became the evaluating conditions for my logic-based selection:
1. Higher education implies a higher probability of the person being a future donor.
2. Higher income (per capita income, income level, number of vehicles, median income, number of employed persons, etc.) implies a higher probability of the person being a future donor.
3. A house in a good locality implies a higher probability of the person being a future donor.
4. A bigger house implies a higher probability of the person being a future donor.
5. A person renting and paying a high rent implies a higher probability of the person being a future donor.
6. A person on active duty, or who served in the military in the past, implies a higher probability of the person being a future donor.
7. The dollar amount, frequency, and recency of gifts all give a good indication of the probability of future donations.
8. The number, type, and recency of promotions, and a person's responsiveness to these efforts, give a good indication of the probability of future donations.
9. I also identified some attributes that are negatively correlated with the possibility of future donations, making these variables equally important for estimating future donors:
a. Persons living on social security will probably not be future donors.
b. Persons working in professions without much expendable income will probably not be future donors.
c. Persons living in rural areas are less likely to donate.
In total, I identified 122 attributes out of the 448 possible attributes.
After selecting these 122 attributes, I applied Weka's Chi-Squared Attribute Evaluator to cross-check my logic-based selection and apply learnings from the class. Its output was analyzed and compared with my subset of 122 attributes; the following 95 attributes matched both my logic-based selection and the Chi-Squared evaluator output.
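The Chi-Squared ranking used for this cross-check can be sketched outside Weka as well; the snippet below uses scikit-learn's chi-squared scorer on synthetic non-negative features. In Weka this corresponds to ChiSquaredAttributeEval with the Ranker search method.

```python
# Hedged sketch of Chi-Squared attribute ranking (synthetic data).
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(300, 5)).astype(float)   # chi2 needs X >= 0
# class label driven by attributes 0 and 3, plus noise
y = (X[:, 0] + X[:, 3] + rng.normal(0.0, 1.0, 300) > 9).astype(int)

scores, p_values = chi2(X, y)
ranking = np.argsort(scores)[::-1]                     # best attribute first
print("chi2 scores:", np.round(scores, 2))
print("attribute ranking:", ranking.tolist())
```

The ranked list can then be intersected with a hand-picked, logic-based subset, mirroring the two-pronged selection described above.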
4.2 Included Attributes
Attribute Name Attribute Description Reason
TARGET_D
Donation Amount (in $) associated with the
Response to 97NK Mailing
Donor amount gives indication of the
amount of donation a donor can
provide based on the 1997 donation
history
IC5 Per Capita Income
Per capita income indicates the wealth of the donor base; the higher it is, the higher the likely probability of donation
ZIP Zipcode
Zip code groups donors with similar income ranges and hence similar probability of donation
POP901
Number of Persons in donor’s neighborhood, as
collected from the 1990 US Census.
The number of persons in the donor's neighborhood is indicative of income range, as richer neighborhoods tend to have fewer persons.
AVGGIFT Average dollar amount of gifts to date
Average amount of gift to date is an
attribute which provides good
indication of probability of donation.
HV1
Median Home Value in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median home value is indication of
income and higher the median home
value higher probability of donation
HV2
Average Home Value in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average home value is an indication of income; the higher the average home value, the higher the probability of donation
RAMNTALL Dollar amount of lifetime gifts to date
Total Dollar amount of gift to date is
an attribute which provides good
indication of probability of donation.
IC3
Average Household Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average household income is
indication of income and higher the
average household income higher
probability of donation
IC2
Median Family Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median Family income is indication
of income and higher the median
family income higher probability of
donation
IC4
Average Family Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Average Family income is indication
of income and higher the average
family income higher probability of
donation
IC1
Median Household Income in hundreds in donor’s
neighborhood, as collected from the 1990 US
Census.
Median household income is
indication of income and higher the
median household income higher
probability of donation
OSOURCE
Code indicating which mailing list the donor was
originally acquired from
This attribute indicates the chances of
donation based on the marketing
approach to the donor.
LASTGIFT
Dollar amount of most recent gift from giving
history file
The dollar amount of the most recent gift is a good indicator of donation probability: the higher the recent gift, the more likely the donor is to repeat the donation
MAXRAMNT
Dollar amount of largest gift to date from giving
history file
The dollar amount of the largest gift is a good indicator of donation probability: the larger the gift, the more likely the donor is to repeat the donation
RFA_3 Donor's RFA status as of 96NK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_4
Donor's RFA status as of 96TK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_6
Donor's RFA status as of 96LL promotion date from
promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_8
Donor's RFA status as of 96GK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2
Donor's RFA status as of 97NK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
FISTDATE Date of first gift from giving history file
The date of the first gift indicates how long the donor has been involved in donating, making it a good attribute for estimating future donations.
RFA_12
Donor's RFA status as of 96XK promotion date
from promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
MINRAMNT
Dollar amount of smallest gift to date from giving
history file
Dollar amount of the smallest gift is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
RFA_11
Donor's RFA status as of 96X1 promotion date from
promotion history File
RFA is a good measure of the probability that the donor will repeat the donation.
NGIFTALL
Number of lifetime gifts to date from promotion
history File
The number of lifetime gifts to date is a good measure for estimating future donations: the higher it is, the higher the probability of future donations.
RFA_2F Frequency code for RFA_2
The frequency component of the RFA measure indicates how frequent the donor's past donations have been, and is hence a good estimator of future donations
RFA_7 Donor's RFA status as of 96G1 promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_2A Donation Amount code for RFA_2
The amount component of the RFA measure indicates how large the donor's past donations have been, and is hence a good estimator of future donations
RFA_9 Donor's RFA status as of 96CC promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
MAXRDATE Date associated with the largest gift to date
The date of the largest gift is important for understanding how long ago the donor gave their largest gift, and is a good estimator of future donations
NUMPROM Lifetime number of promotions received to date
Lifetime number of promotions is
good estimator for future donations as
it shows how much the donor is
responsive to the marketing effort for
donation
CARDGIFT Number of lifetime gifts to card promotions to date
Number of lifetime gifts to card
promotions is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
RFA_16 Donor's RFA status as of 95LL promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
ODATEDW Date of donor's first gift
The date of the first gift indicates how long the donor has been involved in donating, making it a good attribute for estimating future donations.
NEXTDATE Date of second gift
The date of the second gift likewise indicates how long the donor has been donating, making it a good attribute for estimating future donations.
RFA_14 Donor's RFA status as of 95NK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
HVP1
Percent Home Value >= $200,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_18 Donor's RFA status as of 95GK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_5 Donor's RFA status as of 96SK promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
MINRDATE Date associated with the smallest gift to date
Date of the smallest gift is negatively
correlated to the future donation and
hence is a good attribute to estimate
future donation
HVP2
Percent Home Value >= $150,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
HVP6
Percent Home Value >= $300,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_19 Donor's RFA status as of 95CC promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RFA_10 Donor's RFA status as of 96WL promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
RP1 Percent Renters Paying >= $500 per Month
Rent of home is indication of income
and higher the home rent higher
probability of donation
CARDPROM
Lifetime number of card promotions received to
date.
Number of lifetime gifts to card
promotions is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
RFA_17 Donor's RFA status as of 95G1 promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
HHAS3
Percent Households w/ Interest, Rental or Dividend
Income in donor’s neighborhood, as collected from
the 1990 US Census.
Households with interest, rental or
dividend income is indication of
income and higher the attribute higher
probability of donation
HVP3
Percent Home Value >= $100,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
RFA_13 Donor's RFA status as of 95FS promotion date
RFA is a good measure of the probability that the donor will repeat the donation.
SEC5
Percent Persons in College in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent persons in college is
indication of education and better the
education higher will be probability
of donation
LFC3
Percent Females in Labor Force in donor’s
neighborhood, as collected from the 1990 US
Census.
The percentage of females in the labor force is an indication of income; the higher it is, the higher the probability of donation
HVP5
Percent Home Value >= $50,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
NUMPRM12
Number of promotions received in the last 12
months
Number of promotions received in
last 12 months is good estimator for
future donations as it shows how
much the donor is responsive to the
marketing effort for donation
EC4
Percent Adults 25+ Completed High School or
Equivalency in donor’s neighborhood, as collected
from the 1990 US Census.
Percent adults completing high school
or equivalent education is indication
of education and better the education
higher will be probability of donation
HU5
Percent Seasonal/Recreational Vacant Units in
donor’s neighborhood, as collected from the 1990
US Census.
Percent Seasonal/Recreational vacant
unit is indication of income and
higher the attribute higher probability
of donation
HUR2
Percent >= 6 Room Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 6+ room houses is indication
of income and higher the attribute
higher probability of donation
HVP4
Percent Home Value >= $75,000 in donor’s
neighborhood, as collected from the 1990 US
Census.
Value of home is indication of
income and higher the home value
higher probability of donation
LFC5
Percent Adult Females Employed in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent female in labor is indication
of income and higher the percent in
labor force higher will be probability
of donation
DW1
Percent Single Unit Structure in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Single unit structure is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
HU1
Percent Owner Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
AFC1
Percent Adults in Active Military Service in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent Adults in active military
service is indication of association
with military and higher the percent
in active military service higher will
be probability of donation
VC3
Percent WW2 Veterans Age 16+ in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent WW2 veterans is indication
of association with military and
higher the percent of WW2 veterans
higher will be probability of donation
RP3
Percent Renters Paying >= $300 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Rent of home is indication of income
and higher the home rent higher
probability of donation
DW2
Percent Detached Single Unit Structure in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent detached single unit structure
is negatively correlated to the future
donation and hence is a good attribute
to estimate future donation
RFA_22 Donor's RFA status as of 95XK promotion date
RFA is a good measure of the
probability that the donor will repeat
the donation.
WWIIVETS % WWII Vets
Percent WW2 veterans is indication
of association with military and
higher the percent of WW2 veterans
higher will be probability of donation
HHN2
Percent 2 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 person household is
indication of no future liability on
part of donor and hence the person is
more likely to be future donor
VOC2
Percent Households w/ 2+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent household with 2+ vehicles is
indication of income and higher the
number of vehicles higher probability
of donation
HU2
Percent Renter Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied housing is
positively correlated to the future
donation and hence is a good attribute
to estimate future donation provided
the rent paid is high
AGE Overlay Age
Higher the age higher the probability
of future donations.
HC4
Percent Owner Occupied Structures Built Since
1985 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
LFC7
Percent 2 Parent Earner Families in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent 2 parent earner family is
indication of income and higher the
percent higher probability of donation
RP2
Percent Renters Paying >= $400 per Month in
donor’s neighborhood, as collected from the 1990
US Census.
Rent of home is indication of income
and higher the home rent higher
probability of donation
AFC2
Percent Males in Active Military Service in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males in active military
service is indication of association
with military and higher the percent
of attribute higher will be probability
of donation
IC23
Percent Families w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent families with income >150k
is indication of income and higher the
attribute higher probability of
donation
OCC9
Percent Farmers in donor’s neighborhood, as
collected from the 1990 US Census.
Percent farmers is negatively
correlated to the future donation and
hence is a good attribute to estimate
future donation as farmers generally
do not have expendable surplus
IC14
Percent Households w/ Income >= $150,000 in
donor’s neighborhood, as collected from the 1990
US Census.
Percent households with income
>150k is indication of income and
higher the attribute higher probability
of donation
EC1
Median Years of School Completed by Adults 25+
in donor’s neighborhood, as collected from the 1990
US Census.
Median years of school completed is
indication of education and better the
education higher will be probability
of donation
LASTDATE Date associated with the most recent gift
The date of the most recent gift gives
an idea of how long the donor has
been involved in donation, and is a
good attribute for estimating future donations.
LFC6
Percent Mothers Employed Married and Single in
donor’s neighborhood, as collected from the 1990
US Census.
Percent of mothers employed, married
and single, gives an idea about the
income and liability of the neighborhood
and is a good indication of future
donations.
HUPA6
Percent Renter Occupied, 5+ Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent renter occupied 5+ units is
indication of income and higher the
attribute higher probability of
donation
RFA_24 Donor's RFA status as of 94NK promotion date
RFA is a good measure of the
probability that the donor will repeat
the donation.
HU4
Percent Vacant Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent vacant housing units in donor
neighborhood is good indication of
future donations as higher the vacant
units implies higher number of
second homes.
HHAS1
Percent Households on Social Security in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent households on social security
is negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as person
on social security seldom donate.
HC19
Percent Housing Units w/ Public Sewer Source in
donor’s neighborhood, as collected from the 1990
US Census.
Percent housing with public sewer is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as
housings with public sewer generally
are low income housing
VOC1
Percent Households w/ 1+ Vehicles in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent household with 1+ vehicles is
indication of income and higher the
number of vehicles higher probability
of donation
HC7
Percent Owner Occupied Structures Built Since
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
POP90C2
Percent Population Outside Urbanized Area in
donor’s neighborhood, as collected from the 1990
US Census.
Percent outside urbanized area is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as people
outside urbanized area seldom donate
EIC1
Percent Employed in Agriculture in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent employed in agriculture is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as people
employed in agriculture seldom has
expendable income
HC8
Percent Owner Occupied Structures Built Prior to
1960 in donor’s neighborhood, as collected from the
1990 US Census.
Percent Owner occupied housing is
negatively correlated to the future
donation and hence is a good attribute
to estimate future donation as owner
has only one asset
MALEMILI
% Males active in the Military in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent males active in military is
indication of association with military
and higher the percent higher will be
probability of donation
LFC2
Percent Adult Males in Labor Force in donor’s
neighborhood, as collected from the 1990 US
Census.
Percent adult males in labor force is
indication of income and higher the
percent higher will be probability of
donation
OEDC5
Percent Private Profit Wage or Salaried Worker in
donor’s neighborhood, as collected from the 1990
US Census.
Percent private profit wage or salaried
worker is indication of income and
higher the percent higher will be
probability of donation as generally
private profit wage is high and person
has expendable income
In analyzing the output from the Chi Squared attribute evaluator, I also eliminated some of
the high-ranked attributes from the output generated by Weka. The rationale behind the elimination
was either irrelevance to the objective of identifying future donors or
multicollinearity among the attributes (more than one attribute conveying the same information). The
purpose is to select an appropriate number of attributes that help predict future donors. Following
is the list of attributes that were eliminated from my final attribute selection, with the reason for
elimination.
4.3 Omitted Attributes
Attribute Name Attribute Description Reason
CONTROLN Control number (unique record identifier)
This is a unique identifier number and
adds no value in identifying future donor.
POP903
Number of Households in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute is multicollinear with an
attribute in our final attribute list
(Number of persons in neighborhood –
POP901), and thus adds no additional
value in identifying future donor.
POP902
Number of Families in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
DOB Date of birth of Donor
This attribute is multicollinear with an
attribute in our final attribute list (AGE),
and thus adds no additional value in
identifying future donor.
HHP2
Average Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
HHP1
Median Person Per Household in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with
another eliminated attribute (Number of
households in neighborhood –
POP903), and thus adds no additional
value in identifying future donor.
MSA MSA Code
This is a geographic area code (Metropolitan
Statistical Area) and adds no value in identifying future donor.
ADI ADI Code
This is a media-market code (Area of Dominant
Influence) and adds no value in identifying future donor.
DMA DMA Code
This is a media-market code (Designated Market
Area) and adds no value in identifying future donor.
TPE13 Percent Traveling 15 - 59 Minutes to Work
This attribute adds no value in identifying
future donor as traveling time doesn’t
decide if a person will be future donor.
ETHC3
Percent White Age 60+ in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as person’s race doesn’t
decide if a person will be future donor.
TCODE Donor title code
This attribute adds no value in identifying
future donor as person’s title doesn’t
decide if a person will be future donor.
DW7
Percent Group Quarters in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as group quarters doesn’t
decide if a person will be future donor.
POBC2
Percent Born in State of Residence in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as person’s affinity to a
place doesn’t decide if a person will be
future donor.
MARR4
Percent Never Married in donor’s neighborhood,
as collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
VC4
Percent Veterans Serving After May 1975 Only
in donor’s neighborhood, as collected from the
1990 US Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(Number of military service persons,
active military etc.). Thus adding no
additional value in identifying future
donor.
DW9
Non-Institutional Group Quarters in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as living in group quarters doesn’t
decide if a person will be future donor.
HU3
Percent Occupied Housing Units in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute is multicollinear with more
than one attribute in our final attribute list
(Number of vacant house, house type,
number of houses etc.). Thus adding no
additional value in identifying future
donor.
ETH1
Percent White in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as person’s race doesn’t
decide if a person will be future donor.
MARR1
Percent Married in donor’s neighborhood, as
collected from the 1990 US Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
HHN1
Percent 1 Person Households in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as living alone doesn’t
decide if a person will be future donor.
HHD3
Percent Married Couple Families in donor’s
neighborhood, as collected from the 1990 US
Census.
This attribute adds no value in identifying
future donor as marriage doesn’t decide if
a person will be future donor.
Step-2: Attribute Filtration based on Multiple Attribute Selection Methods.
As the large number of attributes (95) resulted in unsatisfactory statistics for the trained model, I
ran the attribute selection methods available in Weka to gain a better understanding of the
impact of attributes on model accuracy.
I ran ChiSquaredAttributeEval, GainRatioAttributeEval and InfoGainAttributeEval; the following are
the results of these three evaluators from Weka.
Chi Squared Gains Ratio Information Gain
Ranked attributes: Ranked attributes: Ranked attributes:
472 TARGET_D 472 TARGET_D 73 IC5
470 CONTROLN 362 ADATE_2 74 ZIP
203 IC5 363 ADATE_3 76 POP901
5 ZIP 470 CONTROLN 79 HV2
76 POP901 203 IC5 78 HV1
469 AVGGIFT 5 ZIP 77 AVGGIFT
146 HV1 14 MDMAUD 83 IC4
147 HV2 76 POP901 82 IC2
78 POP903 469 AVGGIFT 81 IC3
77 POP902 366 ADATE_6 84 IC1
457 RAMNTALL 147 HV2 80 RAMNTALL
201 IC3 146 HV1 85 OSOURCE
200 IC2 78 POP903 93 FISTDATE
202 IC4 77 POP902 90 RFA_6
199 IC1 364 ADATE_4 8 MAXRDATE
8 DOB 95 ETH12 47 VOC2
2 OSOURCE 8 DOB 9 NUMPROM
464 LASTGIFT 41 PUBPHOTO 91 RFA_8
462 MAXRAMNT 457 RAMNTALL 94 RFA_12
386 RFA_3 475 RFA_2F 15 HVP1
136 HHP2 476 RFA_2A 86 LASTGIFT
135 HHP1 380 ADATE_20 88 RFA_3
387 RFA_4 202 IC4 2 RFA_11
389 RFA_6 201 IC3 45 WWIIVETS
391 RFA_8 479 MDMAUD_A 42 RP3
196 MSA 200 IC2 18 MINRDATE
385 RFA_2 199 IC1 67 POP90C2
466 FISTDATE 464 LASTGIFT 89 RFA_4
395 RFA_12 98 ETH15 30 LFC3
460 MINRAMNT 2 OSOURCE 5 RFA_7
394 RFA_11 385 RFA_2 87 MAXRAMNT
458 NGIFTALL 462 MAXRAMNT 13 NEXTDATE
475 RFA_2F 145 DW9 7 RFA_9
390 RFA_7 367 ADATE_7 34 HU5
476 RFA_2A 386 RFA_3 64 HC19
392 RFA_9 387 RFA_4 60 HUPA6
463 MAXRDATE 409 MAXADATE 66 HC7
197 ADI 460 MINRAMNT 69 HC8
410 NUMPROM 80 POP90C2 38 DW1
329 POBC2 81 POP90C3 28 RFA_13
134 MARR4 410 NUMPROM 56 IC14
193 RP2 171 ETHC5 6 RFA_2A
304 AFC2 221 IC23 70 MALEMILI
221 IC23 412 NUMPRM12 55 OCC9
312 VC4 187 HUPA3 12 ODATEDW
262 OCC9 3 TCODE 54 IC23
212 IC14 399 RFA_16 40 AFC1
145 DW9 332 LSC3
152 HU3 397 RFA_14
290 EC1 290 EC1
465 LASTDATE 91 ETH8
84 ETH1 334 VOC1
249 LFC6 356 HC20
190 HUPA6 235 TPE7
407 RFA_24 174 HVP2
153 HU4 465 LASTDATE
222 HHAS1 231 TPE3
355 HC19 339 HC3
334 VOC1 368 ADATE_8
131 MARR1 401 RFA_18
343 HC7 327 ANC15
125 HHN1 87 ETH4
80 POP90C2 35 MAGMALE
267 EIC1 331 LSC2
344 HC8 28 HIT
44 MALEMILI 238 PEC1
245 LFC2 154 HU5
157 HHD3 190 HUPA6
287 OEDC5 177 HVP5
I then conducted an overlap (intersection) analysis of the attributes across the above evaluator
methods and selected the highly ranked attributes common to these results as the final set of
45 attributes, listed below.
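The overlap analysis amounts to a set intersection over the evaluator rankings. A minimal sketch in Python; the lists below are short illustrative excerpts of each evaluator's top-ranked attributes, not the full Weka output (TARGET_D and CONTROLN are omitted here since they were eliminated):

```python
# Illustrative excerpts of the top-ranked attributes from each evaluator.
chi_squared = ["IC5", "ZIP", "POP901", "AVGGIFT", "HV1", "HV2", "RAMNTALL", "DOB"]
gain_ratio  = ["ADATE_2", "IC5", "ZIP", "POP901", "AVGGIFT", "HV2", "HV1", "RAMNTALL"]
info_gain   = ["IC5", "ZIP", "POP901", "HV2", "HV1", "AVGGIFT", "IC4", "RAMNTALL"]

# Attributes ranked highly by all three evaluators are strong candidates
# for the final attribute list.
common = set(chi_squared) & set(gain_ratio) & set(info_gain)
print(sorted(common))
```

Attributes that appear in only one or two rankings (DOB, ADATE_2, IC4 above) then need a judgment call before inclusion.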
Final Attributes Attribute Description
IC5 Per Capita Income
ZIP Zipcode
POP901 Number of Persons in donor’s neighborhood, as collected from the 1990 US Census.
AVGGIFT Average dollar amount of gifts to date
HV1 Median Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
HV2 Average Home Value in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC4 Average Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC3 Average Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC2 Median Family Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
IC1 Median Household Income in hundreds in donor’s neighborhood, as collected from the 1990 US Census.
RAMNTALL Dollar amount of lifetime gifts to date
DOB Date of birth of Donor
OSOURCE Code indicating which mailing list the donor was originally acquired from
RFA_4 Donor's RFA status as of 96TK promotion date from promotion history File
RFA_6 Donor's RFA status as of 96LL promotion date from promotion history File
RFA_8 Donor's RFA status as of 96GK promotion date from promotion history File
RFA_3 Donor's RFA status as of 96NK promotion date
FISTDATE Date of first gift from giving history file
RFA_12 Donor's RFA status as of 96XK promotion date from promotion history File
MAXRAMNT Dollar amount of largest gift to date from giving history file
RFA_2 Donor's RFA status as of 97NK promotion date from promotion history File
RFA_9 Donor's RFA status as of 96CC promotion date
RFA_7 Donor's RFA status as of 96G1 promotion date
RFA_11 Donor's RFA status as of 96X1 promotion date from promotion history File
RFA_2A Donation Amount code for RFA_2
RFA_2F Frequency code for RFA_2
HVP2 Percent Home Value >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
HVP6 Percent Home Value >= $300,000 in donor’s neighborhood, as collected from the 1990 US Census.
RFA_18 Donor's RFA status as of 95GK promotion date
WWIIVETS % WWII Vets
HHAS3
Percent Households w/ Interest, Rental or Dividend Income in donor’s neighborhood, as collected from the 1990
US Census.
HUR2 Percent >= 6 Room Housing Units in donor’s neighborhood, as collected from the 1990 US Census.
NGIFTALL Number of lifetime gifts to date from promotion history File
HVP5 Percent Home Value >= $50,000 in donor’s neighborhood, as collected from the 1990 US Census.
CARDGIFT Number of lifetime gifts to card promotions to date
LASTGIFT Dollar amount of most recent gift from giving history file
AFC1 Percent Adults in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
AFC2 Percent Males in Active Military Service in donor’s neighborhood, as collected from the 1990 US Census.
IC23 Percent Families w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
IC14 Percent Households w/ Income >= $150,000 in donor’s neighborhood, as collected from the 1990 US Census.
EC1
Median Years of School Completed by Adults 25+ in donor’s neighborhood, as collected from the 1990 US
Census.
LASTDATE Date associated with the most recent gift
VOC1 Percent Households w/ 1+ Vehicles in donor’s neighborhood, as collected from the 1990 US Census.
POP90C2 Percent Population Outside Urbanized Area in donor’s neighborhood, as collected from the 1990 US Census.
5.0 Models
5.1 Model-1
NaiveBayes 10 Fold Cross Validation Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
8,843 1,156 a = 0
1,178 275 b = 1
• TP - 8,843 non-donor instances were correctly identified as non-donors by the model.
• TN- 275 donor instances were correctly identified as donors by the model.
• FP- 1,178 donor instances were incorrectly identified as non-donors by the model.
• FN – 1,156 non-donor instances were incorrectly identified as donors by the model.
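The metrics Weka reports (accuracy, precision, recall, F-measure) can be recomputed by hand from this matrix. A minimal Python sketch, treating non-donor as the positive class to match the TP/TN/FP/FN labels used in this report:

```python
# Metrics for Model-1 recomputed from the confusion matrix above.
# Non-donor (class 0) is treated as the positive class here.
tp, fn = 8843, 1156   # actual non-donors: predicted non-donor / donor
fp, tn = 1178, 275    # actual donors:     predicted non-donor / donor

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f={f_measure:.4f}")
```

Note that accuracy alone is flattering here: most instances are non-donors, so a model that rarely predicts "donor" still scores close to 0.80.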
5.2 Model-2:
NaiveBayes 10 Fold Cross Validation with Test Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3265 458 a = 0
374 110 b = 1
• TP – 3,265 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP- 374 donor instances were incorrectly identified as non-donors by the model.
• FN – 458 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3289 434 a = 0
367 117 b = 1
• TP – 3,289 non-donor instances were correctly identified as non-donors by the model.
• TN- 117 donor instances were correctly identified as donors by the model.
• FP- 367 donor instances were incorrectly identified as non-donors by the model.
• FN – 434 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is
not very high.
• Hence the model does not overfit the data.
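This check amounts to comparing the accuracies computed from the two confusion matrices above; a minimal sketch:

```python
# Compare Model-2 accuracy on the supplied test set vs. the evaluation set,
# using the two confusion matrices above.
def accuracy(matrix):
    # matrix rows: actual class; columns: predicted class
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

test_acc = accuracy([[3265, 458], [374, 110]])   # test set
eval_acc = accuracy([[3289, 434], [367, 117]])   # evaluation set

# A small gap between the two accuracies suggests the model generalizes
# rather than memorizing the training data.
print(round(test_acc, 4), round(eval_acc, 4), round(abs(test_acc - eval_acc), 4))
```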
Other Models
5.3 Model-3
J48Graft Training Set Model with Test Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
J48Graft Training Set Model with Evaluation Set Statistics
Algorithm: J48Graft algorithm was used.
Test Options:
• J48Graft algorithm model based on training set.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3,723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Snapshot of Tree
The number of non-donors in the training set for this algorithm was 11,452 and the number of
donors was 1,453. When the J48graft algorithm was run, it used the majority class (non-donors) as the
root node and did not create any further classification beyond what is shown in the figure below.
Conclusion:
The model classified every instance as a non-donor, the majority class in the training set.
The model failed to identify any True Negatives or False Negatives.
5.4 Model-4
Decision Stump 10 Fold Cross Validation Model Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
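A decision stump is a one-level decision tree: a single threshold split on one attribute. A minimal sketch on toy numbers (not the donor data); `fit_stump` is an illustrative helper, not Weka's implementation:

```python
# Decision stump: try every threshold on one numeric attribute and keep
# the split with the fewest misclassifications. Toy data below.
def fit_stump(values, labels):
    best = None
    for t in sorted(set(values)):
        for low, high in ((0, 1), (1, 0)):  # which side predicts which class
            pred = [high if v >= t else low for v in values]
            errors = sum(p != y for p, y in zip(pred, labels))
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    return best

# toy sample: donors (1) have higher values of the attribute
values = [5, 8, 12, 20, 25, 40]
labels = [0, 0, 0, 1, 1, 1]
errors, threshold, low, high = fit_stump(values, labels)
print(errors, threshold)
```

When no single split beats "predict the majority class everywhere", the stump degenerates to exactly the all-non-donor behavior seen in the matrix above.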
5.5 Model-5
Decision Stump 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3732 0 a = 0
484 0 b = 1
• TP - 3732 non-donor instances were correctly identified as non-donors by the model.
• TN - 0 donor instances were correctly identified as donors by the model.
• FP - 484 donor instances were incorrectly identified as non-donors by the model.
• FN - 0 non-donor instances were incorrectly identified as donors by the model.
Decision Stump Cross Validation Model with Evaluation Set Statistics
Algorithm: Decision Stump algorithm was used.
Test Options:
• Decision Stump algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, the majority class in the dataset.
• The model failed to identify any True Negatives or False Negatives.
5.6 Model-6
OneR 10 Fold Cross Validation Model Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9571 428 a = 0
1406 47 b = 1
• TP – 9,571 non-donor instances were correctly identified as non-donors by the model.
• TN- 47 donor instances were correctly identified as donors by the model.
• FP- 1406 donor instances were incorrectly identified as non-donors by the model.
• FN - 428 non-donor instances were incorrectly identified as donors by the model.
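OneR builds its entire rule from the single attribute whose per-value majority prediction makes the fewest errors, which is why it can pick up a few donors (47 above) that the one-level baselines miss. A minimal sketch on a toy discretized dataset (attribute values are illustrative):

```python
# OneR: for each attribute, predict the majority class per attribute
# value, and keep the attribute whose rule makes the fewest errors.
from collections import Counter, defaultdict

def one_r(rows, labels):
    best = None
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[a]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best

rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = [1, 1, 0, 0]  # attribute 0 separates the classes perfectly
errors, attr, rule = one_r(rows, labels)
print(errors, attr, rule)
```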
5.7 Model-7
OneR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3564 159 a = 0
455 29 b = 1
• TP – 3564 non-donor instances were correctly identified as non-donors by the model.
• TN- 29 donor instances were correctly identified as donors by the model.
• FP- 455 donor instances were incorrectly identified as non-donors by the model.
• FN - 159 non-donor instances were incorrectly identified as donors by the model.
OneR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: OneR algorithm was used.
Test Options:
• OneR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3566 157 a = 0
454 30 b = 1
• TP – 3566 non-donor instances were correctly identified as non-donors by the model.
• TN- 30 donor instances were correctly identified as donors by the model.
• FP- 454 donor instances were incorrectly identified as non-donors by the model.
• FN – 157 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is
not very high.
• Hence the model does not overfit the data.
5.8 Model-8:
ZeroR 10 Fold Cross Validation Model Statistics
Algorithm: ZeroR algorithm was used.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
9999 0 a = 0
1453 0 b = 1
• TP – 9,999 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 1453 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
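ZeroR simply predicts the majority class for every instance, which is why the donor column of the matrix above is all zeros: with 9,999 non-donors against 1,453 donors, every instance is predicted non-donor. A minimal sketch:

```python
# ZeroR: ignore all attributes and always predict the majority class.
from collections import Counter

def zero_r(labels):
    return Counter(labels).most_common(1)[0][0]

# class distribution from the training data above
labels = [0] * 9999 + [1] * 1453
majority = zero_r(labels)
print(majority)  # every prediction is this class
```

ZeroR is useful only as a baseline: any real model should beat its accuracy of 9,999 / 11,452.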
5.9 Model-9
ZeroR 10 Fold Cross Validation Model with Test Set Statistics
Algorithm: ZeroR algorithm was used to generate model.
Test Options:
• ZeroR algorithm with 10 fold cross validation settings.
• In addition, supplied test set was used for running the model on test dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
ZeroR 10 Fold Cross Validation Model with Evaluation Set Statistics
Algorithm: ZeroR algorithm was used to generate the model.
Test Options:
• ZeroR algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed Confusion Matrix for the model generated.
a b Classified as
3723 0 a = 0
484 0 b = 1
• TP – 3723 non-donor instances were correctly identified as non-donors by the model.
• TN- 0 donor instances were correctly identified as donors by the model.
• FP- 484 donor instances were incorrectly identified as non-donors by the model.
• FN – 0 non-donor instances were incorrectly identified as donors by the model.
Conclusion:
• The model classified every instance as a non-donor, the majority class in the dataset.
• The model failed to identify any donors: there are no True Negatives, and every donor
instance became a False Positive.
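The behaviour above is exactly what a majority-class baseline does. A minimal sketch (illustrative only; Weka's ZeroR works internally, and these helper names are hypothetical):

```python
# Sketch of the majority-class baseline that ZeroR implements: it ignores
# every attribute and always predicts the most frequent class in the data.
from collections import Counter

def zero_r_fit(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_predict(majority, n):
    """Predict the majority class for every one of n instances."""
    return [majority] * n

# With 9,999 non-donors (0) and 1,453 donors (1), the majority class is 0,
# so every donor is misclassified and donor-class recall is 0.
train = [0] * 9999 + [1] * 1453
majority = zero_r_fit(train)
preds = zero_r_predict(majority, len(train))
print(majority)                     # 0
print(sum(p == 1 for p in preds))   # 0 donors predicted
```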
6.0 Different Numbers of Attributes with the Same Number of Records
using the NaiveBayes Model
Note: In all the models below, TARGET_B has been used as the class attribute.
6.1 5 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1 were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9795 204 a = 0
1415 38 b = 1
• TP – 9795 non-donor instances were correctly identified as non-donors by the model.
• TN- 38 donor instances were correctly identified as donors by the model.
• FP- 1415 donor instances were incorrectly identified as non-donors by the model.
• FN – 204 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3649 74 a = 0
473 11 b = 1
• TP – 3649 non-donor instances were correctly identified as non-donors by the model.
• TN- 11 donor instances were correctly identified as donors by the model.
• FP- 473 donor instances were incorrectly identified as non-donors by the model.
• FN – 74 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 5 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3651 72 a = 0
471 13 b = 1
• TP – 3651 non-donor instances were correctly identified as non-donors by the model.
• TN- 13 donor instances were correctly identified as donors by the model.
• FP- 471 donor instances were incorrectly identified as non-donors by the model.
• FN – 72 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
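All of these runs use the same 10-fold cross-validation protocol: the data is split into ten folds, and each fold serves once as the held-out test set while the other nine are used for training. Weka performs this internally; a minimal sketch of the fold construction (the function name is ours, purely illustrative):

```python
# Minimal sketch of 10-fold cross-validation index construction.
# Weka's evaluator builds and evaluates the folds itself; this only
# illustrates the splitting scheme.
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; each instance is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 9,999 non-donors + 1,453 donors = 11,452 instances, as in the matrices above.
folds = list(k_fold_indices(11452, k=10))
print(len(folds))                           # 10
print(sum(len(test) for _, test in folds))  # 11452
```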
6.2 10 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1 were the attributes used to create
the model.
NaiveBayes 10 Fold Cross Validation Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9320 679 a = 0
1329 124 b = 1
• TP – 9320 non-donor instances were correctly identified as non-donors by the model.
• TN- 124 donor instances were correctly identified as donors by the model.
• FP- 1329 donor instances were incorrectly identified as non-donors by the model.
• FN – 679 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3472 251 a = 0
438 46 b = 1
• TP - 3472 non-donor instances were correctly identified as non-donors by the model.
• TN- 46 donor instances were correctly identified as donors by the model.
• FP- 438 donor instances were incorrectly identified as non-donors by the model.
• FN – 251 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 10 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3465 258 a = 0
439 45 b = 1
• TP - 3465 non-donor instances were correctly identified as non-donors by the model.
• TN- 45 donor instances were correctly identified as donors by the model.
• FP- 439 donor instances were incorrectly identified as non-donors by the model.
• FN – 258 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.3 15 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6 were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9493 506 a = 0
1351 102 b = 1
• TP - 9493 non-donor instances were correctly identified as non-donors by the model.
• TN- 102 donor instances were correctly identified as donors by the model.
• FP- 1351 donor instances were incorrectly identified as non-donors by the model.
• FN - 506 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3525 198 a = 0
442 42 b = 1
• TP - 3525 non-donor instances were correctly identified as non-donors by the model.
• TN- 42 donor instances were correctly identified as donors by the model.
• FP- 442 donor instances were incorrectly identified as non-donors by the model.
• FN - 198 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 15 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3514 209 a = 0
445 39 b = 1
• TP - 3514 non-donor instances were correctly identified as non-donors by the model.
• TN - 39 donor instances were correctly identified as donors by the model.
• FP - 445 donor instances were incorrectly identified as non-donors by the model.
• FN - 209 non-donor instances were incorrectly identified as donors by model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.4 20 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT were the attributes used
to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9405 594 a = 0
1309 144 b = 1
• TP - 9405 non-donor instances were correctly identified as non-donors by the model.
• TN- 144 donor instances were correctly identified as donors by the model.
• FP- 1309 donor instances were incorrectly identified as non-donors by the model.
• FN - 594 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3477 246 a = 0
431 53 b = 1
• TP - 3477 non-donor instances were correctly identified as non-donors by the model.
• TN- 53 donor instances were correctly identified as donors by the model.
• FP- 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 246 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 20 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3484 239 a = 0
431 53 b = 1
• TP - 3484 non-donor instances were correctly identified as non-donors by the model.
• TN - 53 donor instances were correctly identified as donors by the model.
• FP - 431 donor instances were incorrectly identified as non-donors by the model.
• FN - 239 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.5 25 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
9066 933 a = 0
1239 214 b = 1
• TP - 9066 non-donor instances were correctly identified as non-donors by the model.
• TN- 214 donor instances were correctly identified as donors by the model.
• FP- 1239 donor instances were incorrectly identified as non-donors by the model.
• FN - 933 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3350 373 a = 0
408 76 b = 1
• TP - 3350 non-donor instances were correctly identified as non-donors by the model.
• TN- 76 donor instances were correctly identified as donors by the model.
• FP- 408 donor instances were incorrectly identified as non-donors by the model.
• FN - 373 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 25 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3380 343 a = 0
386 98 b = 1
• TP - 3380 non-donor instances were correctly identified as non-donors by the model.
• TN - 98 donor instances were correctly identified as donors by the model.
• FP - 386 donor instances were incorrectly identified as non-donors by the model.
• FN - 343 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.6 30 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS were the attributes
used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8970 1029 a = 0
1213 240 b = 1
• TP - 8970 non-donor instances were correctly identified as non-donors by the model.
• TN- 240 donor instances were correctly identified as donors by the model.
• FP- 1213 donor instances were incorrectly identified as non-donors by the model.
• FN - 1029 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3305 418 a = 0
391 93 b = 1
• TP - 3305 non-donor instances were correctly identified as non-donors by the model.
• TN- 93 donor instances were correctly identified as donors by the model.
• FP- 391 donor instances were incorrectly identified as non-donors by the model.
• FN - 418 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 30 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3332 391 a = 0
377 107 b = 1
• TP - 3332 non-donor instances were correctly identified as non-donors by the model.
• TN - 107 donor instances were correctly identified as donors by the model.
• FP - 377 donor instances were incorrectly identified as non-donors by the model.
• FN - 391 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.7 35 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT were the attributes used to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8863 1136 a = 0
1187 266 b = 1
• TP - 8863 non-donor instances were correctly identified as non-donors by the model.
• TN- 266 donor instances were correctly identified as donors by the model.
• FP- 1187 donor instances were incorrectly identified as non-donors by the model.
• FN - 1136 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3274 449 a = 0
380 104 b = 1
• TP - 3274 non-donor instances were correctly identified as non-donors by the model.
• TN- 104 donor instances were correctly identified as donors by the model.
• FP- 380 donor instances were incorrectly identified as non-donors by the model.
• FN - 449 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 35 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3296 427 a = 0
372 112 b = 1
• TP - 3296 non-donor instances were correctly identified as non-donors by the model.
• TN - 112 donor instances were correctly identified as donors by the model.
• FP - 372 donor instances were incorrectly identified as non-donors by the model.
• FN - 427 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
6.8 40 Attributes
IC5, ZIP, POP901, AVGGIFT, HV1, HV2, IC4, IC3, IC2, IC1, RAMNTALL, DOB, OSOURCE,
RFA_4, RFA_6, RFA_8, RFA_3, FISTDATE, RFA_12, MAXRAMNT, RFA_2, RFA_9,
RFA_7, RFA_11, RFA_2A, RFA_2F, HVP2, HVP6, RFA_18, WWIIVETS, HHAS3, HUR2,
NGIFTALL, HVP5, CARDGIFT, LASTGIFT, AFC1, AFC2, IC23, IC14 were the attributes used
to create the model.
NaiveBayes 10 Fold Cross Validation Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used to generate the model.
Test Options: NaiveBayes algorithm with 10-fold cross-validation.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
8835 1164 a = 0
1185 268 b = 1
• TP - 8835 non-donor instances were correctly identified as non-donors by the model.
• TN- 268 donor instances were correctly identified as donors by the model.
• FP- 1185 donor instances were incorrectly identified as non-donors by the model.
• FN - 1164 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Test Set Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the test dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3259 464 a = 0
374 110 b = 1
• TP - 3259 non-donor instances were correctly identified as non-donors by the model.
• TN- 110 donor instances were correctly identified as donors by the model.
• FP- 374 donor instances were incorrectly identified as non-donors by the model.
• FN - 464 non-donor instances were incorrectly identified as donors by the model.
NaiveBayes 10 Fold Cross Validation with Evaluation Set Statistics for 40 attributes
Algorithm: NaiveBayes algorithm was used.
Test Options:
• NaiveBayes algorithm with 10-fold cross-validation.
• In addition, the supplied test set was used to run the model on the evaluation dataset.
Confusion Matrix: Following was the observed confusion matrix for the model generated.
a b Classified as
3290 433 a = 0
368 116 b = 1
• TP - 3290 non-donor instances were correctly identified as non-donors by the model.
• TN - 116 donor instances were correctly identified as donors by the model.
• FP - 368 donor instances were incorrectly identified as non-donors by the model.
• FN - 433 non-donor instances were incorrectly identified as donors by the model.
Note:
• The variation in accuracy between the test dataset and the evaluation dataset is not
very high.
• Hence the model does not overfit the data.
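Collecting the cross-validation confusion matrices of this section, the number of donors correctly identified (TN in this report's labelling, out of 1,453 donors) grows with the number of attributes. An illustrative summary script, with the counts taken from the matrices above:

```python
# Donors correctly identified in the 10-fold cross-validation runs, keyed by
# attribute count (values copied from the confusion matrices in this section).
donors_found = {5: 38, 10: 124, 15: 102, 20: 144, 25: 214, 30: 240, 35: 266, 40: 268}

TOTAL_DONORS = 1453
for n_attrs, found in donors_found.items():
    print(f"{n_attrs:>2} attributes: {found:>3}/{TOTAL_DONORS} donors found "
          f"({found / TOTAL_DONORS:.1%})")
# Adding attributes generally raises donor detection (2.6% with 5 attributes,
# 18.4% with 40), at the cost of more non-donors being flagged as donors.
```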
7.0 Performance Metrics
7.1 Calculations for Each Model – Precision and Sensitivity
The confusion matrices and the resulting precision and recall values for Models 1-7 are
summarized below. As in the rest of this report, class 0 (non-donor) is treated as the
positive class, and each confusion matrix is written as "a = 0 row / b = 1 row".

Precision (PPV) = TP / (TP + FP)
Sensitivity (Recall) = TP / (TP + FN)

Model                                  Confusion matrix       PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.1 Model-1: NaiveBayes 10-fold CV     8843 1156 / 1178 275   0.882   0.192   0.884      0.189
5.2 Model-2: NaiveBayes CV, test set   3265 458 / 374 110     0.897   0.194   0.877      0.227
    NaiveBayes CV, evaluation set      3289 434 / 367 117     0.900   0.212   0.883      0.242
5.3-5.5 Models 3-5 (Decision Stump and J48Graft): all predicted only the majority class
    test/evaluation-set runs (x2)      3723 0 / 484 0         0.885   n/a     1.000      0.000
    full-dataset runs (x3)             9999 0 / 1453 0        0.873   n/a     1.000      0.000
5.6 Model-6: OneR CV                   9571 428 / 1406 47     0.872   0.099   0.957      0.032
5.7 Model-7: OneR CV, test set         3564 159 / 455 29      0.887   0.154   0.957      0.060
    OneR CV, evaluation set            3566 157 / 454 30      0.887   0.160   0.958      0.062

(n/a: PPV is undefined when no instance is predicted as class 1, since TP + FP = 0.)
The same calculations for Models 8-15 (confusion matrices written as "a = 0 row / b = 1 row"):

Model                                      Confusion matrix      PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.8 Model-8: ZeroR CV                      9999 0 / 1453 0       0.873   n/a     1.000      0.000
5.9 Model-9: ZeroR CV, test set            3723 0 / 484 0        0.885   n/a     1.000      0.000
    ZeroR CV, evaluation set               3723 0 / 484 0        0.885   n/a     1.000      0.000
5.10 Model-10: NaiveBayes CV, 5 attrs      9795 204 / 1415 38    0.874   0.157   0.980      0.026
5.11 Model-11: NB CV, test set, 5 attrs    3649 74 / 473 11      0.885   0.129   0.980      0.023
    NB CV, evaluation set, 5 attrs         3651 72 / 471 13      0.886   0.153   0.981      0.027
5.12 Model-12: NaiveBayes CV, 10 attrs     9320 679 / 1329 124   0.875   0.154   0.932      0.085
5.13 Model-13: NB CV, test set, 10 attrs   3472 251 / 438 46     0.888   0.155   0.933      0.095
    NB CV, evaluation set, 10 attrs        3465 258 / 439 46     0.888   0.151   0.931      0.095
5.14 Model-14: NaiveBayes CV, 15 attrs     9493 506 / 1351 102   0.875   0.168   0.949      0.070
5.15 Model-15: NB CV, test set, 15 attrs   3525 198 / 442 42     0.889   0.175   0.947      0.087
    NB CV, evaluation set, 15 attrs        3514 209 / 445 39     0.888   0.157   0.944      0.081

(n/a: PPV is undefined when no instance is predicted as class 1, since TP + FP = 0.)
The same calculations for Models 16-25 (confusion matrices written as "a = 0 row / b = 1 row"):

Model                                      Confusion matrix       PPV(0)  PPV(1)  Recall(0)  Recall(1)
5.16 Model-16: NaiveBayes CV, 20 attrs     9405 594 / 1309 144    0.878   0.195   0.941      0.099
5.17 Model-17: NB CV, test set, 20 attrs   3477 246 / 431 53      0.890   0.177   0.934      0.110
    NB CV, evaluation set, 20 attrs        3484 239 / 431 53      0.890   0.182   0.936      0.110
5.18 Model-18: NaiveBayes CV, 25 attrs     9066 933 / 1239 214    0.880   0.187   0.907      0.147
5.19 Model-19: NB CV, test set, 25 attrs   3350 373 / 408 76      0.891   0.169   0.900      0.157
    NB CV, evaluation set, 25 attrs        3380 343 / 386 98      0.898   0.222   0.908      0.202
5.20 Model-20: NaiveBayes CV, 30 attrs     8970 1029 / 1213 240   0.881   0.189   0.897      0.165
5.21 Model-21: NB CV, test set, 30 attrs   3305 418 / 391 93      0.894   0.182   0.888      0.192
    NB CV, evaluation set, 30 attrs        3332 391 / 377 107     0.898   0.215   0.895      0.221
5.22 Model-22: NaiveBayes CV, 35 attrs     8863 1136 / 1187 266   0.882   0.190   0.886      0.183
5.23 Model-23: NB CV, test set, 35 attrs   3274 449 / 380 104     0.896   0.188   0.879      0.215
    NB CV, evaluation set, 35 attrs        3296 427 / 372 112     0.899   0.208   0.885      0.231
5.24 Model-24: NaiveBayes CV, 40 attrs     8835 1164 / 1185 268   0.882   0.187   0.884      0.184
5.25 Model-25: NB CV, test set, 40 attrs   3259 464 / 374 110     0.897   0.192   0.875      0.227
    NB CV, evaluation set, 40 attrs        3290 433 / 368 116     0.899   0.211   0.884      0.240
7.2 Calculations for Each Model – Specificity and NPV
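The two metrics tabulated in this section are Specificity = TN / (TN + FP) and NPV = TN / (TN + FN), again with class 0 (non-donor) as the positive class. An illustrative check against the Model-1 and Model-6 values (the function is ours, not Weka output):

```python
# Specificity and Negative Predictive Value for a 2x2 confusion matrix
# laid out as [[TP, FN], [FP, TN]] with class 0 (non-donor) as positive.
def spec_npv(matrix):
    tp, fn = matrix[0]
    fp, tn = matrix[1]
    specificity = tn / (tn + fp)                       # TN / (TN + FP)
    # NPV is undefined (#DIV/0! in the tables) when TN + FN = 0:
    npv = tn / (tn + fn) if tn + fn else float("nan")  # TN / (TN + FN)
    return round(specificity, 3), round(npv, 3)

# Model-1 (NaiveBayes 10-fold cross-validation):
print(spec_npv([[8843, 1156], [1178, 275]]))  # (0.189, 0.192)
# Model-6 (OneR cross-validation):
print(spec_npv([[9571, 428], [1406, 47]]))    # (0.032, 0.099)
```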
Each entry below gives the model number and description, its confusion matrix (rows = actual class, columns = classified as), and the per-class TN, FP and FN counts with the resulting Specificity and NPV. Where a model classifies every record into class 0, TN and FN are both 0 for class 0 and its NPV is undefined (0/0).

5.1 Model-1: NaïveBayes Ten Fold Cross Validation Statistics
    a      b    ← classified as
  8,843  1,156    a = 0
  1,178    275    b = 1
  Class 0: TN 275, FP 1,178, FN 1,156 → Specificity 0.189, NPV 0.192
  Class 1: TN 8,843, FP 1,156, FN 1,178 → Specificity 0.884, NPV 0.882

5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Test set statistics
    a      b    ← classified as
  3,265    458    a = 0
    374    110    b = 1
  Class 0: TN 110, FP 374, FN 458 → Specificity 0.227, NPV 0.194
  Class 1: TN 3,265, FP 458, FN 374 → Specificity 0.877, NPV 0.897

5.2 Model-2: NaiveBayes 10 Fold Cross Validation with Evaluation set statistics
    a      b    ← classified as
  3,289    434    a = 0
    367    117    b = 1
  Class 0: TN 117, FP 367, FN 434 → Specificity 0.242, NPV 0.212
  Class 1: TN 3,289, FP 434, FN 367 → Specificity 0.883, NPV 0.900

5.3 Model-3: J48Graft Training Set Model with Test set statistics
    a      b    ← classified as
  3,723      0    a = 0
    484      0    b = 1
  Class 0: TN 0, FP 484, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 3,723, FP 0, FN 484 → Specificity 1.000, NPV 0.885

5.3 Model-3: J48Graft Training Set Model with Evaluation set statistics
    a      b    ← classified as
  3,723      0    a = 0
    484      0    b = 1
  Class 0: TN 0, FP 484, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 3,723, FP 0, FN 484 → Specificity 1.000, NPV 0.885

5.4 Model-4: Decision Stump Cross Validation Model Statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.5 Model-5: Decision Stump Cross Validation Model with Test set statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.5 Model-5: Decision Stump Cross Validation Model with Evaluation set statistics
    a      b    ← classified as
  9,999      0    a = 0
  1,453      0    b = 1
  Class 0: TN 0, FP 1,453, FN 0 → Specificity 0.000, NPV undefined (0/0)
  Class 1: TN 9,999, FP 0, FN 1,453 → Specificity 1.000, NPV 0.873

5.6 Model-6: OneR Cross Validation Model Statistics
    a      b    ← classified as
  9,571    428    a = 0
  1,406     47    b = 1
  Class 0: TN 47, FP 1,406, FN 428 → Specificity 0.032, NPV 0.099
  Class 1: TN 9,571, FP 428, FN 1,406 → Specificity 0.957, NPV 0.872

5.7 Model-7: OneR Cross Validation Model with Test Set Statistics
    a      b    ← classified as
  3,564    159    a = 0
    455     29    b = 1
  Class 0: TN 29, FP 455, FN 159 → Specificity 0.060, NPV 0.154
  Class 1: TN 3,564, FP 159, FN 455 → Specificity 0.957, NPV 0.887

5.7 Model-7: OneR Cross Validation Model with Evaluation Set Statistics
    a      b    ← classified as
  3,566    157    a = 0
    454     30    b = 1
  Class 0: TN 30, FP 454, FN 157 → Specificity 0.062, NPV 0.160
  Class 1: TN 3,566, FP 157, FN 454 → Specificity 0.958, NPV 0.887
NPV = TN / (TN + FN)
Specificity = TN / (TN + FP)
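The Specificity and NPV columns follow from these two formulas in the same way. The sketch below (illustrative, not part of the original analysis) also handles the degenerate case seen with the J48Graft and Decision Stump models: when every record is classified into one class, TN + FN is 0 for the other class and NPV is undefined.

```python
def specificity_npv(cm):
    """Per-class (specificity, NPV) for a 2x2 confusion matrix.

    cm[actual][predicted]; an undefined 0/0 ratio is returned as None.
    """
    results = []
    for c in (0, 1):
        tn = cm[1 - c][1 - c]  # other class correctly classified
        fp = cm[1 - c][c]      # other class wrongly classified as c
        fn = cm[c][1 - c]      # class c wrongly classified as the other
        spec = tn / (tn + fp) if (tn + fp) else None
        npv = tn / (tn + fn) if (tn + fn) else None
        results.append((spec, npv))
    return results

# J48Graft test-set matrix: every record classified as a=0
cm = [[3723, 0], [484, 0]]
fmt = lambda v: "undefined" if v is None else f"{v:.3f}"
for c, (spec, npv) in enumerate(specificity_npv(cm)):
    print(f"class {c}: Specificity={fmt(spec)} NPV={fmt(npv)}")
# class 0: Specificity=0.000 NPV=undefined
# class 1: Specificity=1.000 NPV=0.885
```

This reproduces the table row for that model: Specificity 0.000 / 1.000, with class 0's NPV undefined and class 1's NPV 0.885.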
Page 77 | 100