1
Predicting Likely Donors
and
Donation Amounts
Predictive Analytics
Michele Vincent
March 22, 2017
2
Goals
• Predict likely donors using classification models
• Predict how much likely donors will give using regression
models
• Validate predictive models by measuring how effective the models
are
Objectives
3
Training Data
• Filename: cup98lrn variable subset small.txt
• # Records: 50% of 95,412 (47,706)
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders
Testing Data
• Filename: cup98lrn variable subset small.txt
• # Records: 50% of 95,412 (47,706)
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders
Data Used for Training and Testing of Classification Models
TARGET_B is a binary indicator for response to the 97NK mailing
4
Validation Data
• Filename: cup98val_variable_subset_small.csv
• # Records: 96,367
• Target Variable: TARGET_B
– Classification (Y/N decision, a donor or not)
– 5% responders
Data Used for Validation of Classification Models
TARGET_B is a binary indicator for response to the 97NK mailing
5
– Based on AUC, the best model is Logistic Regression, which produces the highest AUC. It
correctly classified 58.7% of records, which is not the highest accuracy but is among the
highest. Its precision is the second highest. The lift it provides at 10% and at 70% of the
file is among the highest of all models tested.
Classification Model: Accuracy
• Best Performing Algorithm
Stratified and equal size sampling were used for all models tested below.
6
– The accuracy rate of the Logistic Regression model is 58.7%. This means we correctly
classified 58.7% of the file. Out of 100 records, about 59 were correctly classified as
donors or non-donors.
– Precision is the share of predicted donors who are actual donors. If 100 donors were
predicted but only 7 of them are actual donors, then the precision is 7%.
– Recall is the share of actual donors that we predicted correctly. If there were 100 actual
donors and we predicted 58 of them correctly, then the recall rate is 58%. Sensitivity is
another name for recall.
– A false alarm means we predicted that a person is a donor, but he was not. If there were
100 non-donors and we claimed 41 of them to be donors, then our false alarm rate is 41%. In
this example, we would correctly claim that the other 59 are not donors, so the specificity
is 59%.
Classification Model: Accuracy (Cont’d)
• Logistic Regression
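The rates defined on this slide can be sketched in a few lines of Python. The counts below are hypothetical (a balanced sample of 100 actual donors and 100 non-donors, matching the recall and false alarm examples above); they are not the model's actual confusion matrix, and the 7% precision illustration is a separate example that these counts do not reproduce.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the rates discussed above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,         # share of all records classified correctly
        "precision": tp / (tp + fp),           # share of predicted donors who are donors
        "recall": tp / (tp + fn),              # share of actual donors found (sensitivity)
        "false_alarm_rate": fp / (fp + tn),    # share of non-donors flagged as donors
        "specificity": tn / (fp + tn),         # share of non-donors correctly rejected
    }


# Hypothetical counts: 100 actual donors (58 found), 100 non-donors (41 flagged).
metrics = classification_metrics(tp=58, fp=41, tn=59, fn=42)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false alarm rate and the specificity always add up to 1, which is why 41% and 59% appear together in the example.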
7
Classification Model: Accuracy (Cont’d)
• ROC Curve for the winning model (Logistic Regression):
– The ROC curve shows an area under the curve of 0.6110, the largest among the models
tested.
– The curve is also useful for reading off the true alarm rate we can get for a given
accepted false alarm rate.
– If we are willing to accept a 37% false alarm rate, we can get a true alarm rate of 55%
(dotted line on the graph). In other words, if we allow the model to wrongly flag 37 out of
every 100 non-donors as donors, it will correctly identify 55 out of every 100 actual
donors.
[ROC curve: x-axis False Alarm, y-axis True Alarm]
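As a rough illustration of how an ROC curve and its area are obtained, the sketch below sweeps a threshold over predicted scores and applies the trapezoid rule. The labels and scores are made up for the example, and the function names are hypothetical; this is not the tool output shown on the slide.

```python
def roc_points(labels, scores):
    """Return (false alarm rate, true alarm rate) points, sweeping the threshold downward."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def true_alarm_at(points, max_false_alarm):
    """Best true alarm rate reachable without exceeding the accepted false alarm rate."""
    return max(tpr for fpr, tpr in points if fpr <= max_false_alarm)


# Made-up example: 1 = donor, 0 = non-donor.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc, true_alarm_at(points, 0.4))
```

Reading the curve at an accepted false alarm rate, as in the 37% and 55% example, corresponds to the `true_alarm_at` lookup.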
8
Classification Model: Accuracy (Cont’d)
• Lift Chart for the winning model (Logistic Regression)
– The lift at 10% of the file is 1.815; at 70%, the cumulative lift is 1.12.
– This means that if we mailed to the top 10% of the file, which contains the predictions
with the highest probabilities, we can get about 1.8 times as many donors as we would
from the same number of mailings chosen without a model (lift of 1.0).
[Lift chart: the y-axis shows lift (1.815) and the x-axis shows percentage of file (10).]
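The cumulative lift described above can be sketched as the response rate in the top-scored part of the file divided by the response rate of the whole file. The data below is a made-up file of 10 records, and `cumulative_lift` is a hypothetical helper, not a function from the tool used for the slides.

```python
def cumulative_lift(labels, scores, fraction):
    """Response rate in the top `fraction` of the file divided by the overall rate."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    top_n = max(1, round(fraction * len(ranked)))
    top_rate = sum(ranked[:top_n]) / top_n
    overall_rate = sum(ranked) / len(ranked)
    return top_rate / overall_rate


# Made-up file: 10 records, 2 donors, already listed from highest to lowest score.
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

lift_top_10 = cumulative_lift(labels, scores, 0.10)
lift_top_50 = cumulative_lift(labels, scores, 0.50)
lift_all = cumulative_lift(labels, scores, 1.00)
print(lift_top_10, lift_top_50, lift_all)
```

Mailing the whole file always gives a lift of 1.0, which is the "no model" baseline the slide compares against.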
9
Classification Model: Accuracy (Cont’d)
• Histogram for the winning model (Logistic Regression)
– The histogram of the distribution of predictions shows:
• 921 are predicted with probability to respond between 0.507 and 0.608
• 429 are predicted with probability to respond between 0.608 and 0.709
– This means that if we are comfortable mailing only to those with a predicted probability
greater than 0.6, the mailing list would include at least the 429 people in the 0.608 to
0.709 range.
[Histogram of predicted probabilities; bar counts shown: 429, 921, 748, 305]
10
– We validated the results of our best model using a different set of data. The results are
very close to the results discussed previously.
Classification Model: Accuracy (Cont’d)
• Classification Accuracy
Using Validation Data
11
Classification Model: Interpretation
• Best Predictors
– RFA_2F* was chosen by five models as their top predictor of a donor. To differentiate
between a non-donor and a donor, RFA_2F is the best variable to use.
– E_RFA_2A was chosen by three models as one of their top three predictors.
– FISTDATE and NGIFTALL were each chosen by two models as one of their top three
predictors.
* Frequency code for donor’s RFA status as of 97NK promotion date
12
Classification Model: Interpretation (Cont’d)
• Details of Best Predictors for Some of the Models Used:
– Three predictors appeared among the top predictors of all three of the Logistic
Regression, Neural Network and kNN (k = 101) models:
  – RFA_2F
  – E_RFA_2A
  – FISTDATE
13
Classification Model: Interpretation (Cont’d)
• Relationship of Predictors to Target Variable:
– LASTGIFT’s P-value is not significant, so it is not an influential variable. The amount
a donor gave last time does not help in predicting whether he will be a donor again.
– FISTDATE and DOMAIN3 have a negative relationship with the target variable. The smaller
(or less recent) the FISTDATE is, the more likely the person is to be a responder. The
likely donor is someone who has not given recently, and is not from the lowest
socio-economic status.
– RFA_2F, D_RFA_2A, E_RFA_2A and DOMAIN1 have a positive relationship with the target
variable. The bigger these variables are, the more likely the outcome Donor=Y. The likely
donor is someone who is a frequent giver and comes from the highest socio-economic status.
– D_RFA_2A has a higher coefficient than the other predictors, which means it has a larger
influence on our prediction that someone is a donor. So, the donor’s RFA status as of the
97NK promotion is more influential than his RFA status as of the 96NK or 95NK promotions.
[Table: Logistic Regression output for Donor=Y]
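To illustrate what a positive or negative relationship means in a logistic regression, the sketch below pushes a weighted sum of predictors through the logistic function. The intercept, coefficients and input values are hypothetical and chosen only to show the direction of each effect; they are not the fitted values from the model on this slide.

```python
import math


def donor_probability(intercept, coefficients, values):
    """Logistic regression prediction: P(Donor=Y) from a weighted sum of predictors."""
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients: positive for RFA_2F, negative for FISTDATE.
coefficients = {"RFA_2F": 0.3, "FISTDATE": -0.2}

low_frequency = donor_probability(-1.0, coefficients, {"RFA_2F": 1, "FISTDATE": 0})
high_frequency = donor_probability(-1.0, coefficients, {"RFA_2F": 4, "FISTDATE": 0})
larger_fistdate = donor_probability(-1.0, coefficients, {"RFA_2F": 1, "FISTDATE": 2})

print(low_frequency, high_frequency, larger_fistdate)
```

With a positive coefficient, raising the predictor raises the predicted probability; with a negative coefficient, raising it lowers the probability.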
14
Classification Model: Conclusion
• The donor’s frequency of giving is the most influential variable for distinguishing a
donor from a non-donor.
• The likely donor is someone who has not given recently but is a frequent giver, and comes
from the highest socio-economic status.
• If the Logistic Regression model is used in the next campaign, and we know the model gives
a 1.2 cumulative lift at 70% of the file, we can expect 4,200 responders out of 70,000
mailings. That is a response rate of 6%, instead of the 5% we get without a model, and we
save the cost of 30,000 mailings.
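The arithmetic behind this estimate can be checked with a short sketch. The file size of 100,000 is an assumption implied by the 70,000 mailed and 30,000 saved; the 5% base rate and the 1.2 lift are the figures quoted above.

```python
file_size = 100_000        # assumption: implied by 70,000 mailed + 30,000 not mailed
base_rate = 0.05           # response rate without a model
mailed_fraction = 0.70     # mail the top 70% of the file
cumulative_lift = 1.2      # lift quoted for 70% of the file

mailed = round(file_size * mailed_fraction)
expected_responders = round(mailed * base_rate * cumulative_lift)
response_rate = expected_responders / mailed
mailings_saved = file_size - mailed

print(mailed, expected_responders, response_rate, mailings_saved)
```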
15
Training Data
• Filename: cup98lrn variable subset small responders.txt
• # Records: 50% of 4,873
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders
Testing Data
• Filename: cup98lrn variable subset small responders.txt
• # Records: 50% of 4,873
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders
Data Used for Training and Testing of Regression Models
TARGET_D is the donation amount associated with the response to the 97NK mailing
16
Validation Data
• Filename: cup98val_variable_subset_small.csv
• # Records: 96,367
• Target Variable: TARGET_D
– Average Continuous Value (donation amount)
– 5% responders
Data Used for Validation of Regression Models
TARGET_D is the donation amount associated with the response to the 97NK mailing
17
Regression Model: Accuracy
– Using the original target variable, SVM wins. It has the highest R-squared value, and also
a smaller mean absolute error, mean squared error and root mean squared deviation than
Linear Regression and Neural Net.
– Using the transformed target variable, Linear Regression (LR) wins based on R-squared,
although SVM is close.
[Results table not reproduced; its callouts mark the lowest error values and the highest
R-squared values, including one lowest value tied with SVM.]
• Best Performing Algorithm
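The four measures used to compare the regression models can be sketched as follows. The actual and predicted donation amounts are made up for the example, and `regression_metrics` is a hypothetical helper rather than the tool's own scorer.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE (root mean squared deviation) and R-squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sum_squared_errors = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    total_sum_of_squares = sum((a - mean_actual) ** 2 for a in actual)
    mse = sum_squared_errors / n
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r_squared": 1 - sum_squared_errors / total_sum_of_squares,
    }


# Made-up donation amounts in dollars.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
scores = regression_metrics(actual, predicted)
print(scores)
```

Lower is better for the three error measures, while higher is better for R-squared, which is why the best model is marked "Lowest" in some columns and "Highest" in another.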
18
Regression Model: Accuracy (Cont’d)
– While there is not much difference between the models, SVM provided the highest lift,
2.329 in the top decile, with an overall average donation of $16 and a total dollar amount
of $8,921, using the transformed target variable.
• Best Performing Algorithm
[Results table not reproduced; its callouts mark the highest lift, the highest total
amount, the average donation and the top decile.]
19
Regression Model: Accuracy (Cont’d)
• Regression Accuracy (Results on Validation Data)
[Results table not reproduced; its callouts mark the highest lift, the highest total
amount, the average donation and the top decile.]
– Using the validation data of about 96,000 responders and non-responders, with the target
variable transformed, SVM provided the highest lift of all the models: 1.692 in the top
decile, with an overall average donation of $0.79 and a total dollar amount of $12,330.
– This tells us that if we mailed to the top decile, we can expect $12,330.
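A top-decile lift for donation amounts can be sketched as below, under the assumption that lift here means the dollars collected from the top-ranked 10% divided by the dollars expected from a random 10% of the same file. The amounts are made up, and `top_decile_summary` is a hypothetical helper. The last line applies the same ratio to the $8,921 (with the model) and $3,822 (without a model) totals reported in the regression conclusion of this deck, which gives roughly the quoted lift of 2.329.

```python
def top_decile_summary(actual_amounts, predicted_amounts):
    """Total actual dollars in the top 10% by predicted amount, and the resulting lift."""
    ranked = [actual for _, actual in
              sorted(zip(predicted_amounts, actual_amounts),
                     key=lambda pair: pair[0], reverse=True)]
    decile_size = max(1, round(len(ranked) / 10))
    top_total = sum(ranked[:decile_size])
    # Dollars a random tenth of the file would bring in on average.
    baseline_total = sum(ranked) * decile_size / len(ranked)
    return {
        "decile_size": decile_size,
        "top_total": top_total,
        "baseline_total": baseline_total,
        "lift": top_total / baseline_total,
    }


# Made-up file of 10 people, listed from highest to lowest predicted donation.
predicted = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
actual = [50, 10, 10, 10, 5, 5, 5, 2, 2, 1]
summary = top_decile_summary(actual, predicted)
print(summary)

# Ratio of the reported totals with and without a model (about 2.33).
reported_ratio = 8921 / 3822
print(round(reported_ratio, 3))
```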
20
Regression Model: Interpretation
• Best Predictors from SVM:
– Shown above is the KNIME Linear Correlation output used to find the best predictors. The
output is sorted by prediction value, and the best predictors have the highest values. The
top two variables are LASTGIFT (log10-transformed), with a prediction value of 0.971, and
AVGGIFT, with a prediction value of 0.644.
– It makes sense that, to predict how much will be donated, it is important to consider
first how much the last donation was and what the donor’s average donation has been so far.
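As a rough illustration of ranking predictors by linear correlation, the sketch below computes a Pearson correlation coefficient between a predictor and the target, once on the raw predictor and once after a log10 transform. It assumes the values reported on the slide are correlation-type scores; the gift amounts are made up and the helper name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up last gift amounts and donation amounts, in dollars.
last_gift = [5.0, 10.0, 20.0, 50.0]
donation = [5.0, 12.0, 18.0, 45.0]
log_last_gift = [math.log10(value) for value in last_gift]

r_raw = pearson_correlation(last_gift, donation)
r_log = pearson_correlation(log_last_gift, donation)
print(round(r_raw, 3), round(r_log, 3))
```

Sorting the predictors by such a score, from highest to lowest, gives the kind of ranking described above.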
21
Regression Model: Interpretation (Cont’d)
• Best Predictors from SVM Model:
– The scatter plot of AVGGIFT, one of the best predictors from SVM, against the target
variable TARGET_D is shown on the left. It shows a linear relationship between AVGGIFT
and TARGET_D.
– As the average gift of a donor increases, the donation amount that he will give also
increases.
22
Regression Model: Interpretation (Cont’d)
• Best Predictors from Linear Regression:
– The best predictors chosen by Linear Regression, based on significant P-values, are
LASTGIFT, RFA_2F, F_RFA_2A, E_RFA_2A and G_RFA_2A.
– LASTGIFT has a positive coefficient, which means it is positively related to the target
variable. Because the target is the donation amount (TARGET_D), the larger a donor’s last
gift, the larger the donation he is predicted to give.
– RFA_2F is negatively related to the target variable. This means that the more frequently
a donor gave before, the smaller his donation.
– F_RFA_2A, E_RFA_2A and G_RFA_2A are positively related to the target variable. They add
0.086, 0.052 and 0.13, respectively, to the donation.
[Table: Linear Regression output with P-values]
23
Regression Model: Interpretation (Cont’d)
• Best Predictors from Linear Regression (Cont’d):
– The scatter plot of LASTGIFT, one of the best predictors from Linear Regression, against
the target variable TARGET_D is shown on the left. It shows a linear relationship between
LASTGIFT and TARGET_D.
– This confirms the positive relationship indicated by the LASTGIFT coefficient: as the
last gift of a donor increases, the donation amount that he will give also increases.
– The scatter plot also supports the appropriateness of a linear regression function.
24
Regression Model: Conclusion
• The donor’s last gift amount and average gift amount are the most influential variables
for determining how much a donor will donate again.
• If the SVM model is used in the next campaign, mailing only to responders from the
previous campaign, and we know the model gives a cumulative lift of 2.329 in the top
decile with an overall average donation of $16, we can expect a total donation of $8,921.
Without a model, the total donation is $3,822.
• If the SVM model is used in the next campaign, mailing to all responders and
non-responders from the previous campaign, and we know the model gives a cumulative lift
of 1.692 in the top decile with an overall average donation of $0.79, we can expect a
total donation of $12,330.
25
Conclusion
• My recommendation is to use the Logistic Regression model if all we want is to tell a
donor from a non-donor. If we are also interested in the donation amount, my
recommendation is to use the SVM model, although the Linear Regression model will work
just as well.
• The validation dataset provides the best estimate of the money the best model will
generate because it includes people who did not donate, which is realistic. The validation
data gives an average donation of $1.28 per person (from $12,330 / 9,637 people in the 1st
decile). However, if we just want the highest average donation, the test data gives about
$36.56 per person (from $8,921 / 244 people in the 1st decile).
26
Appendix
• Independent variables (predictor variables) used in the
models

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 

Predicting Donor Amounts Using Classification and Regression Models

  • 1. Predicting Likely Donors and Donation Amounts. Predictive Analytics. Michele Vincent, March 22, 2017
  • 2. Objectives: Goals • Predict likely donors using classification models • Predict how much likely donors will give using regression models • Validate the predictive models by measuring how effective they are
  • 3. Data Used for Training and Testing of Classification Models. Training Data • Filename: cup98lrn variable subset small.txt • # Records: 50% of 95,412 (47,706) • Target Variable: TARGET_B – Classification (Y/N decision, a donor or not) – 5% responders. Testing Data • Filename: cup98lrn variable subset small.txt • # Records: 50% of 95,412 (47,706) • Target Variable: TARGET_B – Classification (Y/N decision, a donor or not) – 5% responders. TARGET_B is a binary indicator for response to the 97NK mailing.
  • 4. Data Used for Validation of Classification Models. Validation Data • Filename: cup98val_variable_subset_small.csv • # Records: 96,367 • Target Variable: TARGET_B – Classification (Y/N decision, a donor or not) – 5% responders. TARGET_B is a binary indicator for response to the 97NK mailing.
  • 5. Classification Model: Accuracy • Best Performing Algorithm. Stratified and equal-size sampling were used for all models tested. Based on AUC, the best model is Logistic Regression, which generates the highest AUC. It correctly classified 58.7% of records, which is not the highest accuracy but is among the highest. Its precision is the 2nd highest. The lift it provides at 10% and at 70% of the file is among the highest of all models tested.
  • 6. Classification Model: Accuracy (Cont’d) • Logistic Regression. The accuracy rate of the Logistic Regression model is 58.7%. This means we correctly classified 58.7% of the file: out of 100 records, about 58 were correctly classified as donors or non-donors. Precision is the share of predicted donors who are actual donors. If 100 donors were predicted but only 7 of them are actual donors, the precision is 7%. Recall is the share of actual donors that we predicted correctly. If there were 100 actual donors and we predicted 58 of them correctly, the recall is 58%. Recall is also called sensitivity. A false alarm means we predicted that a person is a donor when he was not. If there were 100 non-donors and we claimed 41 of them to be donors, our false alarm rate is 41%. In this example we would correctly claim that the other 59 are not donors, so the specificity is 59%.
  • 7. Classification Model: Accuracy (Cont’d) • ROC Curve for the winning model (Logistic Regression). The ROC curve (x-axis: false alarm rate; y-axis: true alarm rate) shows an area under the curve of 0.6110, the largest among the models tested. The curve is also useful for reading off the true alarm rate we can get for a given acceptable false alarm rate. If we are willing to accept a 37% false alarm rate, we can get a true alarm rate of 55% (dotted line on the graph). In other words, if we allow the model to wrongly flag 37 of every 100 non-donors as donors, it will correctly identify 55 of every 100 actual donors.
  • 8. Classification Model: Accuracy (Cont’d) • Lift Chart for the winning model (Logistic Regression). The lift at 10% of the file is 1.815; at 70%, the cumulative lift is 1.12. This means that if we mailed to the top 10% of the file, which contains the predictions with the highest probabilities, we can expect about 1.8 times as many donors as we would get from mailing the same number of people without a model (a lift of 1.0). On the chart, the y-axis shows the lift (1.815) and the x-axis shows the percentage of the file (10).
  • 9. Classification Model: Accuracy (Cont’d) • Histogram for the winning model (Logistic Regression). The histogram of the distribution of predictions shows: • 921 records are predicted with a probability to respond between 0.507 and 0.608 • 429 records are predicted with a probability to respond between 0.608 and 0.709. This means that if we are comfortable mailing only to those with a predicted probability greater than about 0.6, at least 429 people meet that threshold. (Bar labels on the histogram: 429, 921, 748, 305.)
  • 10. Classification Model: Accuracy (Cont’d) • Classification Accuracy Using Validation Data. We validated the results of our best model using a different set of data. The results are very close to the results previously discussed.
  • 11. Classification Model: Interpretation • Best Predictors. RFA_2F* was chosen by five models as their top predictor of a donor. If we had to differentiate between a non-donor and a donor, RFA_2F is the best variable to use. E_RFA_2A was chosen by three models as one of their top three predictors. FISTDATE and NGIFTALL were each chosen by two models as one of their top three predictors. (* Frequency code for donor’s RFA status as of the 97NK promotion date.)
  • 12. Classification Model: Interpretation (Cont’d) • Details of Best Predictors for Some of the Models Used. Three common predictors appeared among the top predictors from the Logistic Regression, Neural Network and kNN (k = 101) models: RFA_2F, E_RFA_2A and FISTDATE.
  • 13. Classification Model: Interpretation (Cont’d) • Relationship of Predictors to Target Variable (Donor=Y). LASTGIFT’s P-value is not significant, so it is not an influential variable. How much a donor gave last time does not help in predicting whether he will be a donor again. FISTDATE and DOMAIN3 have a negative relationship with the target variable. The smaller (or less recent) the FISTDATE is, the more likely the person is to be a responder. The likely donor is someone who has not given recently and is not from the lowest socio-economic status. RFA_2F, D_RFA_2A, E_RFA_2A and DOMAIN1 have a positive relationship with the target variable. The bigger these variables are, the more likely the outcome Donor=Y is. The likely donor is someone who is a frequent giver and comes from the highest socio-economic status. D_RFA_2A has a higher coefficient than the other predictors, which means it has a larger influence on our prediction that someone is a donor. So the donor’s RFA status as of the 97NK promotion is more influential than his RFA status as of the 96NK or 95NK promotions.
  • 14. Classification Model: Conclusion • The donor’s frequency of giving is the most influential variable for distinguishing a donor from a non-donor. • The likely donor is someone who has not given recently but is a frequent giver, and comes from the highest socio-economic status. • If the results of the Logistic Regression model are implemented in the next campaign, and we know the model gives a 1.2 cumulative lift at 70% of the file, we can expect 4,200 responders out of 70,000 mailings. That is a response rate of 6%, compared with 5% without a model, and we save the cost of 30,000 mailings.
  • 15. Data Used for Training and Testing of Regression Models. Training Data • Filename: cup98lrn variable subset small responders.txt • # Records: 50% of 4,873 • Target Variable: TARGET_D – Average Continuous Value (donation amount) – 5% responders. Testing Data • Filename: cup98lrn variable subset small responders.txt • # Records: 50% of 4,873 • Target Variable: TARGET_D – Average Continuous Value (donation amount) – 5% responders. TARGET_D is the donation amount associated with the response to the 97NK mailing.
  • 16. Data Used for Validation of Regression Models. Validation Data • Filename: cup98val_variable_subset_small.csv • # Records: 96,367 • Target Variable: TARGET_D – Average Continuous Value (donation amount) – 5% responders. TARGET_D is the donation amount associated with the response to the 97NK mailing.
  • 17. Regression Model: Accuracy • Best Performing Algorithm. Using the original target variable, SVM wins: it has the highest R-squared value, and it also has a smaller mean absolute error, mean squared error and root mean squared deviation than Linear Regression and Neural Net. Using the transformed target variable, Linear Regression (LR) wins based on R-squared, although SVM is close.
  • 18. Regression Model: Accuracy (Cont’d) • Best Performing Algorithm. While there is not much difference between the models, SVM provided the highest lift, 2.329 in the top decile, with an overall average donation of $16 and a total dollar amount of $8,921, using the transformed target variable.
  • 19. Regression Model: Accuracy (Cont’d) • Regression Accuracy (Results on Validation Data). Using the validation data of 96k responders and non-responders with the target variable transformed, SVM provided the highest lift among the models. The lift is 1.692 in the top decile, with an overall average donation of $0.79 and a total dollar amount of $12,330. This tells us that if we mailed to the top decile, we can expect $12,330.
  • 20. Regression Model: Interpretation • Best Predictors from SVM. The slide shows the KNIME Linear Correlation output used to find the best predictors. The best predictors have the highest prediction values, and the output is sorted by prediction value. The top two variables are LASTGIFT (transformed into log10), with a prediction value of 0.971, and AVGGIFT, with a prediction value of 0.644. This makes sense: to predict how much will be donated, it is important to consider first how much the last donation was and what the donor’s average donation has been so far.
  • 21. Regression Model: Interpretation (Cont’d) • Best Predictors from SVM Model. The scatter plot of AVGGIFT, one of the best predictors from SVM, against the target variable TARGET_D shows a linear relationship between the two. As the average gift of a donor increases, the donation amount that he will give also increases.
  • 22. Regression Model: Interpretation (Cont’d) • Best Predictors from Linear Regression. The best predictors chosen by Linear Regression, based on significant P-values, are LASTGIFT, RFA_2F, F_RFA_2A, E_RFA_2A and G_RFA_2A. LASTGIFT has a positive coefficient, which means it is positively related to the target variable. The larger the last gift of a donor, the larger the probability that he will give again. If he gave a lot before, he is likely to give again. RFA_2F is negatively related to the target variable. This means that the more frequently a donor gave before, the smaller his donation. F_RFA_2A, E_RFA_2A and G_RFA_2A are positively related to the target variable. They add 0.086, 0.052 and 0.13, respectively, to the donation.
  • 23. Regression Model: Interpretation (Cont’d) • Best Predictors from Linear Regression (Cont’d). The scatter plot of LASTGIFT, one of the best predictors from Linear Regression, against the target variable TARGET_D shows a linear relationship between the two. It confirms what we found earlier, that the larger the last gift of a donor, the larger the probability that he will give again, and it adds to it: as the last gift of a donor increases, the donation amount that he will give also increases. The scatter plot also shows that a linear regression function is appropriate.
  • 24. Regression Model: Conclusion • The donor’s last gift amount and average gift amount are the most influential variables for determining how much a donor will donate again. • If the results of the SVM model are implemented in the next campaign, mailing only to the responders from the previous campaign, and we know the model gives a cumulative lift of 2.329 in the top decile with an overall average donation of $16, we can expect a total donation of $8,921. Without a model, the total donation is $3,822. • If the results of the SVM model are implemented in the next campaign, mailing to all responders and non-responders from the previous campaign, and we know the model gives a cumulative lift of 1.692 in the top decile with an overall average donation of $0.79, we can expect a total donation of $12,330.
  • 25. Conclusion • My recommendation is to use the Logistic Regression model if all we want is to distinguish a donor from a non-donor. If we are also interested in the donation amount, my recommendation is to use the SVM model, although the Linear Regression model will work just as well. • The validation dataset provides the best estimate of the money the best model will generate, because it includes people who did not donate, which is realistic. The validation data gives an overall average donation of $1.28 per person (from $12,339 / 9,847 people in the 1st decile). However, if we just want the highest average donation, the test data gives $36 per person (from $8,921 / 244 people in the 1st decile).
  • 26. Appendix • Independent variables (predictor variables) used in the models
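
The definitions on slide 6 (accuracy, precision, recall, false alarm rate, specificity) can be sketched in a few lines of Python. This is an illustration only, not the KNIME workflow used in the deck. The function name and the confusion-matrix counts are hypothetical: the counts are chosen to reproduce the slide's 100-donor and 100-non-donor examples, and are not the model's real results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the slide-6 metrics from confusion-matrix counts.

    tp: actual donors predicted as donors
    fp: non-donors predicted as donors (false alarms)
    tn: non-donors predicted as non-donors
    fn: actual donors predicted as non-donors
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),            # also called sensitivity
        "false_alarm_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),       # equals 1 - false alarm rate
    }


# Hypothetical counts: 100 actual donors (58 caught), 100 non-donors (41 flagged).
metrics = classification_metrics(tp=58, fp=41, tn=59, fn=42)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the recall is 58% and the false alarm rate is 41%, matching the worked examples on the slide. The precision here does not match the slide's 7% example because the toy data is balanced, whereas the real file has only about 5% responders.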
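
The area under the ROC curve reported on slide 7 can be read as the probability that a randomly chosen donor receives a higher score than a randomly chosen non-donor. The sketch below computes it that way by comparing every donor/non-donor pair. It is an assumption about how one might reproduce the number outside KNIME. The function name and the four-record toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (donor, non-donor) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    records, so suitable for small illustrations only.
    """
    donors = [s for s, y in zip(scores, labels) if y == 1]
    non_donors = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for d in donors:
        for n in non_donors:
            if d > n:
                correct += 1.0
            elif d == n:
                correct += 0.5
    return correct / (len(donors) * len(non_donors))


# Hypothetical scores: 1 = donor, 0 = non-donor.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the deck's 0.6110 in context.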
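
The lift figures on slide 8 and the campaign estimate on slide 14 follow from two small calculations, sketched here under stated assumptions. Cumulative lift is taken to be the response rate in the top-scored fraction of the file divided by the overall response rate, and the expected number of responders is mailings × base response rate × lift. The function names and toy data are hypothetical. The 70,000 mailings, 5% base rate and 1.2 lift come from slide 14.

```python
def cumulative_lift(scores, labels, fraction):
    """Response rate in the top `fraction` of scores divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def expected_responders(mailings, base_rate, lift):
    """Responders expected when mailing `mailings` people selected by the model."""
    return mailings * base_rate * lift


# Hypothetical ten-record file with two donors (1 = donor).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_lift = cumulative_lift(toy_scores, toy_labels, 0.2)

# Slide 14: 70,000 mailings, 5% base response rate, cumulative lift of 1.2.
responders = expected_responders(70_000, 0.05, 1.2)
response_rate = responders / 70_000
print(toy_lift, round(responders), round(response_rate, 2))
```

The second calculation reproduces the slide's 4,200 responders and 6% response rate.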
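
Slide 17 compares the regression models on R-squared, mean absolute error, mean squared error and root mean squared deviation. A minimal sketch of those four measures, using the standard textbook definitions, is shown below. The deck's figures came from KNIME, so this is an illustration and not the original computation. The function name and the four donation amounts are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r_squared = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r_squared": r_squared}


# Hypothetical donation amounts in dollars.
actual_donations = [10, 20, 30, 40]
predicted_donations = [12, 18, 33, 37]
scores = regression_metrics(actual_donations, predicted_donations)
print(scores)
```

Lower MAE, MSE and RMSE and a higher R-squared indicate a better fit, which is the basis on which the deck picks a winning model on slide 17.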