
Prospect Identification from a Credit Database using Regression, Decision Trees, And Neural Network


Identify prospects from a credit data set SMALL using data mining techniques

Data set: SMALL data set
• 145 Variables
• 8,000 observations

Tools Used:
• SAS Enterprise Miner Workstation 7.1
• SAS 9.3_M1

Steps involved:
• Data Quality Check
• Data Partition - TRAIN/ VALIDATE/ TEST
• Mining using Decision Trees - CHAID/ Pruned CHAID/ CART/ C4.5
• Data Mining using Regression - Forward/ Backward/ Stepwise
• Data Mining using Regression with Interaction terms included
• Data Mining using Neural Network
• Model Comparison and Scoring

Final Model Selection Analysis based on:
• LIFT Chart
• ROC Curve

Published in: Marketing, Technology


  1. Data Mining Project
     By: Akanksha Jain
  2. Project Goals
     • Goal: Using the historical credit data set SMALL, develop a model that can predict whether a prospect will respond to a future marketing campaign
     • Scope: SMALL data set
       – 145 Variables
       – 8,000 observations
       – Dependent Variable: RESP_FLG (Binary)
         • Responder: 1
         • Non-Responder: 0
  3. Tools
     • SAS Enterprise Miner Workstation 7.1
     • SAS 9.3_M1
  4. Variable Definitions
     • AAL01-AAL17: All Types (Char)
     • AAU01-AAU07: Auto (Char)
     • ABK01-ABK15: Bankcard (Char)
     • ACE01-ACE03: Cust Elim (Char)
     • ACL02-ACL12: Collection (Char)
     • ADI01-ADI09: Derog By Ind (Char)
     • AEQ01-AEQ07: Home Equity (Char)
     • AHI01-AHI05: Historical (Char)
     • AIN01-AIN15: Installment (Char)
     • AIQ01-AIQ05: Inquiries (Char)
     • ALE01-ALE07: Lease (Char)
     • ALN01-ALN07: LN Finance (Char)
     • AMG01-AMG07: Mortgage (Char)
     • APR17-APR21: Public REC (Char)
     • ART01-ART15: Retail (Char)
     • ARV01-ARV15: Revolving (Char)
     • CUS04: Customer Data (Char)
     • SCORE01: FICO (Num)
     • SCORE02: MDS (Market Derived Signals) (Num)
     • RESP_FLG: Responder Flag (Num)
  5. Data Cleaning
     • Dataset SMALL has missing values for variables SCORE01 (FICO) and SCORE02 (MDS)

     data mylib.small_clean mylib.small_bad;
        set mylib.small;
        if score01 = . or score02 = . then output mylib.small_bad;
        else output mylib.small_clean;
     run;

     LOG:
     NOTE: There were 8000 observations read from the data set MYLIB.SMALL.
     NOTE: The data set MYLIB.SMALL_CLEAN has 5782 observations and 145 variables.
     NOTE: The data set MYLIB.SMALL_BAD has 2218 observations and 145 variables.

     • Going forward, we will use dataset SMALL_CLEAN
     • Investigate separately why 2218 observations had missing values for SCORE01 and SCORE02
  6. Diagram
  7. Data Source
     • Rejected Variables (have more than 20 categories):
       – ACE03
       – ACL10
     • Variable RESP_FLG
       – Change Role to TARGET
       – Change Order to DESCENDING
     • Set Prior Probabilities
       – Non-Responder / event = "0": 0.99
       – Responder / event = "1": 0.01
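Setting prior probabilities tells Enterprise Miner the true population response rate (1% responders) so that posterior probabilities are rescaled accordingly. A minimal Python sketch of the standard prior-correction formula (illustrative only; the function name and the example sample rate are assumptions, and SAS EM applies this adjustment internally):

```python
def adjust_posterior(p_model, sample_rate, prior_rate):
    """Rescale a model's predicted responder probability from the
    training sample's response rate to the population prior
    (standard prior-correction formula)."""
    num = p_model * prior_rate / sample_rate
    den = num + (1 - p_model) * (1 - prior_rate) / (1 - sample_rate)
    return num / den
```

With `sample_rate` equal to `prior_rate` a score passes through unchanged; on an oversampled training set, a raw score of 0.5 shrinks toward the stated 1% prior.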
  8. Data Partition
     • Train - 55%
     • Validate - 35%
     • Test - 10%
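The 55/35/10 split can be sketched in Python as a plain random partition (illustrative; the Data Partition node also supports stratified sampling on the target, which this sketch omits):

```python
import random

def partition(rows, seed=42):
    """Shuffle the observations and split them 55% / 35% / 10%
    into TRAIN, VALIDATE and TEST."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.55)
    n_valid = int(len(shuffled) * 0.35)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])
```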
  9. Model: Maximum CHAID
     • Nominal Criterion: ProbChiSq
     • Significance Level: 0.2
  10. Model: Maximum CHAID
      On the left side of the tree, the percentage of 1's (i.e., Respondents) is higher; hence prospects with FICO score < 700.5 that fall in the (0, 4, 5, Missing) category of RETAIL: BAL > 0 IN 6 MNTHS, ALL are predicted to respond to the marketing campaign
  11. Maximum CHAID: Cumulative LIFT
  12. Maximum CHAID: Final Variables
  13. Model: Pruned CHAID
      • Nominal Criterion: ProbChiSq
      • Significance Level: 0.2
      • Leaf Size: 120
      • Split Size: 300
      • Maximum Depth: 3
  14. Model: Pruned CHAID
  15. Pruned CHAID: Cumulative LIFT
  16. Pruned CHAID: Final Variables
  17. Model: CART
      • Nominal Criterion: Gini
      • Significance Level: 0.2
      • Leaf Size: 120
      • Split Size: 300
      • Maximum Depth: 3
  18. CART: Tree
  19. CART: Cumulative LIFT
  20. CART: Final Variables
  21. Model: C4.5
      • Nominal Criterion: Entropy
      • Significance Level: 0.2
      • Leaf Size: 120
      • Split Size: 300
      • Maximum Depth: 3
  22. C4.5: Tree
  23. C4.5: Cumulative LIFT
  24. C4.5: Final Variables
  25. Variable Comparison
      • Maximum CHAID: SCORE01, AMG07, ABK10, ART11, AAL04, AIQ04, ACE02, AAU03, AAL14, AEQ07, AEQ01, AIN03, ABK14, AIN10
      • Pruned CHAID: SCORE01, AMG07, ALN01
      • CART: SCORE01, AMG07, ABK10, SCORE02, AMG06, AMG01, AMG03, ARV10, ARV03, ARV01
  26. Transform Variables
      (Distribution plots of SCORE01 and SCORE02)
  27. Transform Variables
      • SCORE01 and SCORE02 are skewed
      • Transform function: LOG
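A minimal sketch of the LOG transform (assuming a +1 offset to guard against log(0); Enterprise Miner's Transform Variables node chooses log(x) or log(x+1) depending on the data):

```python
import math

def log_transform(values, offset=1.0):
    """Apply log(x + offset) to compress a right-skewed variable
    such as SCORE01 or SCORE02."""
    return [math.log(v + offset) for v in values]
```

Order is preserved, but the long right tail is pulled in, which is the point of the transform.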
  28. Impute
      • Default Input Methods
        – For Interval Variables: Median
        – For Class Variables: Count
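A sketch of those two defaults (an illustrative stand-in for the Impute node, with None marking a missing value):

```python
from collections import Counter
from statistics import median

def impute(values):
    """Fill missing (None) entries: median for interval (numeric)
    variables, most frequent level ('count') for class variables."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```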
  29. Model: Event '0'
  30. Model: Full Model Regression
      • Input Coding - GLM
      • MODEL SELECTION - None
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  31. Full Model Regression: Cumulative LIFT
  32. Model: Forward Regression
      • Input Coding - GLM
      • MODEL SELECTION - Forward
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - YES
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  33. Forward Regression: Cumulative LIFT
  34. Forward Regression: Cumulative % Captured Response
  35. Forward Regression: Final Variables
  36. Model: Backward Regression
      • Input Coding - GLM
      • MODEL SELECTION - Backward
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - YES
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  37. Backward Regression: Cumulative LIFT
  38. Backward Regression: Cumulative % Captured Response
  39. Backward Regression: Final Variables
  40. Model: Stepwise Regression
      • Input Coding - GLM
      • MODEL SELECTION - Stepwise
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - No
      • MODEL SELECTION - SELECTION OPTIONS
        – ENTRY SIGNIFICANCE LEVEL = 0.15
        – STAY SIGNIFICANCE LEVEL = 0.05
        – MAXIMUM NUMBER OF STEPS = 300
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  41. Stepwise Regression: Cumulative LIFT
  42. Stepwise Regression: Cumulative % Captured Response
  43. Stepwise Regression: Final Variables
  44. Variable Comparison
      • Forward: AAL11, ACE01, AEQ01, AEQ07, AHI01, ALN01, AMG01, AMG07, APR20, ART11, LOG_SCORE01, AEQ02
      • Backward: ACE01, AEQ01, AEQ07, AHI01, ALN01, AMG01, AMG07, APR20, LOG_SCORE01, AEQ03, AEQ04, ALE01
      • Stepwise: AAL11, ACE01, AEQ01, AEQ07, AHI01, ALN01, AMG01, AMG07, APR20, ART11, LOG_SCORE01, ALE02
  45. Interaction Terms
      • log_score01 * log_score01
      • log_score01 * ace01
      • log_score01 * amg01
      • log_score01 * ahi01
      • log_score01 * log_score02
      • log_score02 * log_score02
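The six terms above are products of existing inputs. A sketch of how they could be materialized on a record (illustrative only; the dict keys and the assumption that the class variables ace01/amg01/ahi01 are already dummy-coded 0/1 are mine — in Enterprise Miner the terms are entered through the Term Editor instead):

```python
def add_interactions(row):
    """Append the six interaction terms to a record represented
    as a dict of numeric inputs."""
    s1, s2 = row["log_score01"], row["log_score02"]
    row["log_score01_sq"] = s1 * s1
    row["log_score01_x_ace01"] = s1 * row["ace01"]
    row["log_score01_x_amg01"] = s1 * row["amg01"]
    row["log_score01_x_ahi01"] = s1 * row["ahi01"]
    row["log_score01_x_log_score02"] = s1 * s2
    row["log_score02_sq"] = s2 * s2
    return row
```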
  46. Model: Forward Reg Interaction
      • EQUATION - USER TERMS - YES
      • EQUATION - TERM EDITOR - Enter Interaction Terms
      • Input Coding - GLM
      • MODEL SELECTION - Forward
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - YES
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  47. Forward Reg Interaction: Cumulative LIFT
  48. Forward Reg Interaction: Cumulative % Captured Response
  49. Forward Reg Interaction: Final Variables
  50. Model: Backward Reg Interaction
      • EQUATION - USER TERMS - YES
      • EQUATION - TERM EDITOR - Enter Interaction Terms
      • Input Coding - GLM
      • MODEL SELECTION - Backward
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - YES
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  51. Backward Reg Interaction: Cumulative LIFT
  52. Backward Reg Interaction: Cumulative % Captured Response
  53. Backward Reg Interaction: Final Variables
  54. Model: Stepwise Reg Interaction
      • EQUATION - USER TERMS - YES
      • EQUATION - TERM EDITOR - Enter Interaction Terms
      • Input Coding - GLM
      • MODEL SELECTION - Stepwise
      • SELECTION CRITERION - Akaike Information Criterion
      • USE SELECTION DEFAULTS - No
      • MODEL SELECTION - SELECTION OPTIONS
        – ENTRY SIGNIFICANCE LEVEL = 0.15
        – STAY SIGNIFICANCE LEVEL = 0.05
        – MAXIMUM NUMBER OF STEPS = 300
      • OPTIMIZATION OPTIONS - TECHNIQUE - Default
      • OPTIMIZATION OPTIONS - DEFAULT OPTIMIZATION - No
      • OPTIMIZATION OPTIONS - MAX ITERATIONS - 20
      • OPTIMIZATION OPTIONS - MAX FUNCTION CALLS - 10
  55. Stepwise Reg Interaction: Cumulative LIFT
  56. Stepwise Reg Interaction: Cumulative % Captured Response
  57. Stepwise Reg Interaction: Final Variables
  58. Variable Comparison
      • Forward_Interaction: LOG_SCORE01*ACE01, LOG_SCORE02*AMG01, AAL11, AEQ01, AEQ07, AHI01, ALN01, AMG07, APR20, ART11, AEQ02, ALE02
      • Backward_Interaction: AEQ01, AEQ07, AHI01, ALN01, AMG07, APR20, LOG_SCORE01*LOG_SCORE01, LOG_SCORE01*AHI01, ACE01, AEQ03, AEQ04, AMG01
      • Stepwise_Interaction: LOG_SCORE01*ACE01, LOG_SCORE02*AMG01, AAL11, AEQ01, AEQ07, AHI01, ALN01, AMG07, APR20, ART11, ALE01, LOG_SCORE01
  59. Model: Neural Network
      • NETWORK - DIRECT CONNECTION = Yes
      • OPTIMIZATION - PRELIMINARY TRAINING - ENABLE = No
      • OPTIMIZATION - Maximum Iterations = 50
      • OPTIMIZATION - PRELIMINARY TRAINING - Number of Runs = 10
      • MODEL SELECTION CRITERION - Misclassification
  60. Neural Network: Cumulative LIFT
  61. Neural Network: Cumulative % Captured Response
  62. Neural Network: Average Square Error
      As the number of iterations increases, the average squared error keeps decreasing on the TRAIN set but starts increasing on the VALIDATE set, a sign that the network is overfitting
  63. Ensemble Node
      Select the model that performs best in:
        – Decision Trees
        – Regression
        – Regression with Interaction Terms
      Build an Ensemble Node on:
        – Pruned CHAID
        – Forward Regression
        – Forward Reg Interaction
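A sketch of what the Ensemble node computes, assuming its default behavior of averaging the component models' posterior probabilities (function and variable names are mine):

```python
def ensemble_score(predictions):
    """Average the predicted responder probabilities of several
    component models, observation by observation.
    `predictions` is a list of per-model score lists."""
    return [sum(ps) / len(ps) for ps in zip(*predictions)]
```

Here `predictions` would hold the scores of Pruned CHAID, Forward Regression, and Forward Reg Interaction for the same observations.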
  64. Ensemble Node: Cumulative LIFT
  65. Ensemble Node: Cumulative % Captured Response
  66. Model Comparison
      • ASSESSMENT REPORTS - NUMBER OF BINS = 50
      • MODEL SELECTION - SELECTION STATISTIC = MISCLASSIFICATION RATE
      • Comparing LIFT at top 20%
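Cumulative LIFT at a depth is the response rate among the best-scored fraction of prospects divided by the overall response rate, so LIFT at the top 20% can be sketched as (an illustrative helper, not the Model Comparison node's code):

```python
def cumulative_lift(scores, actuals, depth=0.20):
    """Response rate in the top `depth` fraction (ranked by
    model score) divided by the overall response rate."""
    ranked = sorted(zip(scores, actuals), key=lambda p: -p[0])
    n_top = max(1, int(len(ranked) * depth))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    return top_rate / (sum(actuals) / len(actuals))
```

A lift of 5 at the top 20% means the model's best quintile responds five times as often as a random mailing would.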
  67. Model Comparison: ROC
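The area under the ROC curve equals the probability that a randomly chosen responder outscores a randomly chosen non-responder, which gives a compact way to compute the statistic behind the ROC comparison (an illustrative sketch using the rank-sum identity; quadratic in the class sizes, which is fine at this data size):

```python
def auc(scores, actuals):
    """AUC via the Mann-Whitney identity: fraction of
    (responder, non-responder) pairs the model orders
    correctly, counting ties as half."""
    pos = [s for s, a in zip(scores, actuals) if a == 1]
    neg = [s for s, a in zip(scores, actuals) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```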
  68. Model Comparison: Cumulative LIFT (Train)
  69. Model Comparison: Cumulative LIFT (Validate)
  70. Model Comparison: Cumulative LIFT (Test)
  71. Model Comparison: Conclusion
      • TRAIN:
        – Ensemble works best, followed by Forward Regression
        – Check the Validate and Test results to finalize the model
      • VALIDATE and TEST:
        – Forward Regression works better than Ensemble
  72. Final Model
      • Forward Regression
      • List of Variables:
        – AAL11
        – ACE01
        – AEQ01
        – AEQ07
        – AHI01
        – ALN01
        – AMG01
        – AMG07
        – APR20
        – ART11
        – LOG_SCORE01
        – AEQ02
  73. SCORE
      • In the Model Comparison node: SCORE -> SELECTION EDITOR
      • Set YES for Forward Regression and NO for Stepwise Reg Interaction (which was selected by default)
      • Connect the Model Comparison node to SCORE and run it
      • Get the optimized SAS code
  74. Model Performance
      • PROC RANK
        – Rank 2: Top 1/3rd responders
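PROC RANK with three groups puts the best-scored third of prospects into the top rank, which is the group mailed first. A sketch of the same grouping (illustrative; it ignores PROC RANK's tie-handling options):

```python
def tertile_ranks(scores):
    """Assign group 0-2 by ascending score, so rank 2 holds
    the top third of prospects."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos * 3 // len(scores)
    return ranks
```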
  75. Model Performance
  76. Thank You
      Questions?
