Kaggle "Give me some credit" challenge overview

Full description of the work associated with this project can be found at:
http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/

Predicting delinquency on debt

What is the problem?
• X Store has a retail credit card available to customers
• There are a number of possible sources of loss from this product, but one is customers defaulting on their debt
• Defaults prevent the store from collecting payment for products and services rendered

Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment over the last two years
• If only 5% of their carried debt was on the store credit card, this is potentially an:
  • Average loss of $8.12 per customer
  • Potential overall loss of $1.2 million

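A quick check of how the slide's figures combine (a sketch; the $8.12 average loss per customer is taken from the slide rather than re-derived from raw data):

```python
# Rough check of the slide's loss figures.
n_customers = 150_000          # customers in the examined database slice
avg_loss_per_customer = 8.12   # slide's figure, averaged over all customers

total_potential_loss = n_customers * avg_loss_per_customer
print(f"Potential overall loss: ${total_potential_loss:,.0f}")  # ~$1.2 million
```
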
What can be done?
• There are numerous models that can be used to predict which customers will default
• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
• Or to better screen which customers are approved for the card

How will I do this?
• This is a basic classification problem with important business implications
• We'll examine a few simplistic models to get an idea of performance
• Then explore decision tree methods to achieve better performance

What will the models use to predict delinquency?
Each customer has a number of attributes:
• John Smith: Age 23, Income $1600, Number of Lines 4, Delinquent: Yes
• Mary Rasmussen: Age 73, Income $2200, Number of Lines 2, Delinquent: No
• ...
We will use the customer attributes to predict whether they were delinquent

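To make the setup concrete, here is a minimal sketch of how such attributes become model inputs, using the slide's toy examples (pandas assumed; these are illustrative records, not real data):

```python
import pandas as pd

# Toy customer records from the slide: attributes plus the known outcome.
customers = pd.DataFrame([
    {"name": "John Smith",     "age": 23, "income": 1600, "num_lines": 4, "delinquent": 1},
    {"name": "Mary Rasmussen", "age": 73, "income": 2200, "num_lines": 2, "delinquent": 0},
])

X = customers[["age", "income", "num_lines"]]  # attributes fed to the model
y = customers["delinquent"]                    # target: was the customer delinquent?
```
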
How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:
• Train: 150,000 customers, delinquency included in the dataset
• Test: 101,000 customers, delinquency not included in the dataset
None of the customers in the test dataset are used to train the model

Internally we validate our model performance with k-fold cross-validation
Using only the train dataset, we can get a sense of how well our model performs without externally validating it:
• Split the train data into folds (Train 1, Train 2, Train 3)
• Train the algorithm on Train 1 and Train 2
• Test the algorithm on the held-out fold (Train 3)

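A minimal sketch of the 3-fold scheme described above, assuming scikit-learn and a feature matrix X with labels y covering the full 150,000-customer training slice:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Split the train data into 3 folds; train on two folds, test on the held-out
# fold, and rotate so each fold is used for testing once.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
print("Per-fold accuracy:", scores, "mean:", scores.mean())
```
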
What matters is how well we can predict the test dataset
• We judge this using accuracy: the number of correct predictions out of the total number of predictions made
• So with 100,000 customers and 80% accuracy, we will have correctly predicted whether 80,000 customers will default or not in the next two years

Putting accuracy in context
• We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their accounts to prevent it
• The potential loss is reduced by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy

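A sketch of the arithmetic behind these figures, reusing the $8.12 average loss per customer from the earlier slide (my reading of the slide's logic, not an exact accounting):

```python
avg_loss_per_customer = 8.12   # from the earlier slide
total_loss = 150_000 * avg_loss_per_customer     # ~$1.2M over two years

# Preventing half of the defaults avoids roughly half of that loss.
savings_at_50_percent = 0.5 * total_loss         # ~$600,000

# One percentage point of accuracy on 100,000 customers is 1,000 customers.
value_per_point_per_100k = 1_000 * avg_loss_per_customer   # ~$8,000
print(savings_at_50_percent, value_per_point_per_100k)
```
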
Looking at the actual data
[Screenshots of the raw training data fields]
• Missing values are filled in with simple assumptions: assume $2,500 for one field and assume 0 for another

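A sketch of loading and imputing the training data, assuming the standard Kaggle cs-training.csv file and assuming the slide's $2,500 and 0 fills apply to the MonthlyIncome and NumberOfDependents columns (the two fields with missing values in this dataset):

```python
import pandas as pd

# 150,000-customer training slice with the SeriousDlqin2yrs label included.
train = pd.read_csv("cs-training.csv", index_col=0)

# Fill missing values with the simple assumptions from the slides.
train["MonthlyIncome"] = train["MonthlyIncome"].fillna(2500)
train["NumberOfDependents"] = train["NumberOfDependents"].fillna(0)

y = train["SeriousDlqin2yrs"]                   # delinquency label
X = train.drop(columns=["SeriousDlqin2yrs"])    # customer attributes
```
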
There is a continuum of algorithmic choices to tackle the problem, from simpler and quicker to more complex and slower:
• Random Chance: 50%
• Simple Classification

For simple classification we pick a single attribute and find the best split in the customers
[Histogram: number of customers versus times past due, with candidate split points (1, 2, ...) dividing the population into true positives, true negatives, false positives, and false negatives]

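A sketch of this simple classifier: scan candidate thresholds on a single past-due count and keep the split with the best training accuracy (numpy assumed; the column name in the comment is the Kaggle field for 30-59 days past due):

```python
import numpy as np

def best_threshold(values, y):
    """Predict 'delinquent' when values >= t; return the threshold t with the
    highest training accuracy along with that accuracy."""
    best_t, best_acc = 0, 0.0
    for t in range(1, int(values.max()) + 1):
        pred = (values >= t).astype(int)
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Example with the 30-59 days past-due count from the training data:
# past_due = train["NumberOfTime30-59DaysPastDueNotWorse"].to_numpy()
# t, acc = best_threshold(past_due, y.to_numpy())
```
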
We evaluate possible splits using accuracy, precision, and sensitivity
• Accuracy = number correct / total number of predictions
• Precision = true positives / number of people predicted delinquent
• Sensitivity = true positives / number of people actually delinquent
[Plot: accuracy, precision, and sensitivity as a function of the split on "Number of Times 30-59 Days Past Due"]
The best split on this attribute achieves a KGI of 0.61 on the test set

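These three quantities correspond to accuracy, precision, and recall (sensitivity) in scikit-learn; a minimal sketch, assuming y_true and y_pred are 0/1 arrays of actual and predicted delinquency:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

acc  = accuracy_score(y_true, y_pred)    # number correct / total number
prec = precision_score(y_true, y_pred)   # true positives / predicted delinquent
sens = recall_score(y_true, y_pred)      # true positives / actually delinquent
```
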
However, not all fields are as informative
• Using the number of times past due 60-89 days, we achieve a KGI of 0.5
• The approach is naive and could be improved, but our time is better spent on different algorithms

Exploring algorithmic choices further
• Random Chance: 0.50
• Simple Classification: 0.50-0.61
• Random Forests

A random forest starts from a decision tree
• Starting from the customer data, find the best split in a set of randomly chosen attributes
• Example split: is age < 30?
  • No: 75,000 customers with age > 30
  • Yes: 25,000 customers with age < 30
• Continue splitting each subset in the same way (...)

A random forest is composed of many decision trees
• Each tree repeatedly takes its customer data, finds a best split, and divides the customers into "Yes" and "No" subsets
• Class assignment of a customer is based on how many of the decision trees "vote" for each class
• We use a large number of trees so as not to over-fit to the training data

The Random Forest algorithm is easily implemented
• In Python or R for initial testing and validation
• It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next

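A sketch of the random forest with scikit-learn (illustrative parameters; X and y are the training features and labels from the earlier data-loading sketch, and X_test holds the 101,000 test customers):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=150,      # number of trees to grow
    max_features="sqrt",   # random subset of attributes considered at each split
    n_jobs=-1,             # trees are independent, so build them in parallel
    random_state=0,
)
forest.fit(X, y)                                 # 150,000 labeled customers
test_prob = forest.predict_proba(X_test)[:, 1]   # probability of delinquency
```
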
A random forest performs well on the test set
• 10 trees: 0.779 KGI
• 150 trees: 0.843 KGI
• 1000 trees: 0.850 KGI
[Bar chart: accuracy of random chance, simple classification, and random forests]

Exploring algorithmic choices further
• Random Chance: 0.50
• Simple Classification: 0.50-0.61
• Random Forests: 0.78-0.85
• Gradient Tree Boosting

Boosting trees is similar to a random forest
• Each tree repeatedly splits the customer data (e.g., is age < 30?) into "Yes" and "No" subsets
• But instead of considering a set of randomly chosen attributes, it does an exhaustive search for the best split

How Gradient Boosting Trees differs from Random Forest
• The first tree is optimized to minimize a loss function describing the data
• The next tree is then optimized to fit whatever variability the first tree didn't fit
• This is a sequential process, in comparison to the random forest
• We also run the risk of over-fitting to the data, thus the learning rate

Implementing Gradient Boosted Trees
• In Python or R it is easy for initial testing and validation
• There are implementations that use Hadoop, but it is more complicated to achieve the best performance

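A corresponding sketch with scikit-learn's gradient boosting (illustrative parameters; the learning rate shrinks each tree's contribution to limit over-fitting):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    n_estimators=100,     # trees are added sequentially
    learning_rate=0.1,    # shrink each tree's contribution to limit over-fitting
    random_state=0,
)
gbt.fit(X, y)
test_prob = gbt.predict_proba(X_test)[:, 1]
```
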
Gradient Boosting Trees performs well on the dataset
• 100 trees, 0.1 learning rate: 0.865022 KGI
• 1000 trees, 0.1 learning rate: 0.865248 KGI
[Plot: KGI (0.75-0.85) versus learning rate (0-0.8); bar chart comparing random chance, simple classification, random forests, and boosting trees]

Moving one step further in complexity
• Random Chance: 0.50
• Simple Classification: 0.50-0.61
• Random Forests: 0.78-0.85
• Gradient Tree Boosting: 0.71-0.8659
• Blended Method

Or more accurately, an ensemble of ensemble methods
• Algorithm progression: Random Forest, Extremely Random Forest, Gradient Tree Boosting
• Each algorithm is run on the train data and produces a set of predicted delinquency probabilities (e.g., 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...)

Combine all of the model information
• Each model contributes a set of train data probabilities
• Optimize the weighting of the train probabilities against the known delinquencies
• Apply the same weighting scheme to the set of test data probabilities

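One way to implement this blending with scikit-learn is stacking: collect out-of-fold probabilities from each ensemble on the train data, fit a simple weighting model (here a logistic regression) against the known delinquencies, and apply that same weighting to the test probabilities. This is an illustrative sketch, not necessarily the exact optimization used in the original work:

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Random forest, extremely randomized trees, and gradient boosting.
models = [
    RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0),
    ExtraTreesClassifier(n_estimators=150, n_jobs=-1, random_state=0),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0),
]

# Out-of-fold delinquency probabilities on the train data, one column per model.
train_probs = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1] for m in models
])

# Learn how to weight the three probability columns against known delinquencies.
blender = LogisticRegression()
blender.fit(train_probs, y)

# Apply the same weighting scheme to the test data probabilities.
test_probs = np.column_stack([m.fit(X, y).predict_proba(X_test)[:, 1] for m in models])
blended = blender.predict_proba(test_probs)[:, 1]
```
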
Implementation can be done in a number of ways
• Testing in Python or R is slower, due to the sequential nature of applying the algorithms
• It could be made faster by parallelizing: running each algorithm separately and combining the results

Assessing model performance
• Blending performance, 100 trees: 0.864394 KGI
• But this performance and the possibility of additional gains come at a distinct time cost
[Bar chart: accuracy of random chance, simple classification, random forests, boosting trees, and the blended model]

Examining the continuum of choices
• Random Chance: 0.50
• Simple Classification: 0.50-0.61
• Random Forests: 0.78-0.85
• Gradient Tree Boosting: 0.71-0.8659
• Blended Method: 0.864

What would be best to implement?
• There is a large amount of optimization that could still be done on the blended method
• However, that algorithm takes the longest to run, and the constraint applies in testing and validation as well
• Random Forests returns a reasonably good result; it is quick and easily parallelized
• Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized

Increases in predictive performance have real business value
• Using any of the more complex algorithms, we achieve an increase of roughly 35 percentage points in accuracy compared to random chance
• That translates into a potential decrease of ~$420k in losses by identifying customers likely to default in the training set alone

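One way to reproduce the ~$420k figure is to combine the earlier ~$8,000-per-percentage-point estimate with the 35-point gain; a sketch of that arithmetic (my reconstruction, not a calculation shown on the slides):

```python
points_gained = 35                  # ~0.85 accuracy versus 0.50 for random chance
value_per_point_per_100k = 8_120    # 1,000 customers * $8.12 average loss
train_customers = 150_000

savings = points_gained * value_per_point_per_100k * (train_customers / 100_000)
print(f"${savings:,.0f}")  # ~$426k, in line with the slide's ~$420k
```
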
Thank you for your time
