# Kaggle "Give me some credit" challenge overview

Full description of the work associated with this project can be found at:
http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/

Published in: Business, Economy & Finance

### Kaggle "Give me some credit" challenge overview

1. Predicting delinquency on debt
5. What is the problem?
   • X Store has a retail credit card available to customers
   • There can be a number of sources of loss from this product, but one is customers defaulting on their debt
   • Defaults prevent the store from collecting payment for products and services rendered
10. Is this problem big enough to matter?
   • Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
   • If only 5% of their carried debt was on the store credit card, this is potentially:
   • An average loss of \$8.12 per customer
   • A potential overall loss of \$1.2 million
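As a quick sanity check, the slide's per-customer and overall loss figures do multiply out (pure Python; the numbers are taken directly from the slides above):

```python
# Back-of-envelope check of the slide's loss estimate.
customers = 150_000
avg_loss_per_customer = 8.12  # dollars, from the slide

total_loss = customers * avg_loss_per_customer
print(round(total_loss))  # 1218000, i.e. roughly the \$1.2 million quoted
```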
14. What can be done?
   • There are numerous models that can be used to predict which customers will default
   • Predictions could be used to decrease credit limits or cancel credit lines for current risky customers, minimizing potential loss
   • Or to better screen which customers are approved for the card
18. How will I do this?
   • This is a basic classification problem with important business implications
   • We'll examine a few simplistic models to get an idea of baseline performance
   • Then explore decision tree methods to achieve better performance
23. What will the models use to predict delinquency?
   Each customer has a number of attributes:
   • John Smith (Delinquent: Yes, Age: 23, Income: \$1600, Number of Lines: 4)
   • Mary Rasmussen (Delinquent: No, Age: 73, Income: \$2200, Number of Lines: 2)
   • ...
   We will use these attributes to predict whether a customer was delinquent
28. How do we make sure that our solution actually has predictive power?
   We have two slices of the customer dataset:
   • Train: 150,000 customers, with delinquency recorded in the dataset
   • Test: 101,000 customers, with delinquency not in the dataset
   None of the customers in the test dataset are used to train the model
32. Internally we validate our model performance with cross-fold validation
   Using only the train dataset we can get a sense of how well our model performs without externally validating it
   [Diagram: the train set is split into folds Train 1, Train 2, and Train 3; the algorithm is trained on two folds and tested on the held-out fold]
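The fold-based validation sketched above is a one-liner in scikit-learn (an assumed library, since the deck only says "Python or R"; the data below is synthetic stand-in customer data, not the Kaggle file):

```python
# Sketch of cross-fold validation over a train set, as in the diagram.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                          # stand-in attributes
y = (X[:, 0] + rng.normal(size=300) > 1).astype(int)   # stand-in delinquency flag

model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=3)  # 3 folds, as in the diagram
print(scores.mean())
```

Each fold's score comes from a model that never saw that fold during training, which is what gives an internal estimate of generalization.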
33. What matters is how well we can predict the test dataset
   We judge this using the accuracy: the number of predictions correct out of the total number of predictions made
   So with 100,000 customers and an 80% accuracy, we will have correctly predicted whether 80,000 customers will default in the next two years
36. Putting accuracy in context
   • We could save \$600,000 over two years if we correctly predicted 50% of the customers that would default and changed their accounts to prevent it
   • The potential loss shrinks by ~\$8,000 for every 100,000 customers with each percentage-point increase in accuracy
41. Looking at the actual data
   [Data-table screenshots omitted; where values are missing we assume \$2,500 for one field and 0 for another]
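A minimal sketch of that missing-value handling, assuming (the slides do not name the fields) that the \$2,500 fill applies to an income-like column and the 0 fill to a count-like column; the column names and values here are illustrative:

```python
# Fill missing values with the constants quoted on the slide.
import pandas as pd

df = pd.DataFrame({
    "MonthlyIncome": [1600.0, None, 2200.0],       # hypothetical field names
    "NumberOfDependents": [2.0, 1.0, None],
})
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(2500)   # "Assume \$2,500"
df["NumberOfDependents"] = df["NumberOfDependents"].fillna(0)  # "Assume 0"
print(df["MonthlyIncome"].tolist())  # [1600.0, 2500.0, 2200.0]
```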
46. There is a continuum of algorithmic choices to tackle the problem
   Simpler, Quicker → Complex, Slower
   • Random Chance: 50%
   • Simple Classification
54. For simple classification we pick a single attribute and find the best split in the customers
   [Histogram: Number of Customers versus Times Past Due, with candidate split points 1, 2, ... dividing the true-positive, true-negative, false-positive, and false-negative regions]
60. We evaluate possible splits using accuracy, precision, and sensitivity
   • Accuracy = (number correct) / (total number)
   • Precision = (true positives) / (number of people predicted delinquent)
   • Sensitivity = (true positives) / (number of people actually delinquent)
   [Plot: accuracy, precision, and sensitivity (0 to 0.8) versus the split on Number of Times 30-59 Days Past Due (0 to 100)]
   The best split achieves 0.61 KGI (the Kaggle leaderboard score) on the test set
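Evaluating one candidate split with the three metrics defined above takes only a few lines; the past-due counts and labels below are made up for illustration, not drawn from the dataset:

```python
# Score a single threshold split on one attribute.
past_due = [0, 0, 1, 2, 3, 5, 0, 1]       # times 30-59 days past due
delinquent = [0, 0, 1, 1, 1, 1, 0, 0]     # ground-truth labels

threshold = 2  # predict "delinquent" when past_due >= threshold
pred = [1 if x >= threshold else 0 for x in past_due]

tp = sum(p == 1 and t == 1 for p, t in zip(pred, delinquent))
tn = sum(p == 0 and t == 0 for p, t in zip(pred, delinquent))
accuracy = (tp + tn) / len(pred)           # number correct / total
precision = tp / sum(pred)                 # TP / predicted delinquent
sensitivity = tp / sum(delinquent)         # TP / actually delinquent
print(accuracy, precision, sensitivity)    # 0.875 1.0 0.75
```

Sweeping `threshold` over the attribute's range produces the curves shown in the slide's plot.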
62. However, not all fields are as informative
   • Using the number of times past due 60-89 days we achieve a KGI of only 0.5
   • The approach is naive and could be improved, but our time is better spent on different algorithms
64. Exploring algorithmic choices further
   Simpler, Quicker → Complex, Slower
   • Random Chance: 0.50
   • Simple Classification: 0.50-0.61
   • Random Forests
70. A random forest starts from a decision tree
   • Find the best split in a set of randomly chosen attributes
   • For example: Is age < 30? No: 75,000 customers > 30; Yes: 25,000 customers < 30
   • The splitting continues recursively down each branch (...)
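The single-tree building block can be sketched as follows (scikit-learn assumed; the ages and labels are illustrative, chosen so the learned split lands near the slide's "Is age < 30?" example):

```python
# A depth-1 decision tree learns one best split on one attribute.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

age = np.array([[22], [25], [28], [35], [45], [60], [70], [72]])
delinquent = np.array([1, 1, 1, 0, 0, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(age, delinquent)
# The learned threshold plays the role of the "Is age < 30?" split
print(tree.tree_.threshold[0])
```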
74. A random forest is composed of many decision trees
   [Diagram: many independent trees, each splitting the customer data on its own best split]
   • The class assigned to a customer is based on how many of the decision trees "vote" for each class
   • We use a large number of trees to avoid over-fitting to the training data
77. The Random Forest algorithm is easily implemented
   • In Python or R for initial testing and validation
   • It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
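In Python that initial implementation is a few lines with scikit-learn (assumed library; the synthetic data stands in for the Kaggle training file, and 150 trees matches one of the settings reported on the next slide):

```python
# Fit a random forest and produce per-customer delinquency probabilities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                  # attributes
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=150, random_state=0).fit(X, y)
# Probability of delinquency, the kind of score submitted for KGI scoring
proba = forest.predict_proba(X)[:, 1]
print(proba.shape)
```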
82. A random forest performs well on the test set
   • 10 trees: 0.779 KGI
   • 150 trees: 0.843 KGI
   • 1000 trees: 0.850 KGI
   [Bar chart: scores from 0.4 to 0.9 for Random, Classification, and Random Forests]
84. Exploring algorithmic choices further
   Simpler, Quicker → Complex, Slower
   • Random Chance: 0.50
   • Simple Classification: 0.50-0.61
   • Random Forests: 0.78-0.85
   • Gradient Tree Boosting
86. Boosting Trees is similar to a Random Forest
   • Each tree again splits the customer data (e.g., Is age < 30?)
   • But instead of splitting on a random subset of attributes, we do an exhaustive search for the best split
90. How Gradient Boosting Trees differs from Random Forest
   • The first tree is optimized to minimize a loss function describing the data
   • The next tree is then optimized to fit whatever variability the first tree didn't fit
   • This is a sequential process, in contrast to the random forest
   • We also run the risk of over-fitting to the data, hence the learning rate, which shrinks each tree's contribution
92. Implementing Gradient Boosted Trees
   • In Python or R it is easy for initial testing and validation
   • There are implementations that use Hadoop, but it is more complicated to achieve the best performance
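A minimal Python sketch, again assuming scikit-learn and synthetic stand-in data; the 100 trees and 0.1 learning rate mirror the first configuration reported on the next slide:

```python
# Fit gradient-boosted trees: each tree corrects the previous trees' errors,
# with the learning rate damping each correction.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500) > 0.5).astype(int)

gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X, y)
proba = gbt.predict_proba(X)[:, 1]  # delinquency probabilities
print(len(proba))
```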
96. Gradient Boosting Trees performs well on the dataset
   • 100 trees, 0.1 learning rate: 0.865022 KGI
   • 1000 trees, 0.1 learning rate: 0.865248 KGI
   [Plots: KGI (0.75 to 0.85+) versus learning rate (0 to 0.8), and a bar chart comparing Random, Classification, Random Forests, and Boosting Trees]
97. Moving one step further in complexity
   Simpler, Quicker → Complex, Slower
   • Random Chance: 0.50
   • Simple Classification: 0.50-0.61
   • Random Forests: 0.78-0.85
   • Gradient Tree Boosting: 0.71-0.8659
   • Blended Method
104. Or, more accurately, an ensemble of ensemble methods
   • Algorithm progression: Random Forest, Extremely Random Forest, Gradient Tree Boosting
   • Each algorithm produces a column of delinquency probabilities for the train data (e.g., 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...)
107. Combine all of the model information
   • Optimize a weighting of the train-data probabilities against the known delinquencies
   • Apply the same weighting scheme to the set of test-data probabilities
108. Implementation can be done in a number of ways
   • Testing in Python or R is slower, due to the sequential nature of applying the algorithms
   • It could be made faster in parallel, running each algorithm separately and combining the results
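One way to sketch the blend: stack each model's out-of-fold train probabilities as columns, then learn a weighting against the known delinquencies. The deck does not specify the combiner, so the logistic regression here is an assumed, common choice, and the data is synthetic:

```python
# Blend three tree ensembles by weighting their probability columns.
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0.5).astype(int)

models = [RandomForestClassifier(n_estimators=50, random_state=0),
          ExtraTreesClassifier(n_estimators=50, random_state=0),
          GradientBoostingClassifier(random_state=0)]
# One probability column per model, predicted out-of-fold on the train set
stack = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1]
    for m in models])
blender = LogisticRegression().fit(stack, y)  # learns the weighting
print(stack.shape)
```

The fitted weighting would then be applied unchanged to the models' test-set probability columns, as the previous slide describes.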
111. Assessing model performance
   • Blending performance, 100 trees: 0.864394 KGI
   • But this performance, and the possibility of additional gains, comes at a distinct time cost
   [Bar chart: scores from 0.4 to 0.9 for Random, Classification, Random Forests, Boosting Trees, and Blended]
112. Examining the continuum of choices
   Simpler, Quicker → Complex, Slower
   • Random Chance: 0.50
   • Simple Classification: 0.50-0.61
   • Random Forests: 0.78-0.85
   • Gradient Tree Boosting: 0.71-0.8659
   • Blended Method: 0.864
117. What would be best to implement?
   • There is a large amount of optimization in the blended method that could still be done
   • However, that algorithm takes the longest to run, and the constraint applies in testing and validation as well
   • Random Forests returns a reasonably good result; it is quick and easily parallelized
   • Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized
120. Increases in predictive performance have real business value
   • Using any of the more complex algorithms we achieve an increase of 35% in comparison to random
   • That is a potential decrease of ~\$420k in losses by identifying customers likely to default, in the training set alone
121. Thank you for your time