Predicting Credit Card Defaults using Machine Learning Algorithms, by Sagar Tupkar
This is a project I worked on as a Capstone for my Masters in Business Analytics program at the University of Cincinnati. In this project, I performed an end-to-end data mining exercise, including data cleaning, distribution analysis, exploratory data analysis and model building, to identify and predict credit card defaults using customers' data on past payments and general profile. In building the machine learning models, I fit and compared the performance of multiple algorithms, including logistic regression, PCA, classification trees, the AdaBoost classifier, ANN and LDA.
Credit card default is an important issue that brings negative consequences to both sides, i.e., banks and customers. If a customer does not pay his obligations, the bank loses money, the customer loses credibility for future borrowing, collection calls start to be made and, as a last resort, the case may go to court. To avoid all of that trouble, effective methods that can predict credit card default are needed. Default prediction is therefore an important, challenging and useful task.
This presentation documents how the problem can be addressed, following the pipeline of a typical pattern recognition application. The main task is to classify a set of samples, representing the history of payments and bill statements of a given client plus some background information about the client, according to the client's ability to pay (or default on) the next monthly credit card payment.
Consumer Credit Scoring Using Logistic Regression and Random Forest, by Hirak Sen Roy
Project Details: In this study, the concept and application of credit scoring in a German banking environment are explained. A credit scoring model is developed using logistic regression and random forest. Limitations of the model are discussed and possible solutions are given, with an overview of LASSO.
Guide: Dr. Sibnarayan Guria, Associate Professor and Head of the Department, Department of Statistics, West Bengal State University
Language Used: R
Predictive Model for Loan Approval Process using SAS 9.3_M1, by Akanksha Jain
This is a predictive model that uses logistic regression to help a German bank make statistically better loan approval decisions in the future. It uses a historical credit data set with 1,000 data points and 20 variables.
Tool used:
SAS 9.3_M1
Steps Involved are:
- Data Quality check using Correlations and VIF Tests
- Analysis of different Variable Selection Methods such as Forward, Backward and Stepwise
- Variable Selection on the basis of Parameter Estimates and Odds Ratio
- Outlier Analysis to identify the outliers and improve the model
- Final Model Selection Decision based on ROC curve, Percent Concordant, PROC RANK and the Hosmer-Lemeshow Test
Customer churn has become a big issue in many banks because it costs a lot more to acquire a new customer than to retain an existing one. With a customer churn prediction model, possible churners in a bank can be identified, and as a result the bank can take action to prevent them from leaving. To set up such a model at a bank in Iceland, a few things have to be considered: how a churner is defined, and which variables and methods to use. We propose that a churner for this Icelandic bank be defined as a customer who has not been active for the last three months, based on the bank's definition of an active customer. Behavioral and demographic variables should be used as input for the model, and either a decision tree or logistic regression used as the technique.
Credit risks are calculated based on the borrower's overall ability to repay. Our objective was to use optimization to create a tool that approves or rejects loan applications. We also used optimization to set the interest rate and the amount of credit extended to borrowers who were approved.
As part of our team's enrollment in the Data Science Super Specialization course at UpX Academy, we submitted several projects for our final assessments; one of them was this Telecom Churn Analysis Model.
The input data was provided by UpX Academy and the language used was R. The main objectives of the project were:
-> To predict customer churn.
-> To highlight the main variables/factors influencing customer churn.
-> To use various ML algorithms to build prediction models and evaluate their accuracy and performance.
-> To find the best model for our business case and provide an executive summary.
To address this business problem, we followed a thorough approach. We did a detailed exploratory data analysis consisting of various box plots, bar plots and more. We then built as many classification models as fit our business case (logistic regression, kNN, decision trees, random forest, SVM) and also tried a Cox proportional hazards survival model. For every model, we then tried to boost performance by applying various tuning techniques.
As we are all still learning these concepts, please feel free to provide feedback on our work. Any suggestions are most welcome... :)
Thanks!!
What is Predictive Analytics?
Predictive analytics is the branch of advanced analytics that uses diverse techniques such as data mining, predictive modelling, statistics, machine learning and artificial intelligence to analyse current data and predict future outcomes.
To Know more: https://goo.gl/zAcnCR
LOAN DEFAULT PREDICTION – A CASE STUDY
Content Covered in this video:
Business Problem & Benefits
The Risk - LOAN DEFAULT PREDICTION
Data Analysis Process
Data Processing
Predictive Analysis Process
Tools & Technology
There are 100,000 applicants for loans. Who is likely to default? How can loans be offered effectively?
There are 100,000 consumers. Who is likely to buy my product? How can I market my product effectively?
There are more than 1,000,000,000 transactions in a day. How can fraudulent transactions be identified?
There are 1,000,000 claims every year. How can fake claims be identified?
A study of cases of customers defaulting on payments in Taiwan.
From a risk management perspective, a bank or credit card company is most interested in minimizing its losses from a given customer.
The information most valuable to them is an estimate of the probability of default, rather than a binary classification of a customer as credible or not credible.
Goal: To assess the predictive accuracy of estimated default probabilities for Taiwanese credit card clients.
Problem Analysis: classify the probability of default for next month, with 1 as "Default" and 0 as "Not Default".
BUSINESS CASE
An international banking group implemented an Early Warning System and strategies to reduce the number of cases to be treated in the recovery process and the overall collection costs.
Explore our students' project on predicting bank customer churn using advanced analytics techniques. The project employs machine learning algorithms to analyze customer data and forecast the likelihood of churn, offering insights into customer retention strategies, predictive modeling, and the potential impact on banking operations. To learn more, see https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Presentation of Birnbaum's Likelihood Principle foundational paper at the Reading Statistical Classics seminar, Jan. 20, 2013, Université Paris-Dauphine
Comparison Study of Decision Tree Ensembles for Regression, by Seonho Park
Nowadays, decision tree ensemble methods are widely used for solving classification and regression problems due to their rigor and robustness. Compared with classification, their performance on regression problems has not yet been addressed in detail. In this presentation, we review the state-of-the-art decision tree ensemble methods in scikit-learn and xgboost for regression. Empirical results are also presented to compare their performance and computational efficiency.
Valencian Summer School 2015
Day 1
Lecture 3
Ensembles of Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Machine Learning and Data Mining: 16 Classifier Ensembles, by Pier Luca Lanzi
From the course "Machine Learning and Data Mining" for the degree in Computer Engineering at the Politecnico di Milano. In this lecture we introduce classifier ensembles.
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013; Presented with H2O), by Sri Ambati
Dr. Trevor Hastie of Stanford University discusses the data science behind Gradient Boosted Regression and Classification
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Winning Kaggle 101: Introduction to Stacking, by Ted Xiao
An Introduction to Stacking by Erin LeDell, from H2O.ai
Presented as part of the "Winning Kaggle 101" event, hosted by Machine Learning at Berkeley and Data Science Society at Berkeley. Special thanks to the Berkeley Institute of Data Science for the venue!
H2O.ai: http://www.h2o.ai/
ML@B: ml.berkeley.edu
DSSB: http://dssberkeley.org
BIDS: http://bids.berkeley.edu/
Tree models with Scikit-Learn: Great models with little assumptions, by Gilles Louppe
This talk gives an introduction to tree-based methods, both from a theoretical and practical point of view. It covers decision trees, random forests and boosting estimators, along with concrete examples based on Scikit-Learn about how they work, when they work and why they work.
These slides use concepts from my (Jeff Funk) course entitled Biz Models for Hi-Tech Products to analyze the business model for Kaggle’s Crowd Sourcing Service for Data Analytics. Kaggle connects data scientists with organizations who have problems related to data analysis. Kaggle helps organizations define their data analytic problems, present them to data scientists, and organize and evaluate competitions between data analytic solutions. Its data ensemble technique also evaluates the effectiveness of the various solutions. These slides describe the specific value proposition for organizations and data scientists and other aspects of the business model such as the method of value capture, scope of activities, and method of strategic control.
Dive into insurance churn prediction with this data analysis project presented by the Boston Institute of Analytics. Students leverage advanced data analysis techniques to forecast and anticipate customer behavior in the insurance industry: from analyzing historical data and customer demographics to identifying predictive indicators and developing churn prediction models, the project offers a comprehensive exploration of the factors influencing insurance churn dynamics, along with actionable recommendations derived from the analysis. To learn more about our data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
Customer churn has evolved into one of the major problems for financial organizations. Incessant competition in the market and the high cost of acquiring new customers have driven organizations to focus on more effective customer retention strategies.
Customer Lifetime Value for Insurance Agents, by Scott Boren
Customer lifetime value for insurance agents was presented by Scott Boren to the BIG Insurance Group in Southern California. The lecture was designed to share insights from his consulting firm and the impact a customer lifetime value strategy can have on an insurance agent's service, marketing, and development of customer personas.
Real Estate Executive Summary (MKT460 Lab #5), by Mira McKee
This is a lab report I wrote for my Marketing 460: Information & Analysis class. Utilizing SPSS, a statistical analysis program, to analyze real estate data, I wrote this report detailing my research steps (including regression analysis, CHAID analysis, customer profiles, etc.) and conclusion. I designed the Lab Report in Canva.
Being Right Starts By Knowing You're Wrong, by Data Con LA
Data Con LA 2020
Description
The recent proliferation of predictive analytics within companies is of limited benefit unless these companies learn to measure, understand, and embrace a critical concept: error. There is no such thing as a perfect predictive model, and all tools using any sort of predictive model will have error. Despite being relatively easy to implement and understand, consistent error measurement continues to be underutilized or even completely avoided. In this session we will discuss:
* Why embracing error is so valuable to companies.
* Basic ways to measure error in commonly used models and in data source systems such as CRMs and ERPs.
* Most importantly, ways to approach company leadership with the concept of error.
Speaker
Ryan Johnson, GoGuardian, Director of Science and Analytics
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example, by Pedro Ecija Serrano
A comparison of classification methods for predicting buyers in banking telemarketing: overcoming class imbalance and gaining insight into which customers are likely to buy a particular financial product.
Transform Your Credit and Collections with Predictive Analytics, by FIS
Companies are increasingly driving collections by harnessing the power of big data with tools like predictive analytics. View our slideshow to learn how collection departments are using the big data they hold on customers, together with advanced tools such as predictive analytics, to drive strategic collections.
A Quick Overview of Why to Use and How to Set Up iPython Notebooks for Research, by Adam Pah
A quick overview of why to use and how to set up iPython notebooks for research in the Amaral lab. An example notebook is available as a gist at:
http://nbviewer.ipython.org/gist/anonymous/f8e6d8985d2ea0e4bab1
A super basic tutorial on how to create an interactive web graphic with d3.js, for those who know how to program and use the command line but not any web technology.
5. What is the problem?
• X Store has a retail credit card available to customers
• There can be a number of sources of loss from this product, but one is customers defaulting on their debt
• This prevents the store from collecting payment for products and services rendered
10. Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
• If only 5% of their carried debt was on the store credit card, this is potentially an:
• Average loss of $8.12 per customer
• Potential overall loss of $1.2 million
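The loss figures above can be sanity-checked with a few lines of arithmetic (the 150,000-customer slice, 6.6% delinquency rate and $8.12 average loss come from the slides; the per-delinquent figure is derived and illustrative):

```python
customers = 150_000
delinquency_rate = 0.066
avg_loss_per_customer = 8.12  # dollars, averaged over ALL customers

delinquent = customers * delinquency_rate        # 9,900 seriously delinquent
total_loss = customers * avg_loss_per_customer   # $1,218,000 -> the "$1.2 million"
loss_per_delinquent = total_loss / delinquent    # ~$123 per defaulting customer

print(round(delinquent), round(total_loss), round(loss_per_delinquent, 2))
```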
14. What can be done?
• There are numerous models that can be used to predict which customers will default
• These could be used to decrease credit limits or cancel credit lines for current risky customers, minimizing potential loss
• Or to better screen which customers are approved for the card
18. How will I do this?
• This is a basic classification problem with important business implications
• We'll examine a few simplistic models to get an idea of performance
• Then explore decision tree methods to achieve better performance
23. What will the models use to predict delinquency?
Each customer has a number of attributes:
John Smith
Delinquent: Yes
Age: 23
Income: $1600
Number of Lines: 4
Mary Rasmussen
Delinquent: No
Age: 73
Income: $2200
Number of Lines: 2
...
We will use the customer attributes to predict whether they were delinquent
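In tabular form, the attributes above become the feature matrix and label vector that any of the candidate models would consume (a minimal sketch; the column order is arbitrary):

```python
# Each row is one customer: [age, monthly income, number of credit lines]
X = [
    [23, 1600, 4],  # John Smith
    [73, 2200, 2],  # Mary Rasmussen
]
# Label to predict: was the customer delinquent? (1 = yes, 0 = no)
y = [1, 0]

assert len(X) == len(y)  # one label per customer
```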
28. How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:
Train: 150,000 customers, delinquency labels in the dataset
Test: 101,000 customers, delinquency labels not in the dataset
None of the customers in the test dataset are used to train the model
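The train/test separation described above can be sketched with NumPy (the sizes and attributes here are toy stand-ins; in the actual setup the 101,000 test customers simply come without delinquency labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled slice: rows = customers, columns = attributes.
n = 1_000
X = rng.normal(size=(n, 3))
y = rng.random(n) < 0.066  # ~6.6% delinquency, as in the slides

# Carve out a holdout set whose labels the model never sees during training.
idx = rng.permutation(n)
cut = int(0.8 * n)
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# No customer appears in both sets.
assert set(train_idx).isdisjoint(test_idx)
```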
29. Internally we validate our model performance
with cross-fold validation
Using only the train dataset we can get a sense of how
well our model performs without externally validating it
The train set is split into folds (Train 1, Train 2, Train 3);
the algorithm is trained on some folds (e.g. Train 1 and Train 2)
and tested on the held-out fold (Train 3), rotating through the
folds so each one is held out once
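The fold rotation can be sketched with plain index arithmetic (libraries such as scikit-learn provide the same thing ready-made; this is just the mechanics):

```python
def k_fold_indices(n, k):
    """Split n row indices into k folds; yield (train, held_out)
    index pairs so every fold is used for testing exactly once."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        held_out = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, held_out

for train, held_out in k_fold_indices(9, k=3):
    print(len(train), len(held_out))             # 6 3, three times
```

Averaging a model's score over the held-out folds gives the internal performance estimate without ever touching the external test set.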
33. What matters is how well we can predict
the test dataset
We judge this using the accuracy: the number of
correct predictions out of the total number of
predictions made
So with ~100,000 customers and an 80% accuracy we
will have correctly predicted whether 80,000
customers will default or not in the next two years
35. Putting accuracy in context
We could save $600,000 over two years if we
correctly predicted 50% of the customers that would
default and changed their accounts to prevent it
The potential loss is reduced by ~$8,000 for every
100,000 customers with each percentage-point
increase in accuracy
49. For simple classification we pick a single attribute
and find the best split in the customers
[Figure: histogram of the number of customers vs. times past due;
candidate split thresholds (1, 2, ...) divide the customers into
true positives, true negatives, false positives, and false negatives]
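A hypothetical sketch of that split: sweep a threshold over one attribute and tally the four outcome types (the toy customer tuples below are invented for illustration):

```python
# Toy data: (times past due, actually delinquent?)
customers = [(0, False), (0, False), (0, False), (1, False),
             (1, True), (2, False), (2, True), (3, True)]

def confusion_counts(data, threshold):
    """Predict 'delinquent' when the attribute reaches the threshold,
    then tally true/false positives and negatives."""
    tp = sum(1 for x, y in data if x >= threshold and y)
    fp = sum(1 for x, y in data if x >= threshold and not y)
    tn = sum(1 for x, y in data if x < threshold and not y)
    fn = sum(1 for x, y in data if x < threshold and y)
    return tp, fp, tn, fn

for threshold in (1, 2, 3):
    print(threshold, confusion_counts(customers, threshold))
```

Moving the threshold right trades false positives for false negatives, which is exactly what the evaluation metrics below measure.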
55. We evaluate possible splits using accuracy,
precision, and sensitivity
Acc = Number Correct / Total Number
Prec = True Positives / Number of People Predicted Delinquent
Sens = True Positives / Number of People Actually Delinquent
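All three ratios fall straight out of the four confusion counts; a small sketch (the counts here are invented):

```python
def split_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and sensitivity from confusion counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total                    # correct / total
    prec = tp / (tp + fp) if tp + fp else 0.0  # TP / predicted delinquent
    sens = tp / (tp + fn) if tp + fn else 0.0  # TP / actually delinquent
    return acc, prec, sens

acc, prec, sens = split_metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, prec)                               # 0.85 0.8
```

The zero-denominator guards matter in practice: a threshold that predicts no one delinquent has an undefined precision otherwise.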
58. We evaluate possible splits using accuracy,
precision, and sensitivity
[Figure: accuracy, precision, and sensitivity as a function of the
split on the number of times 30-59 days past due (0-100)]
0.61 KGI on Test Set
62. However, not all fields are as informative
Using the number of times past due 60-89 days
we achieve a KGI of 0.5
The approach is naive and could be improved but
our time is better spent on different algorithms
63. Exploring algorithmic choices further,
from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: up next
66. A random forest starts from a decision tree
Customer Data: find the best split in a set of
randomly chosen attributes
Is age <30?
No: 75,000 customers >30
Yes: 25,000 customers <30
...
71. A random forest is composed of many decision trees,
each finding its own best splits of the customer data
Class assignment of a customer is based on how many
of the decision trees “vote” for each class
We use a large number of trees to not over-fit to the
training data
75. The Random Forest algorithm is easily implemented
in Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop since
there is no dependence from one tree to the next
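As a stdlib-only illustration of the idea (resampled trees that vote), the sketch below uses one-level "stump" trees on a single made-up attribute; an actual run would use a library implementation such as scikit-learn's RandomForestClassifier rather than this toy:

```python
import random
from collections import Counter

def best_stump(data):
    """One-level tree: pick the threshold on the single feature
    that classifies this sample of (value, label) pairs best."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in data}):
        correct = sum(1 for x, y in data if (x >= t) == y)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def random_forest(data, n_trees, seed=0):
    """Train each stump on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        stumps.append(best_stump(sample))
    return stumps

def predict(stumps, x):
    """Each tree votes; the majority class wins."""
    votes = Counter(x >= t for t in stumps)
    return votes.most_common(1)[0][0]

data = [(0, False), (1, False), (2, True), (3, True), (4, True)]
forest = random_forest(data, n_trees=25)
print(predict(forest, 0), predict(forest, 4))
```

Because no tree depends on any other, the training loop is trivially parallelizable, which is the property the Mahout/Hadoop point above relies on.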
78. A random forest performs well on the test set
Random Forest
10 trees: 0.779 KGI
150 trees: 0.843 KGI
1000 trees: 0.850 KGI
[Figure: accuracy of random chance, simple classification, and
random forests, plotted on a 0.4-0.9 scale]
83. Exploring algorithmic choices further,
from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: up next
85. Boosting Trees is similar to a Random Forest
Customer Data
Is age <30?
No: customers >30 data
Yes: customers <30 data
...
But instead of splitting on a set of randomly chosen
attributes, we do an exhaustive search for the best split
87. How Gradient Boosting Trees differs from
Random Forest
The first tree is optimized to minimize
a loss function describing the data
The next tree is then optimized to fit whatever
variability the first tree didn’t fit
This is a sequential process, in comparison
to the random forest
We also run the risk of over-fitting to the data,
hence the learning rate
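A toy numeric sketch of that sequential residual fitting (each "stage" here is just a constant standing in for a small regression tree, and the data and rate are invented):

```python
def gradient_boost(ys, n_stages, learning_rate):
    """Each stage fits the residuals the previous stages left
    behind, and is added in scaled by the learning rate."""
    pred = [0.0] * len(ys)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, pred)]
        # A real stage would fit a small tree to the residuals;
        # a constant (their mean) shows the mechanics.
        step = sum(residuals) / len(residuals)
        pred = [p + learning_rate * step for p in pred]
    return pred

pred = gradient_boost([0, 0, 1, 1], n_stages=50, learning_rate=0.1)
print(round(pred[0], 2))   # creeps toward the mean of ys (0.5)
```

The learning rate shrinks each stage's contribution, so the ensemble approaches the target gradually instead of memorizing it in one step.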
92. Implementing Gradient Boosted Trees
In Python or R it is easy for initial testing and validation
There are implementations that use Hadoop but it’s
more complicated to achieve the best performance
94. Gradient Boosting Trees performs well on the dataset
100 trees, 0.1 learning rate: 0.865022 KGI
1000 trees, 0.1 learning rate: 0.865248 KGI
[Figure: KGI (0.75-0.85) as a function of the learning rate (0-0.8)]
[Figure: accuracy of random chance, simple classification, random
forests, and boosting trees, plotted on a 0.4-0.9 scale]
97. Moving one step further in complexity,
from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: up next
98. Or more accurately an ensemble of
ensemble methods
Algorithm progression: Random Forest,
Extremely Random Forest, Gradient Tree Boosting
Each algorithm produces its own column of delinquency
probabilities for the train data (e.g. 0.1, 0.5, 0.01,
0.8, 0.7, ... from one model and 0.15, 0.6, 0.0,
0.75, 0.68, ... from another)
105. Combine all of the model information
Optimize a weighting of the train probabilities
against the known delinquencies
Apply the same weighting scheme to the
set of test data probabilities
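A minimal sketch of that weighting step (the probability columns and labels below are invented, and in practice the weights might be fit by logistic regression rather than this tiny grid search):

```python
# Hypothetical per-customer probabilities from three models
# (random forest, extremely random forest, gradient boosting)
train_probs = [(0.10, 0.15, 0.12), (0.50, 0.60, 0.55),
               (0.01, 0.00, 0.02), (0.80, 0.75, 0.90),
               (0.70, 0.68, 0.74)]
train_labels = [0, 1, 0, 1, 1]        # known delinquencies

def blend(probs, weights):
    """Weighted average of the per-model probabilities."""
    return [sum(w * p for w, p in zip(weights, row)) for row in probs]

def accuracy(blended, labels, cutoff=0.5):
    return sum((b >= cutoff) == bool(y)
               for b, y in zip(blended, labels)) / len(labels)

# Grid-search weight triples summing to 1 on the train probabilities
grid = [(w1, w2, round(1 - w1 - w2, 2))
        for w1 in (0.0, 0.25, 0.5, 0.75, 1.0)
        for w2 in (0.0, 0.25, 0.5, 0.75, 1.0) if w1 + w2 <= 1.0]
best = max(grid, key=lambda w: accuracy(blend(train_probs, w), train_labels))
# The same `best` weights would then be applied to the test probabilities.
print(accuracy(blend(train_probs, best), train_labels))
```

Fitting the weights on the train probabilities and freezing them before touching the test set is what keeps the blend honest.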
108. Implementation can be done in a number of ways
Testing in Python or R is slower, due to the sequential nature
of applying the algorithms
It could be made faster by parallelizing: running each
algorithm separately and combining the results
111. Assessing model performance
Blending Performance, 100 trees: 0.864394 KGI
But this performance and the possibility of
additional gains comes at a distinct time cost.
[Figure: accuracy of random chance, simple classification, random
forests, boosting trees, and the blended method on a 0.4-0.9 scale]
112. Examining the continuum of choices,
from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: 0.864
114. What would be best to implement?
There is a large amount of optimization in the
blended method that could still be done
However, this algorithm takes the longest to run,
and that constraint applies in testing and validation too
Random Forests returns a reasonably good result;
it is quick and easily parallelized
Gradient Tree Boosting returns the best result and
runs reasonably fast, though it is not as easily parallelized
120. Increases in predictive performance have real
business value
Using any of the more complex algorithms we
achieve an increase of 35% in comparison to random
Potential decrease of ~$420k in losses by identifying
customers likely to default in the training set alone