Kaggle Gold Medal: Case Study
My part of our team’s 9th place solution
Alon Bochman
October 8th, 2019
Summary: Credit risk model wins a Kaggle gold medal

Background
• Home Credit (HC) is a global consumer lender focused on the unbanked and underbanked
• In late 2018, HC challenged the Kaggle community to create innovative credit-scoring techniques
• HC provided data on ~350,000 applicants, covering applications, credit bureau files, credit card balances, loan payment history, etc.
• Models were ranked purely on predictive power, as measured by AUROC on a holdout (test) set of applicants

Results
• Our team received a gold medal, ranked 9th out of 7,198 teams
• Ahead of most Kaggle grandmasters
Our Process
Our process ran in four stages: data exploration → feature engineering → modeling → ensembling.

Data exploration
• Challenge: Data provided from 7 sources, not ready to model; hundreds of quantitative and text features; uneven data quality, with many missing and miscoded values
• Solution: Leverage the data to infer structure; aggregate by common IDs plus a matching algorithm; rapid exploration to identify important features

Feature engineering
• Challenge: Text features with high cardinality; non-linear relationships
• Solution: EDA, domain expertise and custom code to identify feature interactions; public Q&A with the competition host to understand the underlying process and dataset construction

Modeling
• Challenge: Highly unbalanced data (only 8% of applicants in the training set defaulted); test and train sets drawn from different populations (time split)
• Solution: Leverage non-linear modeling techniques (XGBoost, LightGBM, KNN, neural networks and others); innovative model composition with misclassification boosting; adversarial validation with stratified k-fold CV

Ensembling
• Challenge: No single model strong enough to win a gold medal
• Solution: Combine models in a stacked ensemble similar to a neural-network architecture; the final ensemble consisted of ~300 models in a 5-layer stack
Data exploration
• 7 datasets with nested many-to-one relationships. For example, an applicant can have:
  • Many previous applications, and
  • Many installment payments per previous application
• Key issue: how to aggregate nested data into model-able features
• Our solution: many different aggregations – see Feature engineering (next slide)
Feature engineering
We created ~3,000 features using domain knowledge, EDA and systematic search. About 80% of total effort was spent here.

Basic techniques
• Numeric
  • Scaling*: min/max, z-score, exponential, percentile rank, winsorizing
  • Rounding precision (how many trailing zeros)
  • Binning
• Categorical (see the sketch below)
  • Label, frequency and mean encoding
  • One-hot encoding*
• Date
  • Dates provided as integers (days since X)
• Missing values
  • Flag hidden NaNs
  • Count by row, % vs. peer group
  • Impute with median, mean, or a model
• Aggregation
  • Standard statistics (min, median, standard deviation, kurtosis, etc.)

* For non-tree models
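For concreteness, here is a minimal sketch of frequency encoding and out-of-fold mean encoding with pandas and scikit-learn. The column names in the usage comments are illustrative; the slides do not show the team's actual code.

```python
import pandas as pd
from sklearn.model_selection import KFold

def frequency_encode(df, col):
    # Map each category to its relative frequency in the data
    return df[col].map(df[col].value_counts(normalize=True))

def mean_encode_oof(df, col, target, n_splits=5, seed=0):
    # Out-of-fold mean encoding: each row is encoded using target means
    # computed on the *other* folds only, which limits target leakage
    encoded = pd.Series(index=df.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    return encoded.fillna(df[target].mean())  # unseen categories: global mean

# Hypothetical usage on the application table:
# apps["occupation_freq"] = frequency_encode(apps, "OCCUPATION_TYPE")
# apps["occupation_mean"] = mean_encode_oof(apps, "OCCUPATION_TYPE", "TARGET")
```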
Advanced techniques
• Numeric
  • Rounding precision (how many trailing zeros)
• Aggregation
  • Statistics vs. peer group
• Triangulation
  • Application vs. bureau
  • Current vs. previous
  • Bureau 1 vs. bureau 2, etc.
• Interaction search (see the sketch below)
  • Using Spearman rank correlation: fast and compatible with the AUC goal
  • Using tree-based models (built into CatBoost)
  • 2-way and 3-way interactions
• Distance
  • KNN with PCA (next slide)
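A sketch of how a Spearman-based interaction search might look. The candidate operations (product, ratio, difference) are assumptions; the slides name the correlation measure but not the search details.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def search_interactions(df, features, target, top=20, min_rows=100):
    # Score simple 2-way interactions by |Spearman rho| with the target.
    # Spearman is rank-based, so it is invariant to monotone transforms
    # and aligns well with a ranking metric such as AUC.
    scores = []
    for a, b in itertools.combinations(features, 2):
        candidates = {
            f"{a}*{b}": df[a] * df[b],
            f"{a}/{b}": df[a] / df[b].replace(0, np.nan),
            f"{a}-{b}": df[a] - df[b],
        }
        for name, values in candidates.items():
            mask = values.notna()
            if mask.sum() < min_rows:
                continue
            rho, _ = spearmanr(values[mask], df.loc[mask, target])
            scores.append((name, abs(rho)))
    return sorted(scores, key=lambda t: -t[1])[:top]
```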
Feature selection
• Stepwise selection
• Linear models:
  • Lasso regression (at the ensemble level)
• Tree models:
  • Split and gain importance (not great for high-cardinality features)
• Permutation importance (see the sketch below)
  • Permuting the features (one feature at a time; the best method, but expensive)
  • Permuting the target (~30 times)
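A minimal sketch of feature-permutation importance using scikit-learn's built-in helper; the team's actual implementation may have differed.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def permutation_ranking(model, X_valid, y_valid, n_repeats=5, seed=0):
    # Shuffle one feature at a time on held-out data and measure the
    # resulting drop in AUC; a large drop means the model relies on it
    result = permutation_importance(
        model, X_valid, y_valid,
        scoring="roc_auc", n_repeats=n_repeats, random_state=seed,
    )
    order = np.argsort(result.importances_mean)[::-1]
    return [(X_valid.columns[i], result.importances_mean[i]) for i in order]
```

The target-permutation variant instead refits the model on a shuffled target (~30 times, per the slide) and keeps features whose real importance beats that null distribution.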
Distance features with KNN
The KNN model contributed many useful features at the base and ensemble levels.

Pipeline: Raw features → Scale → Reduce (optional) → Fit nearest-neighbor model → Generate meta-features

Raw features
• Applied to many feature sets at the parent and child levels
• Applied at the ensemble level (stacking)

Scale
• Treat NULLs
• Numerics: z-score or 0-1 scaling
• Categoricals: mean encoding (out of fold)

Reduce (optional)
• PCA
• Non-negative matrix factorization (NMF)
• Required if the number of features exceeds about 5

Fit nearest-neighbor model
• Produce a vector of nearest neighbors and distances for each observation
• Multiple distance measures: Euclidean, Bray-Curtis, Manhattan
• CV, fit out-of-fold

Generate meta-features (see the sketch below)
• % of the closest K observations in each class (default or non-default), where K is one of [5, 10, 100, 500, 2000, 10000]
• Longest run of consecutive neighbors of each class within the K closest observations
• Distance to the closest neighbor of each class (default and non-default)
• Mean distance to neighbors of each class within the closest K observations
• And many others

Example

KNN output:
Neighbor | Distance | Target
1        | 0.1      | 0
2        | 0.2      | 0
3        | 0.4      | 1
…

Derived meta-features:
Feature                                      | Value
Distance to nearest target=0                 | 0.1
Distance to nearest target=1                 | 0.4
Mean distance to target=0 in top 3 neighbors | 0.15
…

Advantages of this approach
• Non-parametric
• Non-linear
• Complementary: can identify complex patterns other models can't
• Relatively inexpensive computationally
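A minimal sketch of the meta-feature generation step with scikit-learn, assuming the inputs are already scaled/reduced and the reference set is kept out-of-fold as described above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_meta_features(X_ref, y_ref, X_enc, ks=(5, 10, 100)):
    # X_ref/y_ref: out-of-fold reference observations; X_enc: rows to encode
    nn = NearestNeighbors(n_neighbors=max(ks)).fit(X_ref)
    dist, idx = nn.kneighbors(X_enc)   # each of shape (n_enc, max(ks))
    labels = y_ref[idx]                # class of each neighbor
    feats = {}
    for k in ks:
        # Share of defaulters among the k closest neighbors
        feats[f"pct_default_top{k}"] = labels[:, :k].mean(axis=1)
    for cls in (0, 1):
        # Distance to the closest neighbor of each class (inf if none found)
        masked = np.where(labels == cls, dist, np.inf)
        feats[f"dist_nearest_class{cls}"] = masked.min(axis=1)
    return feats
```

Other distance measures can be swapped in via the `metric` argument of `NearestNeighbors` (e.g. `metric="braycurtis"`, which falls back to brute-force search).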
Modeling
We conducted 375 experiments using 11 model types. GBMs were the most effective for this problem.

Models built
• GLM
  • Logistic regression
• Distance
  • KNN
  • SVM*
• Deep learning
  • Keras / TensorFlow
• Tree
  • RandomForest
  • ExtraTrees*
  • XGBoost (GBM)
  • LightGBM (GBM)
  • CatBoost (GBM)
• AutoML
  • H2O Driverless AI
  • TPOT*
  • Others*

Learnings
• Gradient boosting machines (GBMs) were the most effective for this type of problem, due to:
  • Structured data of moderate size (<1M observations)
  • Nonlinear, complex relationships
  • Usefulness for feature selection
• All model types contributed to the winning solution by creating diversity (see the footnote for exceptions)
• GBMs and NNs benefited from (see the sketch below):
  • Bayesian parameter optimization
  • Early stopping (iterations / epochs)
  • Bagging (5-10x with random seeds)
• Despite vendor claims, AutoML solutions were unimpressive at the time (June-August 2018)
  • H2O Driverless AI had moderate standalone performance, but added value to the ensemble

* Did not contribute significantly to the winning solution
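To illustrate the early-stopping and seed-bagging points, a sketch using LightGBM's modern callback API; the hyperparameters are placeholders, not the team's settings.

```python
import numpy as np
import lightgbm as lgb

def bagged_lgbm(X_tr, y_tr, X_va, y_va, X_te, n_bags=5):
    # Train the same model under several random seeds and average the
    # predictions; early stopping on the validation fold picks the
    # number of trees for each bag
    preds = []
    for seed in range(n_bags):
        model = lgb.LGBMClassifier(
            n_estimators=10000, learning_rate=0.02, random_state=seed)
        model.fit(
            X_tr, y_tr,
            eval_set=[(X_va, y_va)], eval_metric="auc",
            callbacks=[lgb.early_stopping(200, verbose=False)])
        preds.append(model.predict_proba(X_te)[:, 1])
    return np.mean(preds, axis=0)
```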
Modeling many-to-one relationships
A key technique was to model at the most granular (child) level and aggregate the predictions to the parent level. This worked for nested relationships as well, such as bureau_balance to bureau to application.

Parent level (application.csv):

App ID | Prev App IDs | Prev App Features…
App1   | Prev1, Prev2 | …
App2   | Prev3        | …

(Applicant 1 had two previous applications: Prev1 and Prev2.)

Predict at the child level (previous_application.csv):

App ID | Prev ID | Prev App Features… | Target | Pred
App1   | Prev1   | …                  | 0      | 0.1
App1   | Prev2   | …                  | 0      | 0.2
App2   | Prev3   | …                  | 1      | 0.9

• Merge in the target variable from the parent level (application) using a SQL JOIN
• Fit a model at the child level (previous_application)

Aggregate to the parent level, grouped by App ID:

App ID | Mean Pred | Min Pred | Other Stats…
App1   | 0.15      | 0.1      | …
App2   | 0.9       | 0.9      | …

• Group the predictions by the parent level (App ID)
• Calculate statistics such as mean, median, max, stdev, skewness, kurtosis, 5th percentile, first, last, trend, etc.
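A sketch of the procedure in pandas/LightGBM, using the competition's ID columns (SK_ID_CURR, SK_ID_PREV) but otherwise illustrative. It assumes the child feature columns are numeric; in practice the child-level folds should be grouped by applicant so an applicant's own rows never inform its prediction.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import cross_val_predict

def child_level_features(prev, apps):
    # 1. Merge the parent target down to each child row (SQL-style join)
    child = prev.merge(apps[["SK_ID_CURR", "TARGET"]], on="SK_ID_CURR")
    X = child.drop(columns=["SK_ID_CURR", "SK_ID_PREV", "TARGET"])
    # 2. Out-of-fold predictions at the child level
    child["pred"] = cross_val_predict(
        lgb.LGBMClassifier(), X, child["TARGET"], cv=5, method="predict_proba"
    )[:, 1]
    # 3. Aggregate the child predictions back to the parent level
    return (child.groupby("SK_ID_CURR")["pred"]
                 .agg(["mean", "min", "max", "median", "std", "skew"])
                 .add_prefix("prev_pred_"))
```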
Ensembling
Ensembling and stacking are critical to most Kaggle contests. The key is to identify strong but uncorrelated classifiers.

Correlated: no benefit

Classifier | Prediction | Accuracy
A          | 1111111100 | 80%
B          | 1111111100 | 80%
C          | 1011111100 | 70%
Majority   | 1111111100 | 80%

Less correlated: high benefit

Classifier | Prediction | Accuracy
D          | 1111111100 | 80%
E          | 0111011101 | 70%
F          | 1000101111 | 60%
Majority   | 1111111101 | 90%

Ground truth: 1111111111
Source: MLWave.com

"Majority vote" is a simple ensemble. We can get fancier with:
• Weighted average
• Rank average
• An L2 model (stack)
• And so on
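As an example of one of the fancier combinations, a rank-average sketch; this is an assumption of how it might be coded, as the slide only names the technique.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(predictions, weights=None):
    # Convert each model's scores to normalized ranks before averaging,
    # so differently-scaled outputs contribute equally; for a ranking
    # metric like AUC, only the ordering matters anyway
    ranks = [rankdata(p) / len(p) for p in predictions]
    return np.average(ranks, axis=0, weights=weights)
```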
Stacking
Our solution consisted of ~300 models in a 5-layer stack.

Architecture: the 7 raw datasets were turned into 22 feature sets (X1 … X22), ~3,000 features in total. The feature sets fed the level-1 models (L1.1 … L1.11), whose out-of-fold outputs formed 95 L1 predictions. Those fed the level-2 models (L2.1 … L2.5), producing the L2 predictions, and so on up to a single L5 model, whose predictions were submitted to Kaggle.

At every level we used:
• Stratified 5-fold cross-validation
• Out-of-fold predictions, gathered as the next level's dataset
• Bayesian and random parameter optimization
• Feature engineering and selection, applied iteratively

At higher levels of the stack, we used fewer models, simpler models, and/or stronger regularization. We continued adding levels until the local CV score stopped improving meaningfully.
Cross-model boosting
We were able to extract additional signal by stacking two models together using an algorithm similar to gradient boosting, with a focus on misclassification errors. This approach is novel as far as we know.

Motivation
• Not all errors are equally important to fix; some are more "expensive" to the AUC. On the ROC curve (TPR vs. FPR), the worst false positives and the worst false negatives cost the most area
• We would like to create / select features (or feature sets) that add to our signal
• Ranking by model A's classification error (Ŷ − Y): the worst 10% false positives are "safe-looking borrowers that defaulted"; the worst 10% false negatives are "risky-looking borrowers that did not default"

Procedure (see the sketch below)
1. Fit classifier A on feature set X1. Get its out-of-fold predictions
2. Tag the worst 10% false positives as class=1 and the rest as class=0
3. Fit classifier B on feature set X2 to predict this class. Call this the FP model
4. Similarly, create the FN model
5. Fit classifier C on the (OOF) predictions from models A, FP and FN and their interactions (several variations possible)

Rationale
• Classifier B picks the X2 features most complementary to X1
• Classifier C fixes A's worst errors
• But: watch out for overfitting
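A sketch of the five-step procedure. The model choice (LightGBM), the exact error ranking, and the interaction terms are assumptions; the slides give only the outline. Note the slide's "false positives" are borrowers scored safe who in fact defaulted, and vice versa for false negatives.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import cross_val_predict

def top_frac_mask(err, frac):
    # Tag the `frac` of rows with the largest error as class 1
    mask = np.zeros(len(err), dtype=int)
    mask[np.argsort(err)[-int(len(err) * frac):]] = 1
    return mask

def misclassification_boost(X1, X2, y, frac=0.10):
    oof = lambda X, t: cross_val_predict(
        lgb.LGBMClassifier(), X, t, cv=5, method="predict_proba")[:, 1]

    pred_a = oof(X1, y)                                   # step 1: model A
    # Step 2: "safe-looking borrowers that defaulted" (y=1, low score)
    fp_target = top_frac_mask(np.where(y == 1, 1 - pred_a, 0.0), frac)
    # "Risky-looking borrowers that did not default" (y=0, high score)
    fn_target = top_frac_mask(np.where(y == 0, pred_a, 0.0), frac)

    pred_fp = oof(X2, fp_target)                          # step 3: FP model
    pred_fn = oof(X2, fn_target)                          # step 4: FN model

    # Step 5: classifier C on A, FP, FN predictions and their interactions
    Xc = np.column_stack([pred_a, pred_fp, pred_fn,
                          pred_a * pred_fp, pred_a * pred_fn])
    return oof(Xc, y)
```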
Team coordination

Problems
• 8 teammates in 8 timezones; only 2-3 awake and online at any given time
• Varying skill levels, experience with Kaggle, strengths and availability
• Everyone is a volunteer – not like at work
• A limited number of submissions allowed – they must be coordinated
• Infinite work, limited time – just like in real life

Our solution
• Slack! Theme-specific channels helped focus the discussion: feature engineering, reusable code (a poor man's git), out-of-fold predictions for stacking, intros for new team members, and coordinating submissions
• Shared validation scheme: stratified 5-fold with a shared random seed. This produces comparable results and OOFs for stacking
• Everyone announces their work direction and progress
• OOFs and stacking datasets posted for the full team to use
• Much room for improvement, here and elsewhere
What didn't work, what we missed
Despite our gold-medal finish, many of our experiments did not work, and we learned a great deal from other top teams.

What didn't work
• Auto-regressive approach on the time series
• Binning the time series
• Symbolic regression with genetic feature generation (DEAP)
• AUC oracle probing
• NMF factorization
• t-SNE projection
• Learning rate decay
• Different numbers of folds (3, 10, 15)
• Adversarial validation

What we missed
• Interest rate imputation
• Better feature selection: the 3rd-place solution used just 150 features
• DAE + NN (a component of the 1st-place solution)
• Encoding payment history as an "image" and running it through a CNN
• Same borrowers appearing under different IDs
• And many more…
THANK YOU!
Special thanks to my fantastic team: Michael Penrose, Corey Levinson, Sai Suchith
Mahajan, Misha Lisovyi, Tom Aindow and Zipp!
APPENDIX
Data Exploration: Selected Findings

Finding: Only 8% of borrowers defaulted in the training set
Implication: Stratified k-fold validation scheme

Finding: Extreme outliers in time-based variables, e.g. a 1,000-year employment history. These applicants default less often (5.4% vs. 8.7%); a similar effect appears with some income outliers, e.g. $10M annual income
Implication: Encode the outliers as NULLs; encode outlier flags for algorithms that aren't NULL-friendly (e.g. GLM, scikit-learn's RandomForest)

Finding: Up to 70% missing data in certain variables
Implication: Create features on the missing data (count, groupby)

Finding: Certain categorical variables have high cardinality
Implication: Frequency encoding; mean encoding (out-of-fold); text processing to create lower-cardinality groups
Selected high-value (top 1%) engineered features
Listed from simpler to more complex; the last one is sketched in code below.

Feature: Credit requested / annual loan payment
Why it was useful: A proxy for loan duration. Longer loans are riskier, all else equal

Feature: Variance of (debt / credit) reported by the credit bureaus
Why it was useful: Debt / credit is a proxy for borrowing flexibility, i.e. financial capacity. When the bureaus paint a consistent picture, the applicant is better known and safer

Feature: Financial product (card, revolving loan, line of credit, etc.) applied for in the most recent application
Why it was useful: More predictive than an aggregation of the full application history

Feature: Unweighted mean of all mean encodings by row
Why it was useful: The mean reduced the variance of the mean encodings

Feature: Minimum of all mean encodings by row
Why it was useful: Added sensitivity for borderline applicants, similar to a worst-case scenario

Feature: (Proposed purchase price / credit requested), ranked within groups defined by whether a work phone was provided
Why it was useful: The ratio normalized for different income levels, and its NULLs were also predictive; ranking corrected for non-linearity; grouping made the comparison fairer (within-group rank was more predictive than across-group)
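The last feature translates naturally to pandas. A sketch, with hypothetical column choices (the slide describes the feature but not the exact fields used):

```python
import pandas as pd

def grouped_ratio_rank(apps: pd.DataFrame) -> pd.Series:
    # Proposed purchase price relative to credit requested
    ratio = apps["AMT_GOODS_PRICE"] / apps["AMT_CREDIT"]
    # Percentile-rank the ratio within each work-phone group; NaNs stay
    # NaN, which is itself predictive for NULL-friendly models
    return ratio.groupby(apps["FLAG_WORK_PHONE"]).rank(pct=True)
```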
Out-of-fold predictions
Our 5-fold validation scheme allowed us to create out-of-fold predictions for each model (see the sketch below).

1. Split the train set into 5 folds (stratified)
2. Train model L1.1 on folds 2 through 5 and predict on fold 1. These are the out-of-fold predictions for fold 1. Save the model weights
3. Repeat step 2 to create out-of-fold predictions for all folds
4. Average the trained fold models to predict the test set; the mean of their predictions becomes the test-set meta-feature

(The slide's diagram shows each fold being predicted by a model learned on the other four, and the test set being predicted by all five fold models, averaged into one meta-feature.)
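A minimal sketch of steps 1-4 with scikit-learn, assuming numpy inputs and a probabilistic classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def oof_predictions(model, X, y, X_test, n_splits=5, seed=42):
    # Returns (train meta-feature, test meta-feature) as described above
    oof = np.zeros(len(X))
    test_preds = np.zeros((n_splits, len(X_test)))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for i, (learn_idx, pred_idx) in enumerate(skf.split(X, y)):
        m = clone(model).fit(X[learn_idx], y[learn_idx])    # learn on 4 folds
        oof[pred_idx] = m.predict_proba(X[pred_idx])[:, 1]  # held-out fold
        test_preds[i] = m.predict_proba(X_test)[:, 1]       # test set
    return oof, test_preds.mean(axis=0)                     # average fold models
```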
Stacking
We concatenate the L1 predictions as columns into the L1 dataset. Then we can fit an L2 stacking model on top of it. The process repeats for higher levels.

The L1 dataset spans both the train and test sets (meta-features prepared as on the previous slide), with one column per L1 model:

ID | TARGET | Logistic | Random Forest | LightGBM | Neural Network | Others…

ID and TARGET come from the raw data; the remaining columns are predictions from the L1 models. A level-2 model (logistic regression) fit on the train rows produces the L2 meta-feature and the test predictions submitted to Kaggle (see the sketch below).
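A minimal sketch of the L2 step, assuming the out-of-fold routine from the previous slide supplies the L1 columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_l2_stack(oof_preds, test_preds, y):
    # oof_preds / test_preds: dicts mapping L1 model name -> 1-d predictions,
    # produced by the out-of-fold routine on the previous slide
    L1_train = pd.DataFrame(oof_preds)   # one column per L1 model
    L1_test = pd.DataFrame(test_preds)
    l2 = LogisticRegression(max_iter=1000).fit(L1_train, y)
    # In a deeper stack, this L2 prediction would itself be generated
    # out-of-fold and fed to L3, and so on
    return l2.predict_proba(L1_test)[:, 1]
```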
Editor's Notes

1. Why Kaggle?
• Largest DS community globally, with >120k ranked competitors
• Great place to learn: competition and cooperation
• Time and resource constraints, just like real project work
• Level playing field + objective performance evaluation
• Focus on what works in practice vs. theory
• Can evaluate competing algorithms, pipelines and scientists
• Competition pushes the envelope: how much signal can we possibly squeeze from the data? Can sometimes advance the state of the art