Kaggle Gold Medal: Case Study
My part of our team’s 9th place solution
Alon Bochman
October 8th, 2019
Summary: Credit risk model wins a Kaggle gold medal

Background
• Home Credit (HC) is a global consumer lender focused on the unbanked and underbanked
• In late 2018, HC challenged the Kaggle community to create innovative credit-scoring techniques
• HC provided data on ~350,000 applicants, covering applications, credit bureau files, credit card balances, loan payment history, etc.
• Models were ranked purely on predictive power, as measured by AUROC on a holdout (test) set of applicants

Results
• Our team received a gold medal, ranked 9th out of 7,198 teams
• Ahead of most Kaggle grandmasters
Our Process
Our process ran in four stages: data exploration → feature engineering → modeling → ensembling.

Data exploration
• Challenge: Data provided from 7 sources, not ready to model; hundreds of quantitative and text features; uneven data quality, with many missing and miscoded values
• Solution: Leverage the data to infer structure; aggregate by common IDs plus a matching algorithm; rapid exploration to identify important features

Feature engineering
• Challenge: Text features with high cardinality; non-linear relationships
• Solution: EDA, domain expertise and custom code to identify feature interactions; public Q&A with the competition host to understand the underlying process and dataset construction

Modeling
• Challenge: Highly unbalanced data (only 8% of applicants in the training set defaulted); test and train sets drawn from different populations (time split)
• Solution: Leverage non-linear modeling techniques (XGBoost, LightGBM, KNN, neural networks and others); innovative model composition with misclassification boosting; adversarial validation with stratified k-fold CV

Ensembling
• Challenge: No single model strong enough to win a gold medal
• Solution: Combine models in a stacked ensemble similar to a neural-network architecture; the final ensemble consisted of ~300 models in a 5-layer stack
Data exploration
• 7 datasets with nested many-to-one relationships. For example, an applicant can have:
  • Many previous applications, and
  • Many installment payments per previous application
• Key issue: how to aggregate nested data into model-able features
• Our solution: many different aggregations – see Feature engineering (next slide)
Feature engineering
We created ~3,000 features using domain knowledge, EDA and systematic search. About 80% of total effort was spent here.

Basic techniques
• Numeric
  • Scaling*: min/max, z-score, exponential, percentile rank, winsorizing
  • Rounding precision (how many trailing zeros)
  • Binning
• Categorical (see the sketch below)
  • Label, frequency and mean encoding
  • One-hot encoding*
• Date
  • Dates provided as integers (days since X)
• Missing values
  • Flag hidden NaNs
  • Count by row, % vs. peer group
  • Impute with median, mean, or a model
• Aggregation
  • Standard statistics (min, median, standard deviation, kurtosis, etc.)

* For non-tree models
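For concreteness, here is a minimal sketch of frequency encoding and out-of-fold mean encoding with pandas and scikit-learn. The column names in the usage comments are illustrative; the slides do not show the team's actual code.

```python
import pandas as pd
from sklearn.model_selection import KFold

def frequency_encode(df, col):
    # Map each category to its relative frequency in the data
    return df[col].map(df[col].value_counts(normalize=True))

def mean_encode_oof(df, col, target, n_splits=5, seed=0):
    # Out-of-fold mean encoding: each row is encoded using target means
    # computed on the *other* folds only, which limits target leakage
    encoded = pd.Series(index=df.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    return encoded.fillna(df[target].mean())  # unseen categories: global mean

# Hypothetical usage on the application table:
# apps["occupation_freq"] = frequency_encode(apps, "OCCUPATION_TYPE")
# apps["occupation_mean"] = mean_encode_oof(apps, "OCCUPATION_TYPE", "TARGET")
```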
Advanced techniques
• Numeric
  • Rounding precision (how many trailing zeros)
• Aggregation
  • Statistics vs. peer group
• Triangulation
  • Application vs. bureau
  • Current vs. previous
  • Bureau 1 vs. bureau 2, etc.
• Interaction search (see the sketch below)
  • Using Spearman rank correlation: fast and compatible with the AUC goal
  • Using tree-based models (built into CatBoost)
  • 2-way and 3-way interactions
• Distance
  • KNN with PCA (next slide)
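A sketch of how a Spearman-based interaction search might look. The candidate operations (product, ratio, difference) are assumptions; the slides name the correlation measure but not the search details.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def search_interactions(df, features, target, top=20, min_rows=100):
    # Score simple 2-way interactions by |Spearman rho| with the target.
    # Spearman is rank-based, so it is invariant to monotone transforms
    # and aligns well with a ranking metric such as AUC.
    scores = []
    for a, b in itertools.combinations(features, 2):
        candidates = {
            f"{a}*{b}": df[a] * df[b],
            f"{a}/{b}": df[a] / df[b].replace(0, np.nan),
            f"{a}-{b}": df[a] - df[b],
        }
        for name, values in candidates.items():
            mask = values.notna()
            if mask.sum() < min_rows:
                continue
            rho, _ = spearmanr(values[mask], df.loc[mask, target])
            scores.append((name, abs(rho)))
    return sorted(scores, key=lambda t: -t[1])[:top]
```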
Feature selection
• Stepwise selection
• Linear models:
  • Lasso regression (at the ensemble level)
• Tree models:
  • Split and gain importance (not great for high-cardinality features)
• Permutation importance (see the sketch below)
  • Permuting the features (one feature at a time; the best method, but expensive)
  • Permuting the target (~30 times)
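A minimal sketch of feature-permutation importance using scikit-learn's built-in helper; the team's actual implementation may have differed.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def permutation_ranking(model, X_valid, y_valid, n_repeats=5, seed=0):
    # Shuffle one feature at a time on held-out data and measure the
    # resulting drop in AUC; a large drop means the model relies on it
    result = permutation_importance(
        model, X_valid, y_valid,
        scoring="roc_auc", n_repeats=n_repeats, random_state=seed,
    )
    order = np.argsort(result.importances_mean)[::-1]
    return [(X_valid.columns[i], result.importances_mean[i]) for i in order]
```

The target-permutation variant instead refits the model on a shuffled target (~30 times, per the slide) and keeps features whose real importance beats that null distribution.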
Distance features with KNN
The KNN model contributed many useful features at the base and ensemble levels.

Pipeline: Raw features → Scale → Reduce (optional) → Fit nearest-neighbor model → Generate meta-features

Raw features
• Applied to many feature sets at the parent and child levels
• Applied at the ensemble level (stacking)

Scale
• Treat NULLs
• Numerics: z-score or 0-1 scaling
• Categoricals: mean encoding (out of fold)

Reduce (optional)
• PCA
• Non-negative matrix factorization (NMF)
• Required if the number of features exceeds about 5

Fit nearest-neighbor model
• Produce a vector of nearest neighbors and distances for each observation
• Multiple distance measures: Euclidean, Bray-Curtis, Manhattan
• CV, fit out-of-fold

Generate meta-features (see the sketch below)
• % of the closest K observations in each class (default or non-default), where K is one of [5, 10, 100, 500, 2000, 10000]
• Longest run of consecutive neighbors of each class within the K closest observations
• Distance to the closest neighbor of each class (default and non-default)
• Mean distance to neighbors of each class within the closest K observations
• And many others

Example

KNN output:
Neighbor | Distance | Target
1        | 0.1      | 0
2        | 0.2      | 0
3        | 0.4      | 1
…

Derived meta-features:
Feature                                      | Value
Distance to nearest target=0                 | 0.1
Distance to nearest target=1                 | 0.4
Mean distance to target=0 in top 3 neighbors | 0.15
…

Advantages of this approach
• Non-parametric
• Non-linear
• Complementary: can identify complex patterns other models can't
• Relatively inexpensive computationally
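A minimal sketch of the meta-feature generation step with scikit-learn, assuming the inputs are already scaled/reduced and the reference set is kept out-of-fold as described above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_meta_features(X_ref, y_ref, X_enc, ks=(5, 10, 100)):
    # X_ref/y_ref: out-of-fold reference observations; X_enc: rows to encode
    nn = NearestNeighbors(n_neighbors=max(ks)).fit(X_ref)
    dist, idx = nn.kneighbors(X_enc)   # each of shape (n_enc, max(ks))
    labels = y_ref[idx]                # class of each neighbor
    feats = {}
    for k in ks:
        # Share of defaulters among the k closest neighbors
        feats[f"pct_default_top{k}"] = labels[:, :k].mean(axis=1)
    for cls in (0, 1):
        # Distance to the closest neighbor of each class (inf if none found)
        masked = np.where(labels == cls, dist, np.inf)
        feats[f"dist_nearest_class{cls}"] = masked.min(axis=1)
    return feats
```

Other distance measures can be swapped in via the `metric` argument of `NearestNeighbors` (e.g. `metric="braycurtis"`, which falls back to brute-force search).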
Modeling
We conducted 375 experiments using 11 model types. GBMs were the most effective for this problem.

Models built
• GLM
  • Logistic regression
• Distance
  • KNN
  • SVM*
• Deep learning
  • Keras / TensorFlow
• Tree
  • RandomForest
  • ExtraTrees*
  • XGBoost (GBM)
  • LightGBM (GBM)
  • CatBoost (GBM)
• AutoML
  • H2O Driverless AI
  • TPOT*
  • Others*

Learnings
• Gradient boosting machines (GBMs) were the most effective for this type of problem, due to:
  • Structured data of moderate size (<1M observations)
  • Nonlinear, complex relationships
  • Usefulness for feature selection
• All model types contributed to the winning solution by creating diversity (see the footnote for exceptions)
• GBMs and NNs benefited from (see the sketch below):
  • Bayesian parameter optimization
  • Early stopping (iterations / epochs)
  • Bagging (5-10x with random seeds)
• Despite vendor claims, AutoML solutions were unimpressive at the time (June-August 2018)
  • H2O Driverless AI had moderate standalone performance, but added value to the ensemble

* Did not contribute significantly to the winning solution
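To illustrate the early-stopping and seed-bagging points, a sketch using LightGBM's modern callback API; the hyperparameters are placeholders, not the team's settings.

```python
import numpy as np
import lightgbm as lgb

def bagged_lgbm(X_tr, y_tr, X_va, y_va, X_te, n_bags=5):
    # Train the same model under several random seeds and average the
    # predictions; early stopping on the validation fold picks the
    # number of trees for each bag
    preds = []
    for seed in range(n_bags):
        model = lgb.LGBMClassifier(
            n_estimators=10000, learning_rate=0.02, random_state=seed)
        model.fit(
            X_tr, y_tr,
            eval_set=[(X_va, y_va)], eval_metric="auc",
            callbacks=[lgb.early_stopping(200, verbose=False)])
        preds.append(model.predict_proba(X_te)[:, 1])
    return np.mean(preds, axis=0)
```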
Modeling many-to-one relationships
A key technique was to model at the most granular (child) level and aggregate the predictions to the parent level. This worked for nested relationships as well, such as bureau_balance to bureau to application.

Parent level (application.csv):

App ID | Prev App IDs | Prev App Features…
App1   | Prev1, Prev2 | …
App2   | Prev3        | …

(Applicant 1 had two previous applications: Prev1 and Prev2.)

Predict at the child level (previous_application.csv):

App ID | Prev ID | Prev App Features… | Target | Pred
App1   | Prev1   | …                  | 0      | 0.1
App1   | Prev2   | …                  | 0      | 0.2
App2   | Prev3   | …                  | 1      | 0.9

• Merge in the target variable from the parent level (application) using a SQL JOIN
• Fit a model at the child level (previous_application)

Aggregate to the parent level, grouped by App ID:

App ID | Mean Pred | Min Pred | Other Stats…
App1   | 0.15      | 0.1      | …
App2   | 0.9       | 0.9      | …

• Group the predictions by the parent level (App ID)
• Calculate statistics such as mean, median, max, stdev, skewness, kurtosis, 5th percentile, first, last, trend, etc.
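A sketch of the procedure in pandas/LightGBM, using the competition's ID columns (SK_ID_CURR, SK_ID_PREV) but otherwise illustrative. It assumes the child feature columns are numeric; in practice the child-level folds should be grouped by applicant so an applicant's own rows never inform its prediction.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import cross_val_predict

def child_level_features(prev, apps):
    # 1. Merge the parent target down to each child row (SQL-style join)
    child = prev.merge(apps[["SK_ID_CURR", "TARGET"]], on="SK_ID_CURR")
    X = child.drop(columns=["SK_ID_CURR", "SK_ID_PREV", "TARGET"])
    # 2. Out-of-fold predictions at the child level
    child["pred"] = cross_val_predict(
        lgb.LGBMClassifier(), X, child["TARGET"], cv=5, method="predict_proba"
    )[:, 1]
    # 3. Aggregate the child predictions back to the parent level
    return (child.groupby("SK_ID_CURR")["pred"]
                 .agg(["mean", "min", "max", "median", "std", "skew"])
                 .add_prefix("prev_pred_"))
```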
Ensembling
Ensembling and stacking are critical to most Kaggle contests. The key is to identify strong but uncorrelated classifiers.

Correlated: no benefit

Classifier | Prediction | Accuracy
A          | 1111111100 | 80%
B          | 1111111100 | 80%
C          | 1011111100 | 70%
Majority   | 1111111100 | 80%

Less correlated: high benefit

Classifier | Prediction | Accuracy
D          | 1111111100 | 80%
E          | 0111011101 | 70%
F          | 1000101111 | 60%
Majority   | 1111111101 | 90%

Ground truth: 1111111111
Source: MLWave.com

"Majority vote" is a simple ensemble. We can get fancier with:
• Weighted average
• Rank average
• An L2 model (stack)
• And so on
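As an example of one of the fancier combinations, a rank-average sketch; this is an assumption of how it might be coded, as the slide only names the technique.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(predictions, weights=None):
    # Convert each model's scores to normalized ranks before averaging,
    # so differently-scaled outputs contribute equally; for a ranking
    # metric like AUC, only the ordering matters anyway
    ranks = [rankdata(p) / len(p) for p in predictions]
    return np.average(ranks, axis=0, weights=weights)
```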
Stacking
Our solution consisted of ~300 models in a 5-layer stack.

Architecture: the 7 raw datasets were turned into 22 feature sets (X1 … X22), ~3,000 features in total. The feature sets fed the level-1 models (L1.1 … L1.11), whose out-of-fold outputs formed 95 L1 predictions. Those fed the level-2 models (L2.1 … L2.5), producing the L2 predictions, and so on up to a single L5 model, whose predictions were submitted to Kaggle.

At every level we used:
• Stratified 5-fold cross-validation
• Out-of-fold predictions, gathered as the next level's dataset
• Bayesian and random parameter optimization
• Feature engineering and selection, applied iteratively

At higher levels of the stack, we used fewer models, simpler models, and/or stronger regularization. We continued adding levels until the local CV score stopped improving meaningfully.
Cross-model boosting
We were able to extract additional signal by stacking two models together using an algorithm similar to gradient boosting, with a focus on misclassification errors. This approach is novel as far as we know.

Motivation
• Not all errors are equally important to fix; some are more "expensive" to the AUC. On the ROC curve (TPR vs. FPR), the worst false positives and the worst false negatives cost the most area
• We would like to create / select features (or feature sets) that add to our signal
• Ranking by model A's classification error (Ŷ − Y): the worst 10% false positives are "safe-looking borrowers that defaulted"; the worst 10% false negatives are "risky-looking borrowers that did not default"

Procedure (see the sketch below)
1. Fit classifier A on feature set X1. Get its out-of-fold predictions
2. Tag the worst 10% false positives as class=1 and the rest as class=0
3. Fit classifier B on feature set X2 to predict this class. Call this the FP model
4. Similarly, create the FN model
5. Fit classifier C on the (OOF) predictions from models A, FP and FN and their interactions (several variations possible)

Rationale
• Classifier B picks the X2 features most complementary to X1
• Classifier C fixes A's worst errors
• But: watch out for overfitting
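A sketch of the five-step procedure. The model choice (LightGBM), the exact error ranking, and the interaction terms are assumptions; the slides give only the outline. Note the slide's "false positives" are borrowers scored safe who in fact defaulted, and vice versa for false negatives.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import cross_val_predict

def top_frac_mask(err, frac):
    # Tag the `frac` of rows with the largest error as class 1
    mask = np.zeros(len(err), dtype=int)
    mask[np.argsort(err)[-int(len(err) * frac):]] = 1
    return mask

def misclassification_boost(X1, X2, y, frac=0.10):
    oof = lambda X, t: cross_val_predict(
        lgb.LGBMClassifier(), X, t, cv=5, method="predict_proba")[:, 1]

    pred_a = oof(X1, y)                                   # step 1: model A
    # Step 2: "safe-looking borrowers that defaulted" (y=1, low score)
    fp_target = top_frac_mask(np.where(y == 1, 1 - pred_a, 0.0), frac)
    # "Risky-looking borrowers that did not default" (y=0, high score)
    fn_target = top_frac_mask(np.where(y == 0, pred_a, 0.0), frac)

    pred_fp = oof(X2, fp_target)                          # step 3: FP model
    pred_fn = oof(X2, fn_target)                          # step 4: FN model

    # Step 5: classifier C on A, FP, FN predictions and their interactions
    Xc = np.column_stack([pred_a, pred_fp, pred_fn,
                          pred_a * pred_fp, pred_a * pred_fn])
    return oof(Xc, y)
```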
Team coordination

Problems
• 8 teammates in 8 timezones; only 2-3 awake and online at any given time
• Varying skill levels, experience with Kaggle, strengths and availability
• Everyone is a volunteer – not like at work
• A limited number of submissions allowed – they must be coordinated
• Infinite work, limited time – just like in real life

Our solution
• Slack! Theme-specific channels helped focus the discussion: feature engineering, reusable code (a poor man's git), out-of-fold predictions for stacking, intros for new team members, and coordinating submissions
• Shared validation scheme: stratified 5-fold with a shared random seed. This produces comparable results and OOFs for stacking
• Everyone announces their work direction and progress
• OOFs and stacking datasets posted for the full team to use
• Much room for improvement, here and elsewhere
What didn't work, what we missed
Despite our gold-medal finish, many of our experiments did not work, and we learned a great deal from other top teams.

What didn't work
• Auto-regressive approach on the time series
• Binning the time series
• Symbolic regression with genetic feature generation (DEAP)
• AUC oracle probing
• NMF factorization
• t-SNE projection
• Learning rate decay
• Different numbers of folds (3, 10, 15)
• Adversarial validation

What we missed
• Interest rate imputation
• Better feature selection: the 3rd-place solution used just 150 features
• DAE + NN (a component of the 1st-place solution)
• Encoding payment history as an "image" and running it through a CNN
• Same borrowers appearing under different IDs
• And many more…
THANK YOU!
Special thanks to my fantastic team: Michael Penrose, Corey Levinson, Sai Suchith
Mahajan, Misha Lisovyi, Tom Aindow and Zipp!
APPENDIX
Data Exploration: Selected Findings

Finding: Only 8% of borrowers defaulted in the training set
Implication: Stratified k-fold validation scheme

Finding: Extreme outliers in time-based variables, e.g. a 1,000-year employment history. These applicants default less often (5.4% vs. 8.7%); a similar effect appears with some income outliers, e.g. $10M annual income
Implication: Encode the outliers as NULLs; encode outlier flags for algorithms that aren't NULL-friendly (e.g. GLM, scikit-learn's RandomForest)

Finding: Up to 70% missing data in certain variables
Implication: Create features on the missing data (count, groupby)

Finding: Certain categorical variables have high cardinality
Implication: Frequency encoding; mean encoding (out-of-fold); text processing to create lower-cardinality groups
Selected high-value (top 1%) engineered features
Listed from simpler to more complex; the last one is sketched in code below.

Feature: Credit requested / annual loan payment
Why it was useful: A proxy for loan duration. Longer loans are riskier, all else equal

Feature: Variance of (debt / credit) reported by the credit bureaus
Why it was useful: Debt / credit is a proxy for borrowing flexibility, i.e. financial capacity. When the bureaus paint a consistent picture, the applicant is better known and safer

Feature: Financial product (card, revolving loan, line of credit, etc.) applied for in the most recent application
Why it was useful: More predictive than an aggregation of the full application history

Feature: Unweighted mean of all mean encodings by row
Why it was useful: The mean reduced the variance of the mean encodings

Feature: Minimum of all mean encodings by row
Why it was useful: Added sensitivity for borderline applicants, similar to a worst-case scenario

Feature: (Proposed purchase price / credit requested), ranked within groups defined by whether a work phone was provided
Why it was useful: The ratio normalized for different income levels, and its NULLs were also predictive; ranking corrected for non-linearity; grouping made the comparison fairer (within-group rank was more predictive than across-group)
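The last feature translates naturally to pandas. A sketch, with hypothetical column choices (the slide describes the feature but not the exact fields used):

```python
import pandas as pd

def grouped_ratio_rank(apps: pd.DataFrame) -> pd.Series:
    # Proposed purchase price relative to credit requested
    ratio = apps["AMT_GOODS_PRICE"] / apps["AMT_CREDIT"]
    # Percentile-rank the ratio within each work-phone group; NaNs stay
    # NaN, which is itself predictive for NULL-friendly models
    return ratio.groupby(apps["FLAG_WORK_PHONE"]).rank(pct=True)
```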
Out-of-fold predictions
Our 5-fold validation scheme allowed us to create out-of-fold predictions for each model (see the sketch below).

1. Split the train set into 5 folds (stratified)
2. Train model L1.1 on folds 2 through 5 and predict on fold 1. These are the out-of-fold predictions for fold 1. Save the model weights
3. Repeat step 2 to create out-of-fold predictions for all folds
4. Average the trained fold models to predict the test set; the mean of their predictions becomes the test-set meta-feature

(The slide's diagram shows each fold being predicted by a model learned on the other four, and the test set being predicted by all five fold models, averaged into one meta-feature.)
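A minimal sketch of steps 1-4 with scikit-learn, assuming numpy inputs and a probabilistic classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def oof_predictions(model, X, y, X_test, n_splits=5, seed=42):
    # Returns (train meta-feature, test meta-feature) as described above
    oof = np.zeros(len(X))
    test_preds = np.zeros((n_splits, len(X_test)))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for i, (learn_idx, pred_idx) in enumerate(skf.split(X, y)):
        m = clone(model).fit(X[learn_idx], y[learn_idx])    # learn on 4 folds
        oof[pred_idx] = m.predict_proba(X[pred_idx])[:, 1]  # held-out fold
        test_preds[i] = m.predict_proba(X_test)[:, 1]       # test set
    return oof, test_preds.mean(axis=0)                     # average fold models
```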
Stacking
We concatenate the L1 predictions as columns into the L1 dataset. Then we can fit an L2 stacking model on top of it. The process repeats for higher levels.

The L1 dataset spans both the train and test sets (meta-features prepared as on the previous slide), with one column per L1 model:

ID | TARGET | Logistic | Random Forest | LightGBM | Neural Network | Others…

ID and TARGET come from the raw data; the remaining columns are predictions from the L1 models. A level-2 model (logistic regression) fit on the train rows produces the L2 meta-feature and the test predictions submitted to Kaggle (see the sketch below).
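A minimal sketch of the L2 step, assuming the out-of-fold routine from the previous slide supplies the L1 columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_l2_stack(oof_preds, test_preds, y):
    # oof_preds / test_preds: dicts mapping L1 model name -> 1-d predictions,
    # produced by the out-of-fold routine on the previous slide
    L1_train = pd.DataFrame(oof_preds)   # one column per L1 model
    L1_test = pd.DataFrame(test_preds)
    l2 = LogisticRegression(max_iter=1000).fit(L1_train, y)
    # In a deeper stack, this L2 prediction would itself be generated
    # out-of-fold and fed to L3, and so on
    return l2.predict_proba(L1_test)[:, 1]
```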
Editor's Notes

1. Why Kaggle?
• Largest DS community globally, with >120k ranked competitors
• Great place to learn: competition and cooperation
• Time and resource constraints, just like real project work
• Level playing field + objective performance evaluation
• Focus on what works in practice vs. theory
• Can evaluate competing algorithms, pipelines and scientists
• Competition pushes the envelope: how much signal can we possibly squeeze from the data? Can sometimes advance the state of the art