Machine Learning Engineer
Automated ML, computer vision, time series, NLP
Teach sometimes
Guest lecturer: AI, DS, ML
Compete sometimes
“Competitions Expert” rank, top 3% worldwide
Follow my presentation and code at:
https://github.com/YuriyGuts/odsc-target-leakage-workshop
Target Leakage in ML
Data known at prediction time Data unknown at prediction time
time
prediction point
Leakage in a Nutshell
contamination
Training on contaminated data leads to overly optimistic
expectations about model performance in production
“But I always validate on random K-fold CV. I should be fine, right?”
Data Collection → Data Preparation → Feature Engineering → Partitioning → Training and Tuning → Evaluation
Leakage can happen anywhere during the project lifecycle
Leakage in Data Collection
Where is the leakage?
INSTNM | CITY | STABBR | (500 Cols Skipped) | MD_EARN_WNE_P10
Empire Beauty School-Jackson | Jackson | TN | ... | 17,100
Geneva College | Beaver Falls | PA | ... | 38,700
The Art Institute of Austin | Austin | TX | ... | 34,100
College of Business and Technology-Cutler Bay | Cutler Bay | FL | ... | 23,800
University of Hartford | West Hartford | CT | ... | 47,800
Median earnings of students 10 years after entry
Source: https://collegescorecard.ed.gov/
Where is the leakage?
INSTNM | CITY | STABBR | MD_EARN_WNE_P6 | MD_EARN_WNE_P10
Empire Beauty School-Jackson | Jackson | TN | 15,900 | 17,100
Geneva College | Beaver Falls | PA | 33,000 | 38,700
The Art Institute of Austin | Austin | TX | 27,300 | 34,100
College of Business and Technology-Cutler Bay | Cutler Bay | FL | 21,600 | 23,800
University of Hartford | West Hartford | CT | 38,400 | 47,800
6-year median earnings are highly predictive of 10-year median earnings.
But they are unavailable at prediction (admission) time.
Target is a function of another column
MEDIAN_EARNINGS_MONTHLY MEDIAN_EARNINGS_GROUP MD_EARN_WNE_P10
1425 LOW 17,100
1985 MED 23,800
3985 HIGH 47,800
The target can have different formatting or measurement units in different columns.
It can also have derived columns created after the fact for reporting purposes.
Forgetting to remove the copies will introduce target leakage.
Check out the example: example-01-data-collection.ipynb
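One rough way to spot copies of the target is to flag numeric columns that correlate almost perfectly with it. The sketch below is an illustration only, not the code from the workshop notebook. It uses the three rows shown above plus a hypothetical `unrelated_feature` column, and the 0.95 threshold is an arbitrary assumption. Pearson correlation only catches linear copies, so a bucketed column such as MEDIAN_EARNINGS_GROUP still needs a manual look.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def suspicious_columns(columns, target, threshold=0.95):
    """Return names of columns whose |correlation| with the target exceeds the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    ]


# Target and monthly earnings are taken from the rows above.
target = [17100, 23800, 47800]  # MD_EARN_WNE_P10
columns = {
    "MEDIAN_EARNINGS_MONTHLY": [1425, 1985, 3985],
    "unrelated_feature": [3, 1, 2],  # hypothetical column, for contrast only
}

flagged = suspicious_columns(columns, target)
print(flagged)
```

A flagged column is only a candidate for review: a legitimately strong feature can also correlate highly with the target.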
Where is the leakage?
EducationLvl Married AnnualIncome Purpose LatePaymentReminders IsBadLoan
1 Y 80k Car Purchase 0 0
3 N 120k Small Business 3 1
1 Y 85k House Purchase 5 1
2 N 72k Marriage 1 0
Mutable data due to lack of snapshot-ability
Database records get overwritten as more facts become available.
But these later facts won’t be available at prediction time.
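If the source system keeps an event log instead of a single overwritten counter, the feature can be rebuilt as it looked at prediction time. The sketch below assumes a hypothetical reminder log and application date. The names `reminder_events`, `application_dates`, and `reminders_as_of` are made up for illustration.

```python
from datetime import date

# Hypothetical event log: one row per late-payment reminder that was sent.
reminder_events = [
    {"loan_id": 2, "sent_on": date(2019, 3, 1)},
    {"loan_id": 2, "sent_on": date(2019, 4, 1)},
    {"loan_id": 2, "sent_on": date(2019, 5, 1)},
]

# Hypothetical prediction point: the day the loan application was scored.
application_dates = {2: date(2018, 6, 15)}


def reminders_as_of(loan_id, as_of, events):
    """Count only the reminders that already existed on the `as_of` date."""
    return sum(
        1 for e in events if e["loan_id"] == loan_id and e["sent_on"] <= as_of
    )


# What the overwritten database column shows today (leaky).
final_count = reminders_as_of(2, date.max, reminder_events)
# What was actually known when the prediction had to be made.
snapshot_count = reminders_as_of(2, application_dates[2], reminder_events)
print(final_count, snapshot_count)
```

For this loan the as-of count is zero because every reminder was sent after the application, which is the information the model would really have in production.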
Leakage in Feature Engineering
My model is sensitive to feature scaling...
Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate
OOPS. WE’RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET
Check out the example: example-02-data-prep.ipynb
Removing leakage in feature engineering
Dataset → Partition → train / eval
train: StandardScaler (fit) → Train
eval: StandardScaler (transform; apply mean and std obtained on train) → Evaluate
Obtain feature engineering/transformation parameters only on the training set
Apply them to transform the evaluation sets (CV, holdout, backtests, …)
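A minimal sketch of the fit-on-train, transform-everywhere rule, written with the standard library only. `TrainOnlyScaler` is a hypothetical stand-in for a real scaler and the numbers are invented. With scikit-learn, wrapping the scaler and the model in a Pipeline gives the same behavior during cross-validation.

```python
from statistics import mean, pstdev


class TrainOnlyScaler:
    """Standardizes values using a mean and std learned from the training data only."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Population std; fall back to 1.0 for a constant column.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical data, already partitioned BEFORE any scaling happens.
train = [1.0, 2.0, 3.0, 4.0]
evaluation = [10.0, 20.0]

scaler = TrainOnlyScaler().fit(train)        # parameters come from train only
train_scaled = scaler.transform(train)
eval_scaled = scaler.transform(evaluation)   # reuse the frozen parameters
print(scaler.mean_, scaler.std_)
```

The evaluation values end up far from zero after scaling. That is expected: the scaler never saw them, so the model is evaluated on the same terms it will face in production.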
Encoding of different variable types
Text:
Learn DTM columns from the training set only, then transform the evaluation sets
(avoid leaking possible out-of-vocabulary words into the training pipeline)
Categoricals:
Create mappings on the training set only, then transform the evaluation sets
(avoid leaking cardinality/frequency info into the training pipeline)
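The same rule for categoricals, as a rough sketch: the mapping is built from training values only, and anything unseen falls into a reserved "unknown" code. The function names and the choice of 0 for unknown values are assumptions made for illustration. The category values reuse the Purpose column from the loan example.

```python
def fit_category_mapping(train_values):
    """Assign an integer code to every category seen in training (0 is reserved)."""
    mapping = {}
    for value in train_values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode(values, mapping, unknown=0):
    """Encode values with a frozen mapping; unseen categories get the unknown code."""
    return [mapping.get(value, unknown) for value in values]


train_purpose = ["Car Purchase", "Small Business", "House Purchase", "Car Purchase"]
eval_purpose = ["Marriage", "Car Purchase"]  # "Marriage" never appears in training

mapping = fit_category_mapping(train_purpose)
train_encoded = encode(train_purpose, mapping)
eval_encoded = encode(eval_purpose, mapping)
print(train_encoded, eval_encoded)
```

A text vocabulary can be handled the same way: collect tokens from the training documents only and map out-of-vocabulary tokens to a single reserved column.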
What about deep learning?
OOPS. THE MODEL MIGHT NOT NECESSARILY LEARN THE FEATURES YOU THINK IT LEARNS. INTERPRETATION IS KEY
CLASS ACTIVATION MAP INDICATES THE MODEL PAYS ATTENTION TO PEOPLE
Leakage in Partitioning
https://twitter.com/AndrewYNg/status/931026446717296640
Group Leakage
OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
[Table: Paper v1 (AUC) vs. Paper v3 (AUC)]
The Cold Start Problem
[Figure: Observe vs. Predict]
Group Partitioning, Out-of-Group Validation
[Figure: Fold 1, Fold 2, Fold 3, each showing which groups go to Training and which to Validation]
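Group partitioning can be sketched as follows: all rows that share a group (for example, a patient) land on the same side of the split. `group_k_fold` is a hypothetical helper written for this illustration and the patient IDs are invented. scikit-learn offers GroupKFold for the same purpose.

```python
def group_k_fold(groups, n_folds=3):
    """Yield (train_idx, val_idx) pairs; each group is validated in exactly one fold."""
    unique_groups = sorted(set(groups))
    for fold in range(n_folds):
        # Every n_folds-th group goes to validation in this fold.
        val_groups = set(unique_groups[fold::n_folds])
        train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
        val_idx = [i for i, g in enumerate(groups) if g in val_groups]
        yield train_idx, val_idx


# Hypothetical image-level rows: several images per patient.
patient_ids = ["p1", "p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3", "p4", "p5", "p5"]

folds = list(group_k_fold(patient_ids, n_folds=3))
for train_idx, val_idx in folds:
    train_patients = {patient_ids[i] for i in train_idx}
    val_patients = {patient_ids[i] for i in val_idx}
    print(sorted(val_patients), "overlap:", train_patients & val_patients)
```

Because no patient appears on both sides, the validation score reflects performance on patients the model has never seen.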
Leakage in oversampling / augmentation
Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate
Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate
OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
First partition, then augment the training data.
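A small sketch of that order of operations for pairwise data. The pairs and helper names are hypothetical. The point is only that the AB → BA copies are created after the evaluation rows have been set aside.

```python
import random


def augment_pairs(rows):
    """Add the swapped (B, A) copy of every (A, B, label) row."""
    return rows + [(b, a, label) for a, b, label in rows]


def partition_then_augment(rows, eval_fraction=0.25, seed=0):
    """Split first, then augment only the training part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    eval_rows, train_rows = shuffled[:n_eval], shuffled[n_eval:]
    return augment_pairs(train_rows), eval_rows


# Hypothetical pairwise rows: (item A, item B, label).
pairs = [("q1", "q2", 1), ("q3", "q4", 0), ("q5", "q6", 1), ("q7", "q8", 0)]

train_rows, eval_rows = partition_then_augment(pairs)
print(len(train_rows), len(eval_rows))
```

Doing it the other way round (augment, then split at random) can put (A, B) in training and (B, A) in evaluation, which is the leak described above.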
Overly Optimistic Prediction Results on Imbalanced Data
“We noticed a large number (24!) of studies reporting near-perfect results on a public dataset when
estimating the risk of preterm birth for a patient”
https://arxiv.org/abs/2001.06296
https://github.com/GillesVandewiele/EHG-oversampling/
Random Partitioning for Time-Aware Models
Out-of-Time Validation (OTV)
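Out-of-time validation can be sketched as a split on a cutoff date: everything before the cutoff is used for training and everything from the cutoff onward for validation. The monthly rows and the `out_of_time_split` helper below are hypothetical.

```python
from datetime import date


def out_of_time_split(rows, cutoff, time_key="date"):
    """Train on rows strictly before the cutoff, validate on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    validation = [r for r in rows if r[time_key] >= cutoff]
    return train, validation


# Hypothetical monthly observations for one year.
rows = [{"date": date(2020, month, 1), "sales": 100 + month} for month in range(1, 13)]

train, validation = out_of_time_split(rows, cutoff=date(2020, 10, 1))
print(len(train), len(validation))
```

Unlike a random split, the model never trains on observations that come after the ones it is validated on.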
Leakage in Training & Tuning
Reusing the same Train-Validation split for multiple tasks
(T = training, V = validation, H = holdout)
Feature selection: T V H
Hyperparameter tuning: T V H
Early stopping: T V H
Model selection: T V H
OOPS. CAN MANUALLY OVERFIT TO THE VALIDATION SET.
BETTER USE NESTED CV / DIFFERENT SPLITS FOR DIFFERENT TASKS
Different tasks should be validated differently
Feature selection: T, V (feature selection)
Hyperparameter tuning: T, V (hyper tuning)
Model selection: T, V (model selection)
Pre-deployment check: T, H
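One possible way to give each task its own validation data is to carve separate, non-overlapping partitions up front. This is a sketch under assumptions, not the setup from the slides: the partition sizes, the task names, and the `split_for_tasks` helper are all hypothetical. Nested cross-validation is the alternative when data is too scarce to split this many ways.

```python
import random


def split_for_tasks(n_rows, tasks, holdout_size, validation_size, seed=0):
    """Return disjoint index lists: one holdout, one validation set per task, and train."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    partitions = {"holdout": indices[:holdout_size]}
    start = holdout_size
    for task in tasks:
        partitions[task] = indices[start:start + validation_size]
        start += validation_size
    partitions["train"] = indices[start:]
    return partitions


tasks = ["feature_selection", "hyperparameter_tuning", "model_selection"]
partitions = split_for_tasks(100, tasks, holdout_size=20, validation_size=10)
print({name: len(idx) for name, idx in partitions.items()})
```

The holdout is touched only once, for the pre-deployment check, so repeated decisions made on the validation partitions cannot overfit to it.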
Model stacking on in-sample predictions
Model 1: Train → Predict
Model 2: Train → Predict
2nd Level Model: Train
OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
Proper way to stack models
Model 1: Train → Predict on V
Model 2: Train → Predict on V
2nd Level Model: Train
Compute all meta-features only out-of-fold
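To see why in-sample meta-features leak, consider a toy base model that memorizes its training data. The sketch below uses a hypothetical 1-nearest-neighbor predictor on invented data. In-sample predictions simply reproduce the target, while out-of-fold predictions come from a model that never saw the row it predicts.

```python
def predict_1nn(train_x, train_y, x):
    """Return the target of the nearest training point (a model that memorizes)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def in_sample_meta_feature(xs, ys):
    """LEAKY: each row is predicted by a model that was trained on that same row."""
    return [predict_1nn(xs, ys, x) for x in xs]


def out_of_fold_meta_feature(xs, ys, n_folds=3):
    """Each row is predicted by a model trained on the other folds only."""
    meta = [None] * len(xs)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in range(fold, len(xs), n_folds):
            meta[i] = predict_1nn(train_x, train_y, xs[i])
    return meta


# Hypothetical feature and noisy labels.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
ys = [1, 0, 0, 1, 1, 0]

leaky = in_sample_meta_feature(xs, ys)
oof = out_of_fold_meta_feature(xs, ys)
print(leaky == ys, oof == ys)
```

A second-level model trained on `leaky` would learn to trust a column that is just the target in disguise. Trained on `oof`, it sees how the base model behaves on unseen rows.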
Leakage in Competitive ML
Case Studies
● Target can sometimes be recovered from the metadata
(Kaggle: Dato “Truly Native?” competition, 2015)
● ...or external data (e.g. web scraping)
(Kaggle: PetFinder.my Adoption Prediction, 2019)
● Overrepresented minority class in pairwise data enables reverse engineering
(Kaggle: Quora Question Pairs competition, 2017)
● Removing user IDs does not necessarily mean data anonymization
(Kaggle: Expedia Hotel Recommendations, 2016)
Prevention Measures
Leakage prevention checklist (not exhaustive!)
● Split the holdout away immediately and do not preprocess it before final model evaluation.
● Make sure you have a data dictionary and understand the meaning of every column, as well
as unusual values (e.g. negative sales) or outliers.
● For every column in the final feature set, try answering the question:
“Will I have this feature at prediction time in my workflow? What values can it have?”
● Figure out preprocessing parameters on the training subset, freeze them elsewhere.
For scikit-learn, use Pipelines.
● Treat feature selection, model tuning, model selection as separate “machine learning
models” that need to be validated separately.
● Make sure your validation setup represents the problem you need to solve with the model.
● Check feature importance and prediction explanations: do top features make sense?
Thank you!
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts

Target Leakage in Machine Learning (ODSC East 2020)

  • 2. Machine Learning Engineer: automated ML, computer vision, time series, NLP. Teach sometimes: guest lecturer (AI, DS, ML). Compete sometimes: “Competitions Expert” rank, top 3% worldwide.
  • 5. Leakage in a Nutshell. [Timeline diagram: data known at prediction time | prediction point | data unknown at prediction time; contamination]
  • 6. Leakage in a Nutshell. [Timeline diagram: data known at prediction time | prediction point | data unknown at prediction time; contamination] Training on contaminated data leads to overly optimistic expectations about model performance in production.
  • 7. “But I always validate on random K-fold CV. I should be fine, right?”
  • 10. Leakage in Data Collection
  • 11. Where is the leakage?
      INSTNM                                         CITY           STABBR  (500 Cols Skipped)  MD_EARN_WNE_P10
      Empire Beauty School-Jackson                   Jackson        TN      ...                 17,100
      Geneva College                                 Beaver Falls   PA      ...                 38,700
      The Art Institute of Austin                    Austin         TX      ...                 34,100
      College of Business and Technology-Cutler Bay  Cutler Bay     FL      ...                 23,800
      University of Hartford                         West Hartford  CT      ...                 47,800
      Median earnings of students 10 years after entry (MD_EARN_WNE_P10).
      Source: https://collegescorecard.ed.gov/
  • 12. Where is the leakage?
      INSTNM                                         CITY           STABBR  MD_EARN_WNE_P6  MD_EARN_WNE_P10
      Empire Beauty School-Jackson                   Jackson        TN      15,900          17,100
      Geneva College                                 Beaver Falls   PA      33,000          38,700
      The Art Institute of Austin                    Austin         TX      27,300          34,100
      College of Business and Technology-Cutler Bay  Cutler Bay     FL      21,600          23,800
      University of Hartford                         West Hartford  CT      38,400          47,800
      6-year median earnings are highly predictive of 10-year median earnings, but they are unavailable at prediction (admission) time.
  • 13. Target is a function of another column
      MEDIAN_EARNINGS_MONTHLY  MEDIAN_EARNINGS_GROUP  MD_EARN_WNE_P10
      1425                     LOW                    17,100
      1985                     MED                    23,800
      3985                     HIGH                   47,800
      The target can have different formatting or measurement units in different columns. It can also have derived columns created after the fact for reporting purposes. Forgetting to remove the copies will introduce target leakage.
      Check out the example: example-01-data-collection.ipynb
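One rough way to spot numeric copies of the target, such as the monthly column on slide 13, is to flag features whose correlation with the target is almost perfect. The sketch below is not from the deck: the helper names, the 0.99 threshold, and the `ENROLLMENT` column are assumptions for illustration, and three rows are far too few for a real check.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def suspicious_columns(features, target, threshold=0.99):
    """Return names of numeric features almost perfectly correlated with the target."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Target and monthly earnings are the values shown on slide 13.
target = [17100, 23800, 47800]  # MD_EARN_WNE_P10
features = {
    "MEDIAN_EARNINGS_MONTHLY": [1425, 1985, 3985],
    "ENROLLMENT": [1200, 300, 900],  # hypothetical column with made-up values
}

flagged = suspicious_columns(features, target)
print(flagged)
```

A check like this only catches linear numeric copies. A binned column such as MEDIAN_EARNINGS_GROUP still needs a data dictionary or a manual review.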
  • 14. Where is the leakage?
      EducationLvl  Married  AnnualIncome  Purpose         LatePaymentReminders  IsBadLoan
      1             Y        80k           Car Purchase    0                     0
      3             N        120k          Small Business  3                     1
      1             Y        85k           House Purchase  5                     1
      2             N        72k           Marriage        1                     0
  • 15. Mutable data due to lack of snapshot-ability
      EducationLvl  Married  AnnualIncome  Purpose         LatePaymentReminders  IsBadLoan
      1             Y        80k           Car Purchase    0                     0
      3             N        120k          Small Business  3                     1
      1             Y        85k           House Purchase  5                     1
      2             N        72k           Marriage        1                     0
      Database records get overwritten as more facts become available. But these later facts won’t be available at prediction time.
  • 16. Leakage in Feature Engineering
  • 17. My model is sensitive to feature scaling... Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate
  • 18. My model is sensitive to feature scaling... Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate. OOPS. WE’RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET. Check out the example: example-02-data-prep.ipynb
  • 19. Removing leakage in feature engineering. Dataset → Partition. Train branch: StandardScaler (fit) → Train. Eval branch: StandardScaler (transform; apply mean and std obtained on train) → Evaluate. Obtain feature engineering/transformation parameters only on the training set. Apply them to transform the evaluation sets (CV, holdout, backtests, …).
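The fit-on-train, transform-on-eval flow from slide 19 can be sketched with a scikit-learn Pipeline. The synthetic dataset, the split ratio, and the choice of LogisticRegression are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Partition first, so the evaluation rows never influence the scaler.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),    # mean/std are learned in fit() only
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)          # scaler statistics come from X_train

# score() transforms X_eval with the frozen training statistics.
eval_accuracy = model.score(X_eval, y_eval)
scaler_mean = model.named_steps["scaler"].mean_
```

Because the scaler lives inside the Pipeline, passing the whole pipeline to a cross-validation helper refits the scaler on each training fold.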
  • 20. Encoding of different variable types
      Text: learn the document-term matrix (DTM) columns from the training set only, then transform the evaluation sets (avoid leaking possible out-of-vocabulary words into the training pipeline).
      Categoricals: create mappings on the training set only, then transform the evaluation sets (avoid leaking cardinality/frequency info into the training pipeline).
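For the text case on slide 20, a minimal sketch with scikit-learn's CountVectorizer shows the vocabulary being learned from training documents only. The example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents (assumption); "crypto" appears only in the evaluation set.
train_texts = ["cheap loans today", "meeting at noon"]
eval_texts = ["cheap crypto today"]

vectorizer = CountVectorizer()
train_dtm = vectorizer.fit_transform(train_texts)  # vocabulary learned here
eval_dtm = vectorizer.transform(eval_texts)        # reuses the frozen vocabulary

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(eval_dtm.toarray())
```

The evaluation matrix has the same columns as the training matrix, and the out-of-vocabulary word is dropped instead of creating a new column.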
  • 21. What about deep learning?
  • 24. What about deep learning? OOPS. THE MODEL MIGHT NOT NECESSARILY LEARN THE FEATURES YOU THINK IT LEARNS. INTERPRETATION IS KEY
  • 25. CLASS ACTIVATION MAP INDICATES THE MODEL PAYS ATTENTION TO PEOPLE
  • 29. Group Leakage OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
  • 30. Paper v1 (AUC) Paper v3 (AUC)
  • 31. The Cold Start Problem Observe: Predict:
  • 32. Group Partitioning, Out-of-Group Validation Training: Validation: Fold 1 Fold 2 Fold 3
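Group partitioning as on slide 32 can be sketched with scikit-learn's GroupKFold, which keeps every group on one side of each split. The patient IDs and the four-images-per-patient layout are synthetic assumptions that mirror the ratio mentioned on slide 29.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six hypothetical patients with four images each.
patient_ids = np.repeat(np.arange(6), 4)
X = np.arange(len(patient_ids)).reshape(-1, 1)  # placeholder features
y = patient_ids % 2                             # placeholder labels

overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=3).split(X, y, groups=patient_ids):
    # Count patients that appear on both sides of the split.
    shared = set(patient_ids[train_idx]) & set(patient_ids[valid_idx])
    overlaps.append(len(shared))

print(overlaps)
```

Every validation fold then contains only patients the model has never seen, which matches the cold-start situation it will face in production.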
  • 33. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate
  • 34. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate. Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate
  • 35. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate. Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate. OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
  • 36. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate. Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate. First partition, then augment the training data.
  • 37. Overly Optimistic Prediction Results on Imbalanced Data “We noticed a large number (24!) of studies reporting near-perfect results on a public dataset when estimating the risk of preterm birth for a patient” https://arxiv.org/abs/2001.06296 https://github.com/GillesVandewiele/EHG-oversampling/
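"First partition, then augment the training data" can be sketched as follows. Rows are represented by their IDs so it is easy to confirm that no duplicated row ends up in the evaluation set. The class ratio, split size, and simple random oversampling are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced problem: 10 positives, 90 negatives (assumption).
row_ids = np.arange(100)
y = np.array([1] * 10 + [0] * 90)

# 1. Partition the original, un-augmented rows.
train_ids, eval_ids = train_test_split(
    row_ids, test_size=0.2, stratify=y, random_state=0
)

# 2. Oversample the minority class inside the training partition only.
train_y = y[train_ids]
minority = train_ids[train_y == 1]
majority = train_ids[train_y == 0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
train_ids_balanced = np.concatenate([train_ids, extra])

# The evaluation partition keeps its original class distribution.
leaked_rows = set(train_ids_balanced) & set(eval_ids)
```

Oversampling before the split would allow copies of the same minority row to land on both sides, which is the failure mode shown on slide 35.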
  • 38. Random Partitioning for Time-Aware Models
  • 42. Reusing the same Train-Validation split for multiple tasks. The same T / V / H partitioning is used for each of: feature selection, hyperparameter tuning, early stopping, model selection.
  • 43. Reusing the same Train-Validation split for multiple tasks. The same T / V / H partitioning is used for each of: feature selection, hyperparameter tuning, early stopping, model selection. OOPS. CAN MANUALLY OVERFIT TO THE VALIDATION SET. BETTER USE NESTED CV / DIFFERENT SPLITS FOR DIFFERENT TASKS
  • 44. Different tasks should be validated differently. Feature selection: T with its own V (feature selection). Hyperparameter tuning: T with its own V (hyper tuning). Model selection: T with its own V (model selection). Pre-deployment check: H.
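Nested CV, as suggested on slide 43, can be sketched by wrapping a GridSearchCV (inner loop, hyperparameter tuning) inside cross_val_score (outer loop, performance estimate). The dataset, model, parameter grid, and fold counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: picks hyperparameters using only the outer training fold.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
)

# Outer loop: scores the whole tuning procedure on folds it never saw.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean())
```

The outer scores estimate how well "tune, then fit" generalizes, so they are not inflated by the search having picked the best value on the same validation data.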
  • 45. Model stacking on in-sample predictions. Model 1: Train → Predict. Model 2: Train → Predict. 2nd Level Model: Train.
  • 46. Model stacking on in-sample predictions. Model 1: Train → Predict. Model 2: Train → Predict. 2nd Level Model: Train. OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
  • 47. Proper way to stack models. Model 1: Train → Predict on V. Model 2: Train → Predict on V. 2nd Level Model: Train. Compute all meta-features only out-of-fold.
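Out-of-fold meta-features, as on slide 47, can be sketched with scikit-learn's cross_val_predict: each row's prediction comes from a model that did not see that row during training. The base models, the second-level model, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# One column per base model, filled only with out-of-fold probabilities.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models
])

# The 2nd level model trains on predictions that never saw their own target.
meta_model = LogisticRegression().fit(meta_features, y)
```

Evaluating the stacked model still needs data that neither level has touched, for example the holdout.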
  • 49. Case Studies
      ● Target can sometimes be recovered from the metadata (Kaggle: Dato “Truly Native?” competition, 2015)
      ● ...or external data (e.g. web scraping) (Kaggle: PetFinder.my Adoption Prediction, 2019)
      ● Overrepresented minority class in pairwise data enables reverse engineering (Kaggle: Quora Question Pairs competition, 2017)
      ● Removing user IDs does not necessarily mean data anonymization (Kaggle: Expedia Hotel Recommendations, 2016)
  • 51. Leakage prevention checklist (not exhaustive!)
      ● Split the holdout away immediately and do not preprocess it before final model evaluation.
      ● Make sure you have a data dictionary and understand the meaning of every column, as well as unusual values (e.g. negative sales) or outliers.
      ● For every column in the final feature set, try answering the question: “Will I have this feature at prediction time in my workflow? What values can it have?”
      ● Figure out preprocessing parameters on the training subset, freeze them elsewhere. For scikit-learn, use Pipelines.
      ● Treat feature selection, model tuning, model selection as separate “machine learning models” that need to be validated separately.
      ● Make sure your validation setup represents the problem you need to solve with the model.
      ● Check feature importance and prediction explanations: do top features make sense?