Target Leakage in ML
Yuriy Guts
Machine Learning Engineer
Automated ML, time series forecasting, NLP
Lecturer
AI, Machine Learning, Summer/Winter ML Schools
Compete sometimes
Currently hold an Expert rank, top 2% worldwide
Follow my presentation and code at:
https://github.com/YuriyGuts/odsc-target-leakage-workshop
Leakage in a Nutshell
[Timeline diagram: the prediction point divides data known at prediction time from data unknown at prediction time. Label: contamination.]
Training on contaminated data leads to overly optimistic
expectations about model performance in production
“But I always validate on random K-fold CV. I should be fine, right?”
Data Collection → Data Preparation → Feature Engineering → Partitioning → Training and Tuning → Evaluation
Leakage can happen anywhere during the project lifecycle
Leakage in Data Collection
Where is the leakage?
EmployeeID  Title           ExperienceYears  MonthlySalaryGBP  AnnualIncomeUSD
315981      Data Scientist  3                5,000.00          78,895.44
4691        Data Scientist  4                5,500.00          86,784.98
23598       Data Scientist  5                6,200.00          97,830.35
Target is a function of another column
The target can have different formatting or measurement units in different columns.
Forgetting to remove the copies will introduce target leakage.
Check out the example: example-01-data-collection.ipynb
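A rough way to screen for this kind of copy is to test whether the ratio between a candidate column and the target stays nearly constant across rows. The sketch below uses the three rows from the slide and assumes that AnnualIncomeUSD is the target; the helper name `is_rescaled_copy` and the tolerance are illustrative choices, not part of the workshop code.

```python
# Rows copied from the slide. Treating AnnualIncomeUSD as the target is an assumption.
rows = [
    {"ExperienceYears": 3, "MonthlySalaryGBP": 5000.00, "AnnualIncomeUSD": 78895.44},
    {"ExperienceYears": 4, "MonthlySalaryGBP": 5500.00, "AnnualIncomeUSD": 86784.98},
    {"ExperienceYears": 5, "MonthlySalaryGBP": 6200.00, "AnnualIncomeUSD": 97830.35},
]


def is_rescaled_copy(rows, feature, target, rel_tol=1e-4):
    """Return True if target / feature is (almost) the same constant in every row."""
    ratios = [r[target] / r[feature] for r in rows if r[feature] != 0]
    if len(ratios) < 2:
        return False
    lowest, highest = min(ratios), max(ratios)
    return (highest - lowest) <= rel_tol * abs(highest)


# Hypothetical screening loop over candidate feature columns.
for column in ("ExperienceYears", "MonthlySalaryGBP"):
    flag = is_rescaled_copy(rows, column, "AnnualIncomeUSD")
    print(column, "looks like a rescaled copy of the target:", flag)
```

A constant ratio only catches unit and currency conversions; other transformations of the target (offsets, rounding, bucketing) would need different checks.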
Where is the leakage?
SubscriberID  Group   DailyVoiceUsage  DailySMSUsage  DailyDataUsage  Gender
24092091      M18-25  15.31            25             135.10          0
4092034091    F40-60  35.81            3              5.01            1
329815        F25-40  13.09            32             128.52          1
94721835      M25-40  18.52            21             259.34          0
Feature is an aggregate of the target
E.g., the data can have derived columns created after the fact for reporting purposes
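One possible check, assuming Gender is the target here, is to see whether some part of a feature value maps to exactly one target value. The sketch below uses the Group and Gender columns from the slide; the helper `target_values_per_key` and the choice to look at the first character of Group are assumptions made for illustration.

```python
from collections import defaultdict

# Group and Gender columns copied from the slide. Gender as the target is an assumption.
rows = [
    {"Group": "M18-25", "Gender": 0},
    {"Group": "F40-60", "Gender": 1},
    {"Group": "F25-40", "Gender": 1},
    {"Group": "M25-40", "Gender": 0},
]


def target_values_per_key(rows, key_fn, target):
    """Collect the set of target values observed for each derived key."""
    seen = defaultdict(set)
    for row in rows:
        seen[key_fn(row)].add(row[target])
    return dict(seen)


# Derive a key from the first character of Group and see which targets it co-occurs with.
by_prefix = target_values_per_key(rows, lambda r: r["Group"][0], "Gender")

# If every key maps to a single target value, the feature may encode the target.
prefix_determines_target = all(len(values) == 1 for values in by_prefix.values())
print(by_prefix, prefix_determines_target)
```

On four rows a perfect mapping could be a coincidence, so a result like this is a prompt to read the data dictionary rather than proof of leakage.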
Where is the leakage?
Education  Married  AnnualIncome  Purpose         LatePaymentReminders  IsBadLoan
1          Y        80k           Car Purchase    0                     0
3          N        120k          Small Business  3                     1
1          Y        85k           House Purchase  5                     1
2          N        72k           Marriage        1                     0
Mutable data due to lack of snapshot-ability
Database records get overwritten as more facts become available.
But these later facts won't be available at prediction time.
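If the underlying events are stored with timestamps, one way to avoid overwritten values is to recompute each feature as of the prediction point. The sketch below is a minimal illustration under that assumption; the event log, the applicant IDs, the decision dates, and the helper `reminders_as_of` are all hypothetical.

```python
from datetime import date

# Hypothetical append-only log: one entry per late payment reminder sent.
reminder_log = [
    {"applicant_id": "A", "sent_on": date(2020, 3, 1)},
    {"applicant_id": "A", "sent_on": date(2020, 6, 1)},
    {"applicant_id": "B", "sent_on": date(2020, 2, 15)},
]

# Hypothetical prediction points: the day each loan decision is made.
decision_date = {"A": date(2020, 1, 10), "B": date(2020, 4, 1)}


def reminders_as_of(applicant_id, as_of, log):
    """Count only the reminders that had already been sent before the prediction point."""
    return sum(
        1
        for event in log
        if event["applicant_id"] == applicant_id and event["sent_on"] < as_of
    )


features = {
    applicant: reminders_as_of(applicant, when, reminder_log)
    for applicant, when in decision_date.items()
}
print(features)
```

A counter column that is updated in place would report two reminders for applicant A, both of which were sent after the decision and so could not have been known when it was made.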
Leakage in Feature Engineering
My model is sensitive to feature scaling...
Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate
OOPS. WE'RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET
Check out the example: example-02-data-prep.ipynb
Removing leakage in feature engineering
Dataset → Partition
train → StandardScaler (fit) → Train
eval → StandardScaler (transform: apply mean and std obtained on train) → Evaluate
Obtain feature engineering/transformation parameters only on the training set
Apply them to transform the evaluation sets (CV, holdout, backtests, ...)
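The fit-on-train, transform-everywhere pattern can be sketched without any library. The example below is a simplified stand-in for what the slides do with StandardScaler, not the workshop's notebook code; `fit_scaler`, `transform`, and the toy numbers are illustrative assumptions.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from training data only."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant columns


def transform(values, params):
    """Apply previously learned parameters; never re-estimates them."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Toy data: partition first, then fit the scaler on the training part.
train = [1.0, 2.0, 3.0, 4.0]
evaluation = [10.0, 12.0]

params = fit_scaler(train)                   # parameters come from train only
train_scaled = transform(train, params)
eval_scaled = transform(evaluation, params)  # evaluation reuses the frozen parameters
print(params, eval_scaled)
```

The evaluation values land far from zero after scaling, which is the honest outcome: the scaler has never seen them, just as it will not have seen production data.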
Encoding of different variable types
Text:
Learn DTM columns from the training set only, then transform the evaluation sets
(avoid leaking possible out-of-vocabulary words into the training pipeline)
Categoricals:
Create mappings on the training set only, then transform the evaluation sets
(avoid leaking cardinality/frequency info into the training pipeline)
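For categoricals, the same idea can be sketched as a mapping learned on the training values, with a reserved code for anything unseen. The helper names, the reserved code 0, and the sample values (borrowed from the Purpose column of the loan table) are assumptions for illustration.

```python
def fit_category_map(train_values):
    """Build a category-to-integer mapping from training values only."""
    mapping = {}
    for value in train_values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # code 0 is reserved for unseen values
    return mapping


def encode(values, mapping, unseen_code=0):
    """Encode with a frozen mapping; categories not seen in training share one code."""
    return [mapping.get(value, unseen_code) for value in values]


train_purpose = ["Car Purchase", "Small Business", "Car Purchase"]
eval_purpose = ["House Purchase", "Small Business"]

mapping = fit_category_map(train_purpose)
train_encoded = encode(train_purpose, mapping)
eval_encoded = encode(eval_purpose, mapping)
print(mapping, train_encoded, eval_encoded)
```

The same shape applies to text: a vocabulary built from training documents plays the role of `mapping`, and out-of-vocabulary words in the evaluation sets fall into the unseen bucket.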
Leakage in Partitioning
https://twitter.com/AndrewYNg/status/931026446717296640
Group Leakage
OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
[Table: Paper v1 (AUC) vs. Paper v3 (AUC)]
The Cold Start Problem
[Diagram with two panels: Observe, Predict]
Group Partitioning, Out-of-Group Validation
[Diagram: training and validation partitions for Fold 1, Fold 2, Fold 3]
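Group partitioning can be sketched as assigning whole groups, rather than individual rows, to folds. The example below is a minimal stdlib illustration with hypothetical patient IDs; scikit-learn's GroupKFold offers a more complete version of the same idea.

```python
def group_folds(groups, n_folds):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides."""
    unique_groups = sorted(set(groups))
    # Round-robin assignment of whole groups to folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    for fold in range(n_folds):
        val_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, val_idx


# Hypothetical data: several images per patient, one entry per image.
patient_ids = ["p1", "p1", "p1", "p1", "p2", "p2", "p3", "p4", "p4", "p5", "p6", "p6"]

splits = list(group_folds(patient_ids, n_folds=3))
for train_idx, val_idx in splits:
    train_patients = {patient_ids[i] for i in train_idx}
    val_patients = {patient_ids[i] for i in val_idx}
    print(sorted(val_patients), "shared patients:", train_patients & val_patients)
```

Round-robin assignment ignores group sizes, so folds can end up unbalanced when a few groups hold most of the rows.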
Leakage in oversampling / augmentation
Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate
Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate
OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
First partition, then augment the training data.
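The partition-then-augment order can be sketched for the pairwise case as follows. The pair data, the 80/20 cut, and the helper names are hypothetical; the point is only that the swapped copies are created after the split and only inside the training part.

```python
import random

# Hypothetical pairwise rows: (item A, item B, label).
pairs = [("q%d_a" % i, "q%d_b" % i, i % 2) for i in range(10)]

# Step 1: partition the original rows.
rng = random.Random(0)
shuffled = pairs[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, evaluation = shuffled[:cut], shuffled[cut:]


def add_swapped(rows):
    """Augment pairwise rows with their mirrored (B, A) copies."""
    return rows + [(b, a, label) for a, b, label in rows]


# Step 2: augment the training rows only.
train_augmented = add_swapped(train)


def unordered(rows):
    """Represent each pair without regard to order, to detect mirrored copies."""
    return {frozenset((a, b)) for a, b, _ in rows}


overlap = unordered(train_augmented) & unordered(evaluation)
print(len(train_augmented), len(evaluation), overlap)
```

Running the augmentation before the split instead would let an AB row land in training while its BA mirror lands in evaluation.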
Random Partitioning for Time-Aware Models
Out-of-Time Validation (OTV)
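A minimal sketch of out-of-time validation: rows before a cutoff are used for training, rows on or after it for validation, regardless of their order in the dataset. The rows, the cutoff date, and the helper `out_of_time_split` are illustrative assumptions.

```python
from datetime import date

# Hypothetical time-stamped rows, deliberately not sorted.
rows = [
    {"day": date(2020, 1, 5), "y": 10},
    {"day": date(2020, 3, 9), "y": 14},
    {"day": date(2020, 2, 1), "y": 12},
    {"day": date(2020, 4, 20), "y": 15},
    {"day": date(2020, 5, 2), "y": 18},
]


def out_of_time_split(rows, cutoff, time_key="day"):
    """Train strictly before the cutoff, validate on or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    valid = [r for r in rows if r[time_key] >= cutoff]
    return train, valid


train, valid = out_of_time_split(rows, cutoff=date(2020, 4, 1))
print(len(train), len(valid))
```

A random split of the same rows could place May data in training and January data in validation, which lets the model look into the future.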
Leakage in Training & Tuning
Reusing a CV split for multiple tasks
[Diagram: Fold 1 to Fold 5, each row with a V partition and an H partition]
Feature selection, hyperparameter tuning, model selection...
OOPS. CAN OVERTUNE TO THE VALIDATION SET
BETTER USE DIFFERENT SPLITS FOR DIFFERENT TASKS
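One simple reading of this advice is to draw an independent fold assignment for each task, for example by using a different random seed per task. The sketch below shows only that mechanism; the helper `kfold_indices`, the seeds, and the row count are assumptions, and nested cross-validation is a stricter alternative.

```python
import random


def kfold_indices(n_rows, n_folds, seed):
    """Return a list of validation index lists, one per fold, for a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[fold::n_folds]) for fold in range(n_folds)]


# Each task gets its own partitioning of the same 100 training rows.
feature_selection_folds = kfold_indices(100, n_folds=5, seed=1)
tuning_folds = kfold_indices(100, n_folds=5, seed=2)

print(feature_selection_folds[0][:5], tuning_folds[0][:5])
```

The holdout stays outside all of these splits, so it remains untouched until the final evaluation.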
Model stacking on in-sample predictions
Model 1: Train → Predict
Model 2: Train → Predict
2nd Level Model: Train
OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
Better way to stack
Model 1: Train → Predict on V
Model 2: Train → Predict on V
2nd Level Model: Train
Compute all meta-features only out-of-fold
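The difference between in-sample and out-of-fold meta-features shows up clearly with a deliberately overfit base model. In the sketch below, `Memorizer` is a hypothetical toy model that recalls every training target exactly, and the fold assignment by index modulo is a simplification.

```python
from statistics import mean

# Toy data: row ids stand in for features.
y = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0]
X = list(range(len(y)))


class Memorizer:
    """Deliberately overfit base model: recalls seen targets, else predicts the training mean."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.default = mean(y)
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


# In-sample meta-feature: the base model predicts rows it was trained on.
in_sample = Memorizer().fit(X, y).predict(X)


def out_of_fold_predictions(X, y, n_folds):
    """Predict each row with a model that never saw that row during training."""
    oof = [None] * len(y)
    for fold in range(n_folds):
        val = [i for i in range(len(y)) if i % n_folds == fold]
        trn = [i for i in range(len(y)) if i % n_folds != fold]
        model = Memorizer().fit([X[i] for i in trn], [y[i] for i in trn])
        for i, prediction in zip(val, model.predict([X[i] for i in val])):
            oof[i] = prediction
    return oof


oof = out_of_fold_predictions(X, y, n_folds=3)
print("in-sample equals target:", in_sample == y)
print("out-of-fold equals target:", oof == y)
```

A second-level model trained on `in_sample` would learn to trust a feature that is literally the target; trained on `oof`, it sees the base model's behaviour on rows it has not memorized.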
Leakage in Competitive ML
Case Studies
โ— Removing user IDs does not necessarily mean data anonymization
(Kaggle: Expedia Hotel Recommendations, 2016)
โ— Anonymizing feature names does not mean anonymization either
(Kaggle: Santander Value Prediction competition, 2018)
โ— Target can sometimes be recovered from the metadata or external datasets
(Kaggle: Dato โ€œTruly Native?โ€ competition, 2015)
โ— Overrepresented minority class in pairwise data enables reverse engineering
(Kaggle: Quora Question Pairs competition, 2017)
Prevention Measures
Leakage prevention checklist (not exhaustive!)
โ— Split the holdout away immediately and do not preprocess it in any way before final model
evaluation.
โ— Make sure you have a data dictionary and understand the meaning of every column, as well
as unusual values (e.g. negative sales) or outliers.
โ— For every column in the final feature set, try answering the question:
โ€œWill I have this feature at prediction time in my workflow? What values can it have?โ€
โ— Figure out preprocessing parameters on the training subset, freeze them elsewhere.
โ— Treat feature selection, model tuning, model selection as separate โ€œmachine learning
modelsโ€ that need to be validated separately.
โ— Make sure your validation setup represents the problem you need to solve with the model.
โ— Check feature importance and prediction explanations: do top features make sense?
Thank you!
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts

More Related Content

What's hot

ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ๅœญ่ผ” ๅคงๆ›ฝๆ น
ย 
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
cyberagent
ย 
Factorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender SystemsFactorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender Systems
Evgeniy Marinov
ย 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
cyberagent
ย 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Carlos Castillo (ChaTo)
ย 
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
KLab Inc. / Tech
ย 
Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018 Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018
Fernando Amat
ย 
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
Takashi J OZAKI
ย 
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
Keisuke Miyagawa
ย 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized Homepage
Justin Basilico
ย 
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdfChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
Ginpei Kobayashi
ย 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithmsnextlib
ย 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
Mounia Lalmas-Roelleke
ย 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
Karthik Murugesan
ย 
Data-Centric AIใฎ็ดนไป‹
Data-Centric AIใฎ็ดนไป‹Data-Centric AIใฎ็ดนไป‹
Data-Centric AIใฎ็ดนไป‹
Kazuyuki Miyazawa
ย 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
Takuya Azumi
ย 
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender SystemRecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
Ehsan38
ย 
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
harmonylab
ย 
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understandingๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Toru Tamaki
ย 
Triplet Loss ๅพนๅบ•่งฃ่ชฌ
Triplet Loss ๅพนๅบ•่งฃ่ชฌTriplet Loss ๅพนๅบ•่งฃ่ชฌ
Triplet Loss ๅพนๅบ•่งฃ่ชฌ
tancoro
ย 

What's hot (20)

ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ใ„ใพใ•ใ‚‰่žใ‘ใชใ„ๆฉŸๆขฐๅญฆ็ฟ’ใฎ่ฉ•ไพกๆŒ‡ๆจ™
ย 
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
ใƒžใƒƒใƒใƒณใ‚ฐใ‚ตใƒผใƒ“ใ‚นใซใŠใ‘ใ‚‹KPIใฎ่ฉฑ
ย 
Factorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender SystemsFactorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender Systems
ย 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
AbemaTVใซใŠใ‘ใ‚‹ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
ย 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
ย 
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใง้Ÿณใ‚ฒใƒผ่ญœ้ขใ‚’่‡ชๅ‹•ไฝœๆˆ๏ผ
ย 
Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018 Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018
ย 
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
21ไธ–็ด€ใงๆœ€ใ‚‚ใ‚ปใ‚ฏใ‚ทใƒผใช่ทๆฅญ๏ผ๏ผŸใ€Œใƒ‡ใƒผใ‚ฟใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใ‚นใƒˆใ€ใฎๅฎŸๅƒใซ่ฟซใ‚‹
ย 
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
่‡ชๅทฑ็ดนไป‹ใƒ•ใ‚šใƒฌใ‚ปใ‚™ใƒณ
ย 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized Homepage
ย 
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdfChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
ChatGPTใฎไป•็ต„ใฟใฎ่งฃ่ชฌใจๅฎŸๅ‹™ใฆใ‚™ใฎLLMใฎ้ฉ็”จใฎ็ดนไป‹_latest.pdf
ย 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithms
ย 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
ย 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
ย 
Data-Centric AIใฎ็ดนไป‹
Data-Centric AIใฎ็ดนไป‹Data-Centric AIใฎ็ดนไป‹
Data-Centric AIใฎ็ดนไป‹
ย 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
Autoware๏ผš ROSใ‚’็”จใ„ใŸไธ€่ˆฌ้“่‡ชๅ‹•้‹่ปขๅ‘ใ‘ใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใƒ—ใƒฉใƒƒใƒˆใƒ•ใ‚ฉใƒผใƒ 
ย 
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender SystemRecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
ย 
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
็ฏๆฒนใ‚ฟใƒณใ‚ฏๅ†…ใฎๆถฒ้ข้ซ˜่จˆๆธฌใ‚’็”จใ„ใŸ ็ฏๆฒนๆฎ‹้‡ๆŽจๅฎšใ‚ทใ‚นใƒ†ใƒ ใซ้–ขใ™ใ‚‹็ ”็ฉถ
ย 
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understandingๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ๆ–‡็Œฎ็ดนไป‹๏ผšVideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
ย 
Triplet Loss ๅพนๅบ•่งฃ่ชฌ
Triplet Loss ๅพนๅบ•่งฃ่ชฌTriplet Loss ๅพนๅบ•่งฃ่ชฌ
Triplet Loss ๅพนๅบ•่งฃ่ชฌ
ย 

Similar to Target Leakage in Machine Learning

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
Yuriy Guts
ย 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
ย 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
Mark Nichols, P.E.
ย 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Rodney Joyce
ย 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
Databricks
ย 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
All Things Open
ย 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
dtz001
ย 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
Jasjeet Thind
ย 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
gdgsurrey
ย 
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Databricks
ย 
Validation Is (Not) Easy
Validation Is (Not) EasyValidation Is (Not) Easy
Validation Is (Not) Easy
Dmytro Panchenko
ย 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
Databricks
ย 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
Roger Barga
ย 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclust
Jim Porzak
ย 
Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018
Analytics8
ย 
Human-Centered Interpretable Machine Learning
Human-Centered Interpretable  Machine LearningHuman-Centered Interpretable  Machine Learning
Human-Centered Interpretable Machine Learning
Przemek Biecek
ย 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...
PAPIs.io
ย 
Open06
Open06Open06
Open06butest
ย 
Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at Scale
Songtao Guo
ย 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
Yuriy Guts
ย 

Similar to Target Leakage in Machine Learning (20)

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
ย 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
ย 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
ย 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
ย 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
ย 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
ย 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
ย 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
ย 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
ย 
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
ย 
Validation Is (Not) Easy
Validation Is (Not) EasyValidation Is (Not) Easy
Validation Is (Not) Easy
ย 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
ย 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
ย 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclust
ย 
Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018
ย 
Human-Centered Interpretable Machine Learning
Human-Centered Interpretable  Machine LearningHuman-Centered Interpretable  Machine Learning
Human-Centered Interpretable Machine Learning
ย 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...
ย 
Open06
Open06Open06
Open06
ย 
Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at Scale
ย 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
ย 

More from Yuriy Guts

Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLP
Yuriy Guts
ย 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
Yuriy Guts
ย 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Yuriy Guts
ย 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
Yuriy Guts
ย 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
Yuriy Guts
ย 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of RedisYuriy Guts
ย 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
Yuriy Guts
ย 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
Yuriy Guts
ย 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
Yuriy Guts
ย 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsYuriy Guts
ย 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software ArchitectureYuriy Guts
ย 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business AnalystsYuriy Guts
ย 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceYuriy Guts
ย 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
Yuriy Guts
ย 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
Yuriy Guts
ย 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
Yuriy Guts
ย 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
Yuriy Guts
ย 

More from Yuriy Guts (17)

Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLP
ย 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
ย 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
ย 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
ย 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
ย 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of Redis
ย 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
ย 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
ย 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
ย 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
ย 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software Architecture
ย 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business Analysts
ย 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT Audience
ย 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
ย 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
ย 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
ย 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ย 


Target Leakage in Machine Learning

  • 1. Target Leakage in ML Yuriy Guts
  • 2. Machine Learning Engineer: Automated ML, time series forecasting, NLP. Lecturer: AI, Machine Learning, Summer/Winter ML Schools. Compete sometimes: currently hold an Expert rank, top 2% worldwide.
  • 4. Leakage in a Nutshell. [Timeline diagram along the time axis: data known at prediction time | prediction point | data unknown at prediction time. Label: contamination.]
  • 5. Leakage in a Nutshell (same timeline diagram). Training on contaminated data leads to overly optimistic expectations about model performance in production.
  • 6. “But I always validate on random K-fold CV. I should be fine, right?”
  • 7.
  • 9. Leakage in Data Collection
  • 10. Where is the leakage?
        EmployeeID  Title           ExperienceYears  MonthlySalaryGBP  AnnualIncomeUSD
        315981      Data Scientist  3                5,000.00          78,895.44
        4691        Data Scientist  4                5,500.00          86,784.98
        23598       Data Scientist  5                6,200.00          97,830.35
  • 11. Target is a function of another column (same table as slide 10). The target can have different formatting or measurement units in different columns. Forgetting to remove the copies will introduce target leakage. Check out the example: example-01-data-collection.ipynb
  • 12. Where is the leakage?
        SubscriberID  Group   DailyVoiceUsage  DailySMSUsage  DailyDataUsage  Gender
        24092091      M18-25  15.31            25             135.10          0
        4092034091    F40-60  35.81            3              5.01            1
        329815        F25-40  13.09            32             128.52          1
        94721835      M25-40  18.52            21             259.34          0
  • 13. Feature is an aggregate of the target (same table as slide 12). E.g., the data can have derived columns created after the fact for reporting purposes.
  • 14. Where is the leakage?
        Education  Married  AnnualIncome  Purpose         LatePaymentReminders  IsBadLoan
        1          Y        80k           Car Purchase    0                     0
        3          N        120k          Small Business  3                     1
        1          Y        85k           House Purchase  5                     1
        2          N        72k           Marriage        1                     0
  • 15. Mutable data due to lack of snapshot-ability (same table as slide 14). Database records get overwritten as more facts become available. But these later facts won’t be available at prediction time.
  • 16. Leakage in Feature Engineering
  • 17. My model is sensitive to feature scaling... Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate
  • 18. My model is sensitive to feature scaling... Dataset → StandardScaler → Partition into CV/Holdout → Train → Evaluate. OOPS. WE’RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET. Check out the example: example-02-data-prep.ipynb
  • 19. Removing leakage in feature engineering. Dataset → Partition. Train branch: StandardScaler (fit) → Train. Eval branch: StandardScaler (transform; apply mean and std obtained on train) → Evaluate. Obtain feature engineering/transformation parameters only on the training set. Apply them to transform the evaluation sets (CV, holdout, backtests, …).
  • 20. Encoding of different variable types. Text: learn DTM columns from the training set only, then transform the evaluation sets (avoid leaking possible out-of-vocabulary words into the training pipeline). Categoricals: create mappings on the training set only, then transform the evaluation sets (avoid leaking cardinality/frequency info into the training pipeline).
  • 22.
  • 24. Group Leakage. OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
  • 25. Paper v1 (AUC) Paper v3 (AUC)
  • 26. The Cold Start Problem Observe: Predict:
  • 27. Group Partitioning, Out-of-Group Validation Training: Validation: Fold 1 Fold 2 Fold 3
  • 28. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate
  • 29. Leakage in oversampling / augmentation. Original Data → Oversampling or Augmentation → Random Partitioning → Train / Evaluate. Pairwise Data (features for A and B) → Augmentation (AB → BA) → Random Partitioning → Train / Evaluate
  • 30. Leakage in oversampling / augmentation (same two diagrams as slide 29). OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
  • 31. Leakage in oversampling / augmentation (same two diagrams as slide 29). First partition, then augment the training data.
  • 32. Random Partitioning for Time-Aware Models
  • 33. Random Partitioning for Time-Aware Models
  • 36. Reusing a CV split for multiple tasks. [Diagram: five folds, Fold 1 to Fold 5, with cells labeled V and H.] Feature selection, hyperparameter tuning, model selection...
  • 37. Reusing a CV split for multiple tasks (same diagram as slide 36). Feature selection, hyperparameter tuning, model selection... OOPS. CAN OVERTUNE TO THE VALIDATION SET. BETTER USE DIFFERENT SPLITS FOR DIFFERENT TASKS
  • 38. Model stacking on in-sample predictions. Model 1: Train, Predict. Model 2: Train, Predict. 2nd Level Model: Train.
  • 39. Model stacking on in-sample predictions. Model 1: Train, Predict. Model 2: Train, Predict. 2nd Level Model: Train. OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
  • 40. Better way to stack. Model 1: Train, Predict (V). Model 2: Train, Predict (V). 2nd Level Model: Train. Compute all meta-features only out-of-fold.
  • 42. Case Studies
        ● Removing user IDs does not necessarily mean data anonymization (Kaggle: Expedia Hotel Recommendations, 2016)
        ● Anonymizing feature names does not mean anonymization either (Kaggle: Santander Value Prediction competition, 2018)
        ● Target can sometimes be recovered from the metadata or external datasets (Kaggle: Dato “Truly Native?” competition, 2015)
        ● Overrepresented minority class in pairwise data enables reverse engineering (Kaggle: Quora Question Pairs competition, 2017)
  • 44. Leakage prevention checklist (not exhaustive!)
        ● Split the holdout away immediately and do not preprocess it in any way before final model evaluation.
        ● Make sure you have a data dictionary and understand the meaning of every column, as well as unusual values (e.g. negative sales) or outliers.
        ● For every column in the final feature set, try answering the question: “Will I have this feature at prediction time in my workflow? What values can it have?”
        ● Figure out preprocessing parameters on the training subset, freeze them elsewhere.
        ● Treat feature selection, model tuning, model selection as separate “machine learning models” that need to be validated separately.
        ● Make sure your validation setup represents the problem you need to solve with the model.
        ● Check feature importance and prediction explanations: do top features make sense?
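
Two of the ideas in the slides, group partitioning (slide 27) and fitting preprocessing parameters on the training set only (slide 19), can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the code from the workshop notebooks: the toy `rows` data, the `patient` key, and the helpers `group_split`, `fit_scaler`, and `transform` are all hypothetical names invented here. In a real project a library such as scikit-learn would more likely be used.

```python
import random
import statistics


def group_split(rows, group_key, eval_fraction=0.25, seed=0):
    """Split rows so that each group lands entirely in train or entirely in eval."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_eval = max(1, round(len(groups) * eval_fraction))
    eval_groups = set(groups[:n_eval])
    train = [r for r in rows if r[group_key] not in eval_groups]
    evaluation = [r for r in rows if r[group_key] in eval_groups]
    return train, evaluation


def fit_scaler(values):
    """Learn the mean and standard deviation from the training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # fall back to 1.0 if variance is zero
    return mean, std


def transform(values, mean, std):
    """Apply previously learned (frozen) scaling parameters."""
    return [(v - mean) / std for v in values]


# Hypothetical toy data: two images per patient, one numeric feature "x".
rows = [
    {"patient": "p1", "x": 1.0}, {"patient": "p1", "x": 1.2},
    {"patient": "p2", "x": 3.0}, {"patient": "p2", "x": 2.8},
    {"patient": "p3", "x": 5.0}, {"patient": "p3", "x": 5.4},
    {"patient": "p4", "x": 9.0}, {"patient": "p4", "x": 8.6},
]

# 1. Partition first, keeping every patient on one side of the split.
train_rows, eval_rows = group_split(rows, "patient")

# 2. Fit the scaling parameters on the training part only.
mean, std = fit_scaler([r["x"] for r in train_rows])

# 3. Reuse the same frozen parameters for both parts.
train_x = transform([r["x"] for r in train_rows], mean, std)
eval_x = transform([r["x"] for r in eval_rows], mean, std)
```

The order of the steps is the point of the sketch: the evaluation rows never influence `mean` or `std`, and no patient contributes rows to both sides of the split.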