In this talk we will review common and subtle ways in which problem definitions can go wrong. Drawing on cases we encounter in the field, we will discuss target leaks (the use of information that cannot be available at prediction time), address sampling bias, and consider ways to identify and tackle them.
You'll hear many real-life examples of how these issues manifested and see how introducing automated feature engineering can change the way data scientists discover and treat them.
2. About Me
Meir Maor
Chief Architect @ SparkBeyond
At SparkBeyond we leverage the collective human knowledge to solve the world's
toughest problems
3. This talk
Problem setup mistakes: target leaks, sampling bias, and friends
How can we detect them? Can we look at the data in a way that makes these
flaws obvious?
Diverse examples from real (anonymized) problems.
4. Target Leak
Using information not actually available at prediction time: something from the
future, or something affected by the outcome itself.
Make sure all fields in your training data are indeed available. Easy right?
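One simple smoke test worth running before modeling: a single column that separates the target almost perfectly on its own is usually a leak, not a great feature. Below is a minimal sketch of such a check; the function name and the threshold are my own, not from the talk.

```python
import pandas as pd

def suspicious_columns(df: pd.DataFrame, target: str, threshold: float = 0.99) -> list:
    """Flag columns whose value alone separates the target almost perfectly."""
    flagged = []
    for col in df.columns:
        if col == target:
            continue
        # For each value of the column, take the majority-class rate of the
        # target; a weighted average near 1.0 means the column "knows" the answer.
        purity = (
            df.groupby(col)[target]
            .agg(lambda s: s.value_counts(normalize=True).iloc[0])
            .mul(df[col].value_counts(normalize=True).sort_index())
            .sum()
        )
        if purity >= threshold:
            flagged.append(col)
    return flagged
```

Run on a frame with a leaky column (an exact copy of the target) and a noise column, only the leak is flagged.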
5. A Retail Example
A large retailer wants to predict who will make a purchase and how much they
will spend.
Since there are big differences between first and repeat customers these were
modeled separately.
One of the fields we may use is Address; it carries lots of information. Many users
enter it at sign-up, so it's available at prediction time.
6. The leak
100% of those who have ordered have the address filled out, but far fewer had it
filled out initially.
Though the field is available at prediction time,
we do not have a temporal database to tell us what its value was back then.
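This kind of leak shows up immediately if you compare the field's fill rate by label. A toy sketch (column names and data are invented for illustration):

```python
import pandas as pd

# Hypothetical order snapshot: compare how often `address` is filled for
# buyers vs. non-buyers. A 100% fill rate among buyers only suggests the
# field was populated *after* the purchase, i.e. a target leak.
orders = pd.DataFrame({
    "made_purchase": [1, 1, 1, 0, 0, 0],
    "address":       ["12 Oak St", "3 Elm Rd", "9 Pine Ave", None, "7 Birch Ln", None],
})
fill_rate = orders.groupby("made_purchase")["address"].apply(lambda s: s.notna().mean())
print(fill_rate.to_dict())  # buyers: 1.0, non-buyers: ~0.33
```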
8. Mining for Unobtainium*
A client wants to find new Unobtainium deposits in the never-never lands.
A large part of the land has been explored and we have a map of the mines.
Many areas were not explored; for those we have no map.
* Identifying client details were changed
9. Modelling Take 1
Place a grid on the never-never land map
All grid squares with a known deposit are positive.
Since Unobtainium is rare, all others can be assumed to be negative.
Use advanced imaging, radiometric, magnetic, topographic, and geological
maps, and more, as explanatory variables.
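The labeling step above can be sketched in a few lines; the grid size and deposit coordinates here are made up:

```python
import numpy as np

# Sketch of the grid labeling: every cell containing a known deposit is
# positive; because Unobtainium is rare, every other cell is assumed negative.
GRID = 10                     # a 10x10 grid over the mapped area (hypothetical)
deposits = [(2, 3), (7, 7)]   # known deposit coordinates (hypothetical)

labels = np.zeros((GRID, GRID), dtype=int)
for x, y in deposits:
    labels[x, y] = 1
print(labels.sum(), "positive cells out of", labels.size)  # 2 out of 100
```

Note the built-in assumption that unexplored cells are negatives, which is exactly where the sampling bias enters.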
10. 99% AUC!! We are going to be rich!
Using topographic data, a big hole in the ground predicts a large deposit perfectly:
we are detecting existing active mines.
Back to the archives to find 50-year-old maps from before most mines were opened.
11. 96% AUC! We are going to be rich!
Distance from roads is an excellent predictor.
Not only do all existing mines have roads leading to them,
past exploration was primarily in accessible areas.
Removing roads is not enough: they are hidden in all the data.
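Why is dropping the road feature not enough? Because correlated features act as proxies for accessibility. A small simulation (all variables here are synthetic, for illustration only):

```python
import numpy as np

# Sketch: exploration bias hides in correlated features. Even if we drop the
# explicit distance-to-road column, a proxy of accessibility still encodes
# where people bothered to look.
rng = np.random.default_rng(0)
dist_to_road = rng.uniform(0, 10, 1000)
explored = dist_to_road < 3  # only accessible areas were ever explored

# A hypothetical proxy feature that happens to track road distance:
proxy = dist_to_road * 0.8 + rng.normal(0, 0.5, 1000)

# Dropping dist_to_road changes nothing: the proxy still separates
# explored (labeled) cells from unexplored (assumed-negative) cells.
corr = np.corrcoef(proxy, explored.astype(float))[0, 1]
print(round(corr, 2))  # strongly negative: the bias survives in the proxy
```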
13. A cure for cancer?
Early detection of cancer based on routine
medical tests.
14. Modeling take 1
Predict cancer X time units in advance of the actual discovery date.
For sick people, take data up to X prior to diagnosis.
For healthy people, take a fixed time window around an average diagnosis date.
Replace all dates with relative timestamps.
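The date anonymization step can be sketched as follows; the schema and the reference dates are invented for illustration:

```python
import pandas as pd

# Replace absolute test dates with time relative to each patient's reference
# date (diagnosis for positives, an assigned average date for negatives).
tests = pd.DataFrame({
    "patient": ["a", "a", "b"],
    "test_date": pd.to_datetime(["2019-01-01", "2019-03-01", "2020-06-15"]),
})
reference = {"a": pd.Timestamp("2019-06-01"), "b": pd.Timestamp("2020-07-01")}

tests["days_before_ref"] = tests.apply(
    lambda r: (reference[r["patient"]] - r["test_date"]).days, axis=1
)
print(tests["days_before_ref"].tolist())  # [151, 92, 16]
```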
15. We always model the easiest part
Detecting when the samples were taken is much easier than detecting cancer, so
that is what the model does.
16. Take 2
A quarterly snapshot, with different positives and negatives each quarter.
If we allow repeat patients, we get correlated examples.
If we randomly assign each patient to a quarter, we don't have enough positives.
If we deduplicate but keep all positives, we get a skewed distribution.
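To see why random assignment starves the positives, consider a quick simulation (the patient counts and positive rate are hypothetical):

```python
import numpy as np
import pandas as pd

# Sketch of the random-assignment option: each patient goes to exactly one
# quarter, so an already-rare positive class gets split four ways.
rng = np.random.default_rng(42)
patients = pd.DataFrame({
    "patient": range(1000),
    "is_positive": [1] * 20 + [0] * 980,  # ~2% positives (hypothetical rate)
})
patients["quarter"] = rng.integers(0, 4, size=len(patients))

per_quarter = patients.groupby("quarter")["is_positive"].sum()
print(per_quarter.tolist())  # only a handful of positives in each quarter
```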
17. Feature engineering
Each of these flaws is easily spotted when we look at a well-engineered feature that
exploits it.
Poorly engineered features may exploit the leak or bias to a limited extent and never
get discovered.
Complex models with simple features can exploit the leaks fully, but they are opaque,
so this can go unnoticed.
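Introspecting a model's top drivers makes this concrete. In the synthetic sketch below, a leaky feature (a noisy copy of the target) dominates the importances, which is exactly the red flag a transparent feature list surfaces. The data and feature set are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One genuine noise feature and one leak: the target plus a little noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([
    rng.normal(size=500),          # genuine (uninformative) feature
    y + rng.normal(0, 0.1, 500),   # leak: near-copy of the target
])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.feature_importances_)  # the leak column dwarfs the noise column
```

A single feature absorbing nearly all the importance on a supposedly hard problem deserves investigation before celebration.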
18. Automatic feature discovery
Exploit each leak to its fullest.
Human-understandable top insights expose target leaks.
Allow data scientists to focus on problem definition and complex feature engineering,
and to iterate rapidly.
SparkBeyond provides an AI-powered platform for finding insights in data, using not only the customer's problem-specific data but also discovering how that data relates to other data sources, provided by the customer or curated by SparkBeyond.
The talk is driven by three examples, covering some well-known and less well-known issues and how we can detect and deal with them.
With the right engineered feature, finding the leak is trivial. With generic modeling and general-purpose feature engineering, the leak may go unnoticed. Introspect your models and look at the top drivers.