Can Automated Feature Engineering Prevent Target Leaks?
The many ways you set up your problem wrong
Meir Maor
About Me
Meir Maor
Chief Architect @ SparkBeyond
At SparkBeyond we leverage the collective human knowledge to solve the world's
toughest problems
This talk
Problem setup mistakes: target leaks, sampling bias, and friends.
How can we detect them? Can we look at the data in a way that makes these
flaws obvious?
Diverse examples from real (anonymized) problems.
Target Leak
Using information not actually available at prediction time: something from the
future, or something affected by the outcome itself.
Make sure all fields in your training data are indeed available at prediction time. Easy, right?
A Retail Example
A large retailer wants to predict who will make a purchase and how much he
or she will spend.
Since first-time and repeat customers differ greatly, they were
modeled separately.
One of the fields we may use is Address; it carries a lot of information. Many users
enter it at sign-up, so it is available at prediction time.
The leak
100% of those who have ordered have the address filled out, which was not the
case initially.
The field is available at prediction time, but
we do not have a temporal database to tell us what its value was at that time.
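To make the missing piece concrete, here is a minimal sketch of a point-in-time ("as of") lookup. It assumes a hypothetical change log of address edits per customer, which is exactly what the retailer in this example did not have; the names `ADDRESS_HISTORY`, `address_as_of`, and `current_address` are illustrative only.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical change log: customer_id -> list of (changed_at, address), sorted by date.
# The retailer in the example had no such record; this shows what one would provide.
ADDRESS_HISTORY = {
    "c1": [(date(2016, 1, 5), ""), (date(2016, 3, 2), "12 Main St, Springfield")],
    "c2": [(date(2016, 1, 9), "4 Oak Ave, Shelbyville")],
}

def address_as_of(history, customer_id, when):
    """Return the address recorded at or before `when`, or None if no record exists yet."""
    changes = history.get(customer_id, [])
    dates = [changed_at for changed_at, _ in changes]
    idx = bisect_right(dates, when)  # number of changes made on or before `when`
    return changes[idx - 1][1] if idx else None

def current_address(history, customer_id):
    """What a non-temporal table stores: only the latest value."""
    changes = history.get(customer_id, [])
    return changes[-1][1] if changes else None

prediction_time = date(2016, 2, 1)
leaky = current_address(ADDRESS_HISTORY, "c1")                  # filled in after ordering
honest = address_as_of(ADDRESS_HISTORY, "c1", prediction_time)  # still empty at that time
```

With only the current snapshot, training sees the filled-in address for `c1`; the as-of lookup returns the empty value the model would really have seen at prediction time.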
Feature engineering Address
Token TF-IDF
ZipCode / county
Geo-location
Address length, address non-empty
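The simpler features in this list can be sketched in a few lines. The rows below are made up, the 5-digit postal code pattern is an assumption, and `address_features` and `rate_by_flag` are hypothetical helpers; the point is only that a feature as plain as "address non-empty" can separate purchasers from non-purchasers when the field is filled in after the order.

```python
import re

def address_features(address):
    """Simple engineered features from a free-text address (None or '' means missing)."""
    text = (address or "").strip()
    zip_match = re.search(r"\b\d{5}\b", text)  # assumption: 5-digit postal codes
    return {
        "non_empty": bool(text),
        "length": len(text),
        "zip_code": zip_match.group(0) if zip_match else None,
        "tokens": text.lower().replace(",", " ").split(),
    }

def rate_by_flag(rows, flag):
    """Purchase rate among rows where the boolean feature `flag` is True vs. False."""
    rates = {}
    for value in (True, False):
        group = [r["purchased"] for r in rows
                 if address_features(r["address"])[flag] == value]
        rates[value] = sum(group) / len(group) if group else None
    return rates

# Made-up rows for illustration only.
rows = [
    {"address": "12 Main St, Springfield 12345", "purchased": 1},
    {"address": "4 Oak Ave, Shelbyville 54321", "purchased": 1},
    {"address": "9 Elm Rd, Ogdenville 11111", "purchased": 0},
    {"address": "", "purchased": 0},
    {"address": None, "purchased": 0},
]
rates = rate_by_flag(rows, "non_empty")
```

In this toy data nobody with an empty address has purchased, so `rates[False]` is zero: the kind of too-clean split that should prompt a question about when the field was populated.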
Mining for Unobtainium*
A client in Never-Never Land wants to find new Unobtainium deposits there.
A large part of the land has been explored, and we have a map of the mines.
Many areas were not explored; for those we have no map.
* Identifying client details were changed
Modelling Take 1
Place a grid on the Never-Never Land map.
All grid squares with a known deposit are positive.
Since Unobtainium is rare, all others can be assumed to be negative.
Use advanced imaging, radiometric, magnetic, topographic, and geological
maps, and more, as explanatory variables.
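The labelling scheme above can be sketched as follows. The grid size and coordinates are made up, and the `explored` flag is an addition for illustration rather than part of the setup described: it makes visible how many "negatives" were never actually checked.

```python
def label_grid(n_rows, n_cols, deposits, explored):
    """Label each grid cell as in the first modelling attempt, keeping an explored flag."""
    cells = []
    for r in range(n_rows):
        for c in range(n_cols):
            cell = (r, c)
            cells.append({
                "cell": cell,
                "label": 1 if cell in deposits else 0,  # every non-deposit cell assumed negative
                "explored": cell in explored,
            })
    return cells

# Made-up 3x3 map: two known deposits, four explored cells.
deposits = {(0, 1), (2, 2)}
explored = {(0, 0), (0, 1), (1, 1), (2, 2)}
cells = label_grid(3, 3, deposits, explored)

# Negatives that nobody ever surveyed: the label here reflects exploration, not geology.
unverified_negatives = [c["cell"] for c in cells if c["label"] == 0 and not c["explored"]]
```

In this toy grid, five of the seven negatives are unverified, so any variable that correlates with where people have explored will look predictive.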
99% AUC!! We are going to be rich!
Using topographic data, a big hole in the ground predicts a large deposit perfectly.
We are detecting existing active mines.
Back to the archives to find 50-year-old maps from before most mines were opened.
96% AUC! We are going to be rich!
Distance from roads is an excellent predictor.
Not only do all existing mines have roads leading to them,
but past exploration was also primarily in accessible areas.
Removing roads is not enough; they are hidden in all the data.
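One way to see how much a single variable such as distance from roads explains is to compute its AUC on its own. The sketch below uses the pairwise definition of AUC on made-up numbers; `auc` is a hypothetical helper written for this illustration, not a library call.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up grid cells: distance to the nearest road in km, and the deposit label.
distance_km = [0.2, 0.5, 1.0, 8.0, 12.0, 20.0, 3.0, 0.8]
has_deposit = [1, 1, 1, 0, 0, 0, 0, 0]

# Closer to a road should score higher, so negate the distance.
road_auc = auc([-d for d in distance_km], has_deposit)
```

A single accessibility variable scoring this high on its own suggests the model is learning where people have looked, not where the deposits are.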
A cure for cancer?
Early detection of cancer based on routine
medical tests.
Modelling Take 1
Predict cancer X time units ahead of the date it is currently discovered.
For sick people, take data up to X before diagnosis.
For healthy people, take a fixed time window relative to an average diagnosis date.
Replace all dates with relative timestamps.
We always model the easiest part
Detecting when the samples were taken is much easier than detecting cancer, so
that is what the model does.
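A small sketch of how the take-1 windows can leak timing. The 90-day lead, the reference date, and the two patients are all made up, and the mechanism shown (sick patients being tested shortly before diagnosis) is one plausible assumption, not something the slides spell out.

```python
from datetime import date, timedelta

LEAD = timedelta(days=90)              # the "X" above; 90 days is an assumption
AVERAGE_DIAGNOSIS = date(2015, 6, 30)  # hypothetical reference date for healthy patients

def relative_days(test_dates, diagnosis_date=None):
    """Days before the cutoff for each test taken at or before it (take-1 setup)."""
    reference = diagnosis_date if diagnosis_date is not None else AVERAGE_DIAGNOSIS
    cutoff = reference - LEAD
    return sorted((cutoff - d).days for d in test_dates if d <= cutoff)

def days_since_last_test(test_dates, diagnosis_date=None):
    """Relative timestamp of the most recent test inside the window, or None."""
    rel = relative_days(test_dates, diagnosis_date)
    return rel[0] if rel else None

# Made-up patients, chosen to exhibit the effect.
sick = days_since_last_test(
    [date(2015, 1, 10), date(2015, 4, 20)], diagnosis_date=date(2015, 8, 1)
)
healthy = days_since_last_test([date(2013, 11, 2), date(2014, 9, 15)])
```

Here the sick patient's last test falls days before the cutoff while the healthy patient's falls months before it, so "time since last sample" alone separates the two without any medical signal.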
Take 2
A quarterly snapshot, with different positives and negatives each quarter.
If we allow repeat patients, we get correlated examples.
If we randomly assign each patient to a quarter, we don't have enough positives.
If we deduplicate but keep all positives, we get a skewed distribution.
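The trade-offs above can be seen on a toy cohort. Everything below is made up (100 patients, 8 quarters, 10 of them diagnosed in the last quarter), and the helper names are hypothetical; the sketch only compares what each sampling choice does to the number and rate of positives.

```python
import random

def positive_rate(examples):
    """Share of (patient, quarter, label) examples with label 1."""
    return sum(label for _, _, label in examples) / len(examples)

def group_by_patient(examples):
    by_patient = {}
    for ex in examples:
        by_patient.setdefault(ex[0], []).append(ex)
    return by_patient

def one_quarter_per_patient(examples, rng):
    """Randomly keep a single (patient, quarter) example per patient."""
    return [rng.choice(exs) for exs in group_by_patient(examples).values()]

def dedupe_keep_positives(examples, rng):
    """Keep a patient's positive example if one exists, else a random negative one."""
    kept = []
    for exs in group_by_patient(examples).values():
        positives = [ex for ex in exs if ex[2] == 1]
        kept.append(positives[0] if positives else rng.choice(exs))
    return kept

# Made-up cohort: patients 0-9 become positive in quarter 7.
examples = [(p, q, int(p < 10 and q == 7)) for p in range(100) for q in range(8)]
rng = random.Random(0)

all_rate = positive_rate(examples)               # repeats allowed: correlated rows
random_sample = one_quarter_per_patient(examples, rng)
n_random_positives = sum(label for _, _, label in random_sample)
skewed_rate = positive_rate(dedupe_keep_positives(examples, rng))
```

With repeats the positive rate is 10 in 800; keeping every positive while deduplicating raises it to 10 in 100, eight times the true per-quarter rate, while random assignment keeps the rate honest but retains only a handful of the ten positives on average.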
Feature engineering
Each of the flaws is easily spotted when we look at a well-engineered feature that
exploits it.
Poorly engineered features may exploit the leak or bias only to a limited extent, so it is never
discovered.
Complex models with simple features can exploit the leaks fully, but they are opaque,
so this can go unnoticed.
Automatic feature discovery
Exploits each leak to its fullest
Human-understandable top insights reveal target leaks
Allows data scientists to focus on problem definition and complex feature engineering,
and to iterate rapidly.
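As a rough illustration of "look at the top insights", the sketch below ranks candidate features by how well each one separates the classes on its own and flags the suspiciously strong ones. This is not SparkBeyond's method; the feature names, the data, and the 0.85 threshold are all made up for the example.

```python
def auc(scores, labels):
    """Pairwise AUC: chance a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(columns, labels, suspicious=0.85):
    """Sort features by direction-agnostic AUC; flag those above an arbitrary threshold."""
    report = []
    for name, values in columns.items():
        a = auc(values, labels)
        strength = max(a, 1 - a)  # a feature that ranks classes in reverse is just as telling
        report.append((name, round(strength, 3), strength >= suspicious))
    return sorted(report, key=lambda row: row[1], reverse=True)

# Made-up data for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
columns = {
    "address_non_empty": [1, 1, 1, 1, 0, 0, 0, 0],
    "account_age_days": [30, 400, 12, 45, 200, 9, 310, 150],
    "newsletter_opt_in": [1, 0, 1, 0, 1, 0, 0, 1],
}
report = rank_features(columns, labels)
```

A flagged feature is a prompt for a human to ask whether the value could really have been known at prediction time, not proof of a leak.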
Join Us
http://www.sparkbeyond.com/careers
Try the SparkBeyond Challenge: http://bit.ly/dss16-quiz


Editor's Notes

  1. SparkBeyond provides an AI-powered platform for finding insights in data. It uses not only the customer's problem-specific data but also discovers how that data relates to other data sources, whether provided by the customer or curated by SparkBeyond.
  2. Driven by three examples. We show some well-known and less well-known issues, and how we can detect and deal with them.
  3. With the correct engineered feature, finding the leak is trivial. With generic modeling and general-purpose feature engineering, the leak may go unnoticed. Introspect your models; look at the top drivers.