In this talk we will review common and subtle ways in which problem definitions can go wrong. Drawing on cases we encounter in the field, we will discuss target leaks (the use of information that cannot be available at prediction time), address sampling bias, and consider ways to identify and tackle them.
You'll hear many real-life examples of how these issues manifested and see how introducing automated feature engineering can change the way data scientists discover and treat them.
2. About Me
Meir Maor
Chief Architect @ SparkBeyond
At SparkBeyond we leverage the collective human knowledge to solve the world's
toughest problems
3. This talk
Problem setup mistakes: target leaks, sampling bias, and friends
How can we detect them? Can we look at the data in a way that makes these
flaws obvious?
Diverse examples from real (anonymized) problems.
4. Target Leak
Using information not actually available at prediction time: something from the
future, or something affected by the outcome itself.
Make sure all fields in your training data are indeed available. Easy right?
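One simple smoke test worth running before modeling: a single column that separates the target almost perfectly on its own is usually a leak, not a great feature. Below is a minimal sketch of such a check; the function name and the threshold are my own, not from the talk.

```python
import pandas as pd

def suspicious_columns(df: pd.DataFrame, target: str, threshold: float = 0.99) -> list:
    """Flag columns whose value alone separates the target almost perfectly."""
    flagged = []
    for col in df.columns:
        if col == target:
            continue
        # For each value of the column, take the majority-class rate of the
        # target; a weighted average near 1.0 means the column "knows" the answer.
        purity = (
            df.groupby(col)[target]
            .agg(lambda s: s.value_counts(normalize=True).iloc[0])
            .mul(df[col].value_counts(normalize=True).sort_index())
            .sum()
        )
        if purity >= threshold:
            flagged.append(col)
    return flagged
```

Run on a frame with a leaky column (an exact copy of the target) and a noise column, only the leak is flagged.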
5. A Retail Example
A large retailer wants to predict who will make a purchase and how much they
will spend.
Since there are big differences between first and repeat customers these were
modeled separately.
One of the fields we may use is Address; it carries lots of information. Many users
enter it at sign-up, so it's available at prediction time.
6. The leak
100% of those who have ordered have the address filled out, but far fewer had it
filled out initially.
Though the field is available at prediction time,
we do not have a temporal database to tell us what its value was back then.
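This kind of leak shows up immediately if you compare the field's fill rate by label. A toy sketch (column names and data are invented for illustration):

```python
import pandas as pd

# Hypothetical order snapshot: compare how often `address` is filled for
# buyers vs. non-buyers. A 100% fill rate among buyers only suggests the
# field was populated *after* the purchase, i.e. a target leak.
orders = pd.DataFrame({
    "made_purchase": [1, 1, 1, 0, 0, 0],
    "address":       ["12 Oak St", "3 Elm Rd", "9 Pine Ave", None, "7 Birch Ln", None],
})
fill_rate = orders.groupby("made_purchase")["address"].apply(lambda s: s.notna().mean())
print(fill_rate.to_dict())  # buyers: 1.0, non-buyers: ~0.33
```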
8. Mining for Unobtainium*
A client wants to find new Unobtainium deposits in the never-never lands.
A large part of the land has been explored and we have a map of the mines.
Many areas were not explored; for those we have no map.
* Identifying client details were changed
9. Modelling Take 1
Place a grid on the never-never land map
All grid squares with a known deposit are positive.
Since Unobtainium is rare, all others can be assumed to be negative.
Use advanced imaging, radiometric, magnetic, topographic, and geological
maps, and more, as explanatory variables.
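The labeling step above can be sketched in a few lines; the grid size and deposit coordinates here are made up:

```python
import numpy as np

# Sketch of the grid labeling: every cell containing a known deposit is
# positive; because Unobtainium is rare, every other cell is assumed negative.
GRID = 10                     # a 10x10 grid over the mapped area (hypothetical)
deposits = [(2, 3), (7, 7)]   # known deposit coordinates (hypothetical)

labels = np.zeros((GRID, GRID), dtype=int)
for x, y in deposits:
    labels[x, y] = 1
print(labels.sum(), "positive cells out of", labels.size)  # 2 out of 100
```

Note the built-in assumption that unexplored cells are negatives, which is exactly where the sampling bias enters.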
10. 99% AUC!! We are going to be rich!
Using topographic data, a big hole in the ground predicts a large deposit perfectly:
we are detecting existing active mines.
Back to the archives to find 50-year-old maps from before most mines were opened.
11. 96% AUC! We are going to be rich!
Distance from roads is an excellent predictor.
Not only do all existing mines have roads leading to them,
past exploration was primarily in accessible areas.
Removing roads is not enough: they are hidden in all the data.
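Why is dropping the road feature not enough? Because correlated features act as proxies for accessibility. A small simulation (all variables here are synthetic, for illustration only):

```python
import numpy as np

# Sketch: exploration bias hides in correlated features. Even if we drop the
# explicit distance-to-road column, a proxy of accessibility still encodes
# where people bothered to look.
rng = np.random.default_rng(0)
dist_to_road = rng.uniform(0, 10, 1000)
explored = dist_to_road < 3  # only accessible areas were ever explored

# A hypothetical proxy feature that happens to track road distance:
proxy = dist_to_road * 0.8 + rng.normal(0, 0.5, 1000)

# Dropping dist_to_road changes nothing: the proxy still separates
# explored (labeled) cells from unexplored (assumed-negative) cells.
corr = np.corrcoef(proxy, explored.astype(float))[0, 1]
print(round(corr, 2))  # strongly negative: the bias survives in the proxy
```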
13. A cure for cancer?
Early detection of cancer based on routine
medical tests.
14. Modeling take 1
Predict cancer X time units in advance of the actual discovery date.
For sick people, take data up to X prior to diagnosis.
For healthy people, take a fixed time window around an average diagnosis date.
Replace all dates with relative timestamps.
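The date anonymization step can be sketched as follows; the schema and the reference dates are invented for illustration:

```python
import pandas as pd

# Replace absolute test dates with time relative to each patient's reference
# date (diagnosis for positives, an assigned average date for negatives).
tests = pd.DataFrame({
    "patient": ["a", "a", "b"],
    "test_date": pd.to_datetime(["2019-01-01", "2019-03-01", "2020-06-15"]),
})
reference = {"a": pd.Timestamp("2019-06-01"), "b": pd.Timestamp("2020-07-01")}

tests["days_before_ref"] = tests.apply(
    lambda r: (reference[r["patient"]] - r["test_date"]).days, axis=1
)
print(tests["days_before_ref"].tolist())  # [151, 92, 16]
```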
15. We always model the easiest part
Detecting when the samples were taken is much easier than detecting cancer, so
that is what the model does.
16. Take 2
A quarterly snapshot, with different positives and negatives each quarter.
If we allow repeat patients, we get correlated examples.
If we randomly assign each patient to a quarter, we don't have enough positives.
If we deduplicate but keep all positives, we get a skewed distribution.
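To see why random assignment starves the positives, consider a quick simulation (the patient counts and positive rate are hypothetical):

```python
import numpy as np
import pandas as pd

# Sketch of the random-assignment option: each patient goes to exactly one
# quarter, so an already-rare positive class gets split four ways.
rng = np.random.default_rng(42)
patients = pd.DataFrame({
    "patient": range(1000),
    "is_positive": [1] * 20 + [0] * 980,  # ~2% positives (hypothetical rate)
})
patients["quarter"] = rng.integers(0, 4, size=len(patients))

per_quarter = patients.groupby("quarter")["is_positive"].sum()
print(per_quarter.tolist())  # only a handful of positives in each quarter
```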
17. Feature engineering
Each of these flaws is easily spotted when we look at a well-engineered feature that
exploits it.
Poorly engineered features may exploit the leak or bias to a limited extent and never
get discovered.
Complex models with simple features can exploit the leaks fully, but they are opaque,
so this can go unnoticed.
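Introspecting a model's top drivers makes this concrete. In the synthetic sketch below, a leaky feature (a noisy copy of the target) dominates the importances, which is exactly the red flag a transparent feature list surfaces. The data and feature set are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One genuine noise feature and one leak: the target plus a little noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([
    rng.normal(size=500),          # genuine (uninformative) feature
    y + rng.normal(0, 0.1, 500),   # leak: near-copy of the target
])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.feature_importances_)  # the leak column dwarfs the noise column
```

A single feature absorbing nearly all the importance on a supposedly hard problem deserves investigation before celebration.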
18. Automatic feature discovery
Exploit each leak to its fullest.
Human-understandable top insights expose target leaks.
Allow data scientists to focus on problem definition and complex feature engineering,
and to iterate rapidly.
SparkBeyond provides an AI-powered platform for finding insights in data, using not only the customer's problem-specific data but also discovering how that data relates to other data sources, provided by the customer or curated by SparkBeyond.
The talk is driven by three examples, covering some well-known and less well-known issues and how we can detect and deal with them.
With the right engineered feature, finding the leak is trivial. With generic modeling and general-purpose feature engineering, the leak may go unnoticed. Introspect your models and look at the top drivers.