The document discusses common issues that can arise from bad or problematic data when applying machine learning. It provides nine examples of real problems the author has encountered, ranging from simple issues like double-counting cancelled orders to more complex issues involving schema changes not being properly communicated. The key message is that even simple systems can encounter data problems, and it is important to audit data for errors or inconsistencies, detect schema changes, and clearly define metrics to avoid "garbage in, garbage out" situations that produce bad machine learning models.
Garbage in, garbage out
1. Rubbish in
Rubbish out
Nine real examples of bad data collection
leading to bad machine learning models
Bertil Hatt
Data science
1 Innovation can be amazing.
So amazing that it’s often seen as magical.
But magic isn’t real.
Innovation is often expected
to leapfrog problems.
That can lead to very painful results.
In my experience,
- not having good analytics, or
- a partial understanding of your product
can lead to bad surprises when innovating.
The explanation goes
Rubbish in, rubbish out.
But that sentence is often dropped
like an absolute truth, a counterpoint
as if it needs no detail, no explanation.
Garbage in, garbage out - 18 September 2018
2. Bad data is everywhere
2 This presentation is here to give
examples of what actually happens
when you try to apply machine learning
on top of problematic data.
This presentation is the distillation of
a long, diverse experience
at many different kinds of institutions.
I’ve seen the problems
that I’m going to talk about,
in many companies.
I wanted to make this generic,
and not specific
to one circumstance or another.
3. Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
3 So I imagined
two very simple fictional companies,
much simpler than the companies
that I used to work for.
One is selling things on-line:
think of it as a generic e-commerce website.
I picked protein powder,
because that’s plain and boring.
The other is a minimalist game studio.
Nothing exotic: a FarmVille clone,
like literally dozens of them.
4. “This is stupid”
4 However,
even with very basic structure,
there are plenty of things that can go wrong.
The reaction that I expect
for most of these problems is:
This is stupid.
You obviously shouldn’t do things that way.
I know.
My point today is that this is not hard.
5. “This is stupid”
• Most errors are stupid & easy to fix once you know them
• If you prioritise only measured impact, bad data goes undetected
• Bad data silently hurts your decisions & experiments
5 My point is that
all of this is easy to forget,
- either because you rely on junior people,
- or just because you are tired.
Expect data to be bad:
it will always be in some way.
But bad data isn’t always inherently bad:
- a lot of the time, the data is exact,
just not well documented;
- it can even be well documented,
but the analyst or data scientist
has overlooked its complexity.
Awareness is what matters:
communication, jokes, veteran stories.
This is my veteran story.
6. Sell protein powder
Naive e-commerce example
6 ===18 min==
Imagine the simplest possible
e-commerce website.
You sell bags of protein powder.
Put them in a box, ship it. That’s it.
What kind of data science
can help you do that?
7. Possible uses
of data science
Customer lifetime spending
• LTV>CPA: Lifetime value to set cost-per-acquisition
• RFM: Recency, Frequency, aMount triggers reactivation
• Recommend product; No variety, so bundling size
7 The only information that you have is
how many times & when a customer orders.
That means a lot of options in marketing:
First a classic:
computing the lifetime value of your users.
- You can use that to estimate
the value of future customers.
- Once you have that, you can
compare it to your cost-per-acquisition and
decide which channels to invest in.
Another good one is the rhythm of orders:
- Have regular customers stopped ordering?
- If that’s the case, you know
to whom you should reach out.
Finally, product recommendation.
If all you have is the same protein… meh.
You could decide to bundle
into bigger orders, or long-term orders.
We will see how that will affect your data.
8. • Customer orders package
• Pays for it (or fails to)
• Deliveries can fail
• Might cancel & reimburse, or
Re-deliver the same order
• International business
8 So, what do we have:
Customers, orders.
The customer should pay for it.
- Payments may or may not work.
- Fulfilment, that’s you, so that should be fine.
- Deliveries might fail.
If a delivery fails,
- Some customers will want to be reimbursed;
- Some will want a new delivery.
And let’s say you have international customers.
So: what does it take, on the code side?
9. Customer: id, delivery_address, email, …
Order: id, customer_id, quantity, status, …
Payment: id, order_id, currency, status, …
Delivery: id, order_id, address, status, …
Currency: id, currency_code, fx_gbp, last_update, …
Price: quantity, price_gbp
Transaction: order_id, timestamp, old_status, new_status, …
Order statuses:
• Waiting payment
• Fulfilling
• On route
• Delivered
• Delivery failed
• Cancelled
9 You should all have a schema in your head.
A very simple schema: Order, customer
>> You probably want to normalise and
have payment attempts and
deliveries attempts into their own tables.
>> For payment, you will need
some reference tables: price, exchange rate.
>> The thing is: orders and deliveries
can go through a lot of statuses.
You probably want to track that too.
Having mutable tables is dangerous.
>> So let’s keep track of all transactions,
at least the financially relevant ones.
Note: even a simple case has fun questions:
- where do you store the address:
customer or order?
How do you handle address changes, multiple addresses? Postcodes?
Let’s ignore all that and focus on
what we need for analytics and data science.
11. 1. MECE, cancellation,
duplicates &
recommendations
11 ===15 min==
How could that possibly break?
First example!
12. Are Italian customers that much more valuable?

transaction_id | order_id | country | transaction_type | amount
1              | 1        | UK      | Create           | 10
2              | 2        | Italy   | Create           | 20
3              | 2        | Italy   | Cancel           | 20
4              | 3        | UK      | Create           | 14

country | # customers | LTV
UK      | 2           | 12
Italy   | 1           | 40
12 You have the transaction table,
and you aggregated it by country.
Here’s a very simple version:
how much each of three customers
in two countries contributes.
I’ll let you look at the details.
Is there anything shocking?
Is the single Italian customer really
worth three and a half times more
than the average British one?
(No: a cancelled order was counted twice!)
13. Solution
Check that Total revenue =
Sum of revenue per user, country
13 How do you avoid that kind of mistake?
Audit your intermediary tables.
Take your total revenue per country, per user
and compare it to your total revenue overall.
That should also reveal
more subtle edge cases:
like paid orders never delivered,
partners who left without being paid, etc.
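That audit can be sketched in a few lines of plain Python. The figures mirror the slide’s transaction table; the fix is netting cancellations, then reconciling the aggregate against the global total:

```python
# Minimal audit sketch (figures from the slide): cancellations must be netted
# out, and the per-country sum must reconcile with the overall total.
transactions = [
    {"order_id": 1, "country": "UK",    "type": "Create", "amount": 10},
    {"order_id": 2, "country": "Italy", "type": "Create", "amount": 20},
    {"order_id": 2, "country": "Italy", "type": "Cancel", "amount": 20},
    {"order_id": 3, "country": "UK",    "type": "Create", "amount": 14},
]

def net_revenue_by_country(rows):
    totals = {}
    for row in rows:
        sign = 1 if row["type"] == "Create" else -1  # a Cancel reverses the charge
        totals[row["country"]] = totals.get(row["country"], 0) + sign * row["amount"]
    return totals

by_country = net_revenue_by_country(transactions)
total = sum(r["amount"] if r["type"] == "Create" else -r["amount"]
            for r in transactions)

# The audit itself: the aggregated view must sum back to the global figure.
assert sum(by_country.values()) == total
```

With the netting applied, Italy’s cancelled order contributes 0 instead of the misleading 40, and both views agree on the same total.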
14. 2. FX Conversion & the
wealth of Indonesia
14 Second really simple example,
very real this one.
15. Should we drop the bank on growth in Indonesia?

country   | # customers | revenue/customer (USD)
EURO      | 24,541      | 505
US        | 21,588      | 495
UK        | 8,665       | 299
Canada    | 1,547       | 877
Indonesia | 2,452       | 7,682,540,030
India     | 9,574       | 533,333
15 Here is an estimation of how much revenue
we got from different countries.
Anything suspicious?
(The foreign exchange ratio got inverted)
16. Solution
Check that Total revenue =
Sum of revenue per user, country
16 Same as previously: “This is stupid.”
That’s the point.
Errors that end up breaking data science are
not sophisticated most of the time.
Same solution: audit your data.
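The same audit idea catches an inverted exchange rate. A hedged sketch: the rates, amounts, and plausibility band below are made-up assumptions, but an inverted IDR rate instantly blows past any sane bound:

```python
# Illustrative FX sanity check: rates and figures are made up, not real quotes.
fx_to_usd = {"IDR": 1 / 14000, "INR": 1 / 70, "GBP": 1.3}

def to_usd(amount, currency, rates):
    return amount * rates[currency]

# An inverted rate (14000 instead of 1/14000) makes Indonesia look absurdly rich.
inverted = to_usd(1_000_000, "IDR", {"IDR": 14000})
correct = to_usd(1_000_000, "IDR", fx_to_usd)

def plausible(rev_per_customer, low=1, high=100_000):
    # Bounds are an assumption: no single customer plausibly spends billions.
    return low <= rev_per_customer <= high

assert not plausible(inverted)  # the audit flags the inversion
assert plausible(correct)
```

A bound check like this would have flagged the 7.6-billion-dollar Indonesian customer before any model saw it.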
17. 3. Missing category,
defaults & improper
likelihood
17 Third error, a little more nuanced.
18. Denormalised customer: id, delivery_address, traffic_attribution, currency, email, nb_total_orders, nb_cancellations, first_delivery_ts, latest_delivery_ts, …, lifetime_value, should_reactivate

LTV per country: country_id, country_name, # customers, total_nb_orders, avg order_amount, avg lifetime_value
18 You want to aggregate a metric per country.
You just need to assign
a country to an address.
That’s easy, right?
But: what is a country? Is ‘Wales’ a country?
There should be internal services doing that,
but they might not have the right intent:
Tax, currency, language, traffic attribution,
logistics, culture, business analysis?
- Åland pays in Euros, but VAT is not Finnish.
- French Polynesia has different everything
but, the currency is pegged to the Euro.
- The Czech Republic tried to rebrand as Czechia.
- What about Northern Ireland?
For all official business it is in the UK,
but logistically, it’s on the island of Ireland.
Having dedicated services that serve
business-relevant groups or regions really helps.
19. Solution
Categorisation as a service with
intent: tax, business insight, etc.
19 Same for car types:
we made a mistake when
serving the Recs Algorithm, because:
- one service said Crossovers were ‘SUVs’,
- another service said they were ‘Other cars’.
Traffic attribution is a set of categories,
but those categories might not be clear.
- Some search users type our brand name.
Is that part of the attribution group
“AdWords” or a separate “Brand” one?
- What about reactivating users via AdWords?
- Let’s not talk about mobile.
All those distinctions are good ideas,
but if you change any of it, tell everyone.
Or better: build & share services
to handle that well for the whole company,
and update those services.
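One way to sketch “categorisation as a service with intent”: the same place maps to different groups depending on why you ask, so the intent is an explicit parameter. The mappings below are illustrative assumptions, not authoritative geography:

```python
# Sketch of categorisation with explicit intent. The point: no consumer
# silently inherits another team's definition of "country".
REGION_BY_INTENT = {
    "tax":       {"Northern Ireland": "UK",      "Åland": "Åland"},
    "logistics": {"Northern Ireland": "Ireland", "Åland": "Finland"},
    "currency":  {"Northern Ireland": "GBP",     "Åland": "EUR"},
}

def categorise(place, intent):
    """One shared service; callers must state which grouping they need."""
    return REGION_BY_INTENT[intent][place]

# The same place, three different (all correct) answers:
assert categorise("Northern Ireland", "tax") == "UK"
assert categorise("Northern Ireland", "logistics") == "Ireland"
assert categorise("Åland", "currency") == "EUR"
```

Because the grouping lives in one service, a change (say, Czechia) is made once and every consumer sees it, instead of each analyst maintaining a private, diverging mapping.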
20. 4. Not delivering
leads to high retention.
Maybe not high LTV.
20 ===12 min==
What is the best way
to make sure that someone re-orders?
Let’s say we try to predict re-ordering,
using feature engineering and random forest.
21. Best predictor of retention?

Denormalised customer: id, delivery_address, traffic_attribution, …, latest_delivery, …, period_start, period_end, reorders_period

latest_failed_delivery - period_start < 3 hours
(Just after a failed delivery)
21 Basically,
- feature engineering means:
try every possible combination of
any variable that you have at given time t
- random forest roughly means:
look for the one or several features that
correspond the most to the target;
in this case: is there an order at time t,
or rather during the study period?
The software will find quite rapidly:
- the difference in time between
the last failed delivery and t, the period start
is a really good signal.
- Said simply: people re-order
just after a failed delivery.
It works great. It’s a great predictor.
So, should you mess up all deliveries
to increase your lifetime value? (No)
22. Solution
Flag same day re-order
Create meta-entity order_intent
22 Do not naively use
the metrics that you are given
from the engineering schema.
Use the metrics that
match the customer’s experience.
If they reorder after a failed order,
that’s the same intent.
Represent that as a single meta-entity.
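A sketch of that order_intent grouping: collapse a failed delivery and its prompt re-order into one meta-entity, so the retention model can’t learn to fail deliveries. The data, field names, and 24-hour window are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative orders: customer "a" re-orders two hours after a failed delivery.
orders = [
    {"id": 1, "customer": "a", "ts": datetime(2018, 9, 1, 9),  "status": "delivery_failed"},
    {"id": 2, "customer": "a", "ts": datetime(2018, 9, 1, 11), "status": "delivered"},
    {"id": 3, "customer": "b", "ts": datetime(2018, 9, 2, 9),  "status": "delivered"},
]

def group_intents(orders, window=timedelta(hours=24)):
    """Group a re-order that follows a failed delivery into the same intent."""
    intents = []
    for order in sorted(orders, key=lambda o: (o["customer"], o["ts"])):
        last = intents[-1] if intents else None
        if (last
                and last[-1]["customer"] == order["customer"]
                and last[-1]["status"] == "delivery_failed"
                and order["ts"] - last[-1]["ts"] <= window):
            last.append(order)  # same intent: a retry after a failure
        else:
            intents.append([order])
    return intents

intents = group_intents(orders)
assert len(intents) == 2  # three raw orders, but only two real purchase intents
```

Counting intents instead of raw orders removes the spurious “failed delivery predicts re-ordering” signal at the source.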
23. 5. Bundling:
Reproduce actual
interaction structure
23 Ok, those were four very naive situations.
Let’s have, for our last example
with an e-commerce website,
something a little more sophisticated.
You looked at frequency and quantity,
and you’ve decided to
recommend a special bulk offer
to your clients.
26. Solution
Detect & understand
schema changes
26 If you end up with
meta-entities containing multiple items:
1. Congratulations: they are better representations.
2. You want to redefine key product metrics:
- Average duration, value?
- Retention
The best way to do that is
to detect, and communicate widely,
any significant schema change.
27. Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
27 ===7 min==
How are we doing so far?
We are more than half-way done.
Let’s talk about video games.
Specifically casual video games like FarmVille!
28. FarmGame
Very casual gaming
28 The oversimplified view of casual games is: they are essentially Click-a-cow:
- if you click on things, you get 1 “gold” point;
- with gold you buy beautiful “objects”.
That sounds trivial,
but it is enough to make it compelling.
You can add two types of pressure:
- Social pressure: to unlock special objects,
you need gestures from other players;
- Time pressure: certain buildings are only available through time-limited
“missions”.
One early data science project is
to set time limits just hard enough
for people to find missions exciting.
29. 6. Really fast farmers:
Flag outliers to
preserve game logic
29 We want to know
how difficult certain missions were.
We know how much gold they require, but:
- players play at different rhythms,
- there are several currencies: gold, crystals,
- suboptimal behaviour, cosmetic changes…
We want to be sure:
how fast can they complete a mission,
given its complexity?
30. How fast are farmers?
• Time played / asset collected
• “Oddly fast farmers”, aka Witches
• ML: predict duration

[Chart: time to complete (hours of play, calendar time) against difficulty of the mission (actions, gold coins, etc.)]
30 You take example data and run a regression.
A regression is
the simplest model there is:
find the line that goes closest to all the points.
>> The problem is:
Some players find some missions too difficult.
They want the shiny Golden Cow, but
it’s easier to hack the game engine to
get all the resources instantly.
(Yes, bored pensioners learned CSS,
race conditions and dependency injections,
just to get a shiny cow in a game.)
That regression, with those outliers, is odd.
Try explaining to the game designer that
the most difficult missions are finished faster.
>> If you remove those outliers,
You get a more reasonable trend.
Any common examples of regressions?
- (Prices, price sensitivity)
- Do you filter for outliers?
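Flagging the “witches” before fitting can be sketched like this. The points and the minimum hours-per-difficulty threshold are made-up assumptions; the regression is a plain least-squares slope through the origin:

```python
# (difficulty, hours to complete) — the last point is a hacked, instant run.
data = [(10, 5), (20, 11), (30, 14), (40, 21), (50, 0.1)]

def flag_outliers(points, min_hours_per_unit=0.1):
    # Assumption: a legitimate player needs at least some time per unit
    # of difficulty; anything faster is flagged as a "witch".
    return [(x, y) for x, y in points if y / x >= min_hours_per_unit]

def fit_slope(points):
    # Least-squares line through the origin: slope = sum(xy) / sum(x^2).
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

clean = flag_outliers(data)
assert len(clean) == 4                     # the hacked run is removed
assert fit_slope(clean) > fit_slope(data)  # cheaters were dragging the slope down
```

Without the flag, the fitted trend says harder missions finish faster; with it, the trend a game designer would expect comes back.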
31. 7. Really slow farmers:
Ordinal metrics to
time effort properly
31 Let’s assume that you manage to
fix your dependency injections in your game.
And you want to tell:
Given how difficult a mission is,
how many players are
going to finish a mission in time.
32. [Chart: fast player, slow player, and mean play trajectories, from mission start to a possible new mission start]
32 It’s a very similar problem,
but you want to look into
how spread out your players’ speeds are.
You want to set the mission duration
so that most people can finish in time.
Then they put social pressure
on the slower players
forcing them to pay hard cash
to finish the mission in time
and get the shiny golden cow too.
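Setting the mission duration from an upper quantile, rather than the mean, can be sketched as follows. The completion times and the 90% target are illustrative assumptions:

```python
# Illustrative completion times in hours; one very slow player drags the mean.
completion_hours = [3, 4, 4, 5, 5, 6, 6, 7, 9, 30]

def quantile(values, q):
    # Nearest-rank quantile on the sorted values (simplest possible estimator).
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

# Mean is 7.9h, inflated by the 30h outlier; the 90th percentile is robust.
limit = quantile(completion_hours, 0.9)
assert limit == 9
# Check the design goal: at least 90% of players finish within the limit.
assert sum(h <= limit for h in completion_hours) >= 0.9 * len(completion_hours)
```

An ordinal metric like a percentile answers the actual product question (“can most people finish?”) where a mean would be pulled around by the slowest tail.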
35. 8. Time intervals &
modelling incentives:
What do you measure?
35 In the two previous examples,
we assumed that we knew
how long it takes for someone
to complete a mission,
but that’s not always easy.
Imagine you want to encourage people
to complete their mission in a timely fashion.
You want to understand what
motivates them to work faster and well.
36. [Diagram: mission timeline with milestones Assignment, Open, First action, Acceptance, Send review, Completion; measured intervals “Assignment to completion” and “Decision to known outcome”; player states Wait, Decide to accept, Work on mission, Review, Inactive, Read review result]
36 You don’t have a single Work interval.
And you and the player
don’t have the same information
at any given time.
If you want to represent people’s motivation,
you need to think about:
- when people make which decision;
- incentives:
if they can delay acceptance
and we reward fast work,
>> they will lie about when they start;
- and finally, measure accordingly.
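“Measure accordingly” might look like this: sum only the active work intervals, instead of taking assignment-to-completion wall-clock time. The event names and timestamps below are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative event log for one mission: the player pauses to do something else.
events = [
    (datetime(2018, 9, 1, 9, 0),   "open"),
    (datetime(2018, 9, 1, 9, 5),   "accept"),      # the decision point
    (datetime(2018, 9, 1, 9, 5),   "start_work"),
    (datetime(2018, 9, 1, 9, 35),  "pause"),       # off playing another game
    (datetime(2018, 9, 1, 14, 0),  "start_work"),
    (datetime(2018, 9, 1, 14, 20), "complete"),
]

def active_time(events):
    """Sum only the start_work -> pause/complete intervals."""
    total, started = timedelta(), None
    for ts, kind in events:
        if kind == "start_work":
            started = ts
        elif kind in ("pause", "complete") and started is not None:
            total += ts - started
            started = None
    return total

wall_clock = events[-1][0] - events[0][0]
assert active_time(events) == timedelta(minutes=50)  # 30 + 20 minutes of work
assert wall_clock > active_time(events)              # idle time is not effort
```

The 5h20m wall clock and the 50 minutes of actual work measure two different things; a model of motivation needs the latter.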
37. [Diagram: the same mission timeline for several players, each interleaving Decide to accept, Work on mission, Wait, Review, Read review result, and Inactive periods differently between Assignment, Open, First action, Acceptance, Send review, and Completion]
37
Don’t assume that,
if they are not responding,
this is because they are idle.
Often, they simply do something else,
like play another game.
38. 9. Forecasting
Crash or Over-activity
(missing & censored data)
38 ===3 min==
So far so good?
Our final example is very common.
Imagine you want to predict
how many players are playing
at the same time.
41. [Chart: concurrent players (0–40) per week, January through April, with a gap in the recorded series]
What if we stop recording?
41 Finally, the most common scenario:
your game is super popular,
but if you hit the server limit,
it crashes the server, it crashes the game.
You log nothing at all
until engineers bring it back up.
How do you make a forecast
based on that data?
How many servers
do you think this game needs?
Sometimes, the best answer to bad data
is to say No.
You want to say:
“All I can do with that is guess.
Once you have good data,
exhaustive enough data to do statistics,
then we can help.”
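A sketch of why the gap matters (the counts and outage hours are made up): zeros recorded during a crash are censoring, not demand, so at the very least they must be excluded before any averaging or forecasting:

```python
# Hourly concurrent-player counts; 0 during a known outage means
# "not recorded", not "nobody played".
counts = [30, 32, 35, 0, 0, 0, 34, 31]  # server crashed for three hours
outage_hours = {3, 4, 5}                 # known from the incident log

observed = [c for i, c in enumerate(counts) if i not in outage_hours]

naive_avg = sum(counts) / len(counts)        # biased low by fake zeros
honest_avg = sum(observed) / len(observed)   # only genuinely observed hours

assert naive_avg < honest_avg
```

Even the “honest” average still understates true demand: during the outage, demand was at least at the server limit. That is why, sometimes, the right answer is to refuse to forecast until the data is exhaustive.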
42. “This is stupid”
42 In summary:
to get data science to work,
there is a first step
that you can easily summarise by
a lot of “this is stupid” moments –
issues that you should have thought about.
Count properly,
classify properly,
measure properly.
After that, the statistics
are actually rather easy.
43. Questions?
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
43 Do you have any questions?
–––––
One question that I was asked last time was:
what ratio of all mistakes can
a system of audits, checks,
and anomaly detection,
with proper services to handle categories, catch?
I’ve worked at companies where,
with constant effort, it was the vast majority.
Those companies were not better staffed,
with more analysts or more senior engineers.
There are companies
that still make a lot of mistakes
but would never let an error go to waste.
If something, even minor, went wrong:
immediately, a retro, an improvement, more checks.
That is remarkable.
That’s what you really want to imitate.