The document discusses common issues that can arise from bad or problematic data when applying machine learning. It provides nine examples of real problems the author has encountered, ranging from simple issues like double-counting cancelled orders to more complex issues involving schema changes not being properly communicated. The key message is that even simple systems can encounter data problems, and it is important to audit data for errors or inconsistencies, detect schema changes, and clearly define metrics to avoid "garbage in, garbage out" situations that produce bad machine learning models.
Garbage in, garbage out
1. Rubbish in
Rubbish out
Nine real examples of bad data collection
leading to bad machine learning models
Bertil Hatt
Data science
1 Innovation can be amazing.
So amazing that it’s often seen as magical.
But magic isn’t real.
Innovation is often expected
to leapfrog problems.
That can lead to very painful results.
In my experience,
- not having good analytics, or
- a partial understanding of your product
can lead to bad surprises when innovating.
The explanation goes
Rubbish in, rubbish out.
But that sentence is often dropped
like an absolute truth, a counterpoint
as if it needs no detail, no explanation.
Garbage in, garbage out - 18 September 2018
2. Bad data is everywhere
2 This presentation is here to give
examples of what actually happens
when you try to apply machine learning
on top of problematic data.
This presentation is the distillation of
a long, diverse experience
at many different kinds of institutions.
I’ve seen the problems
that I’m going to talk about,
in many companies.
I wanted to make this generic,
and not specific
to one circumstance or another.
3. Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
3 So I imagined
two very simple fictional companies,
much simpler than the companies
that I used to work for.
One is selling things on-line:
think of it as a generic e-commerce website.
I picked protein powder,
because that’s plain and boring.
The other is a minimalist game studio.
Nothing exotic: a FarmVille clone,
like literally dozens of them.
4. “This is stupid”
4 However,
even with very basic structure,
there are plenty of things that can go wrong.
The reaction that I expect
for most of these problems is:
This is stupid.
You obviously shouldn’t do things that way.
I know.
My point today is that this is not hard.
5. “This is stupid”
• Most errors are stupid & easy to fix once you know them
• If you prioritise only measured impact, bad data goes undetected
• Bad data silently hurts your decisions & experiments
5 My point is that
all of this is easy to forget,
- either because you rely on junior people,
- or just because you are tired.
Expect data to be bad:
it will always be in some way.
But bad data isn’t always inherently bad:
- a lot of the time, the data is exact,
just not well documented;
- it can even be well documented,
but the analyst or data scientist
has overlooked its complexity.
Awareness is what matters:
communication, jokes, veteran stories.
This is my veteran story.
6. Sell protein powder
Naive e-commerce example
6 ===18 min==
Imagine the simplest possible
e-commerce website.
You sell bags of protein powder.
Put them in a box, ship it. That’s it.
What kind of data science
can help you do that?
7. Possible uses
of data science
Customer lifetime spending
• LTV>CPA: Lifetime value to set cost-per-acquisition
• RFM: Recency, Frequency, aMount triggers reactivation
• Recommend product; No variety, so bundling size
7 The only information that you have is
how many times & when a customer orders.
That means a lot of options in marketing:
First a classic:
computing the lifetime value of your users.
- You can use that to estimate
the value of future customers.
- Once you have that, you can
compare it to your cost-per-acquisition and
decide which channels to invest in.
Another good one is the rhythm of orders:
- Have regular customers stopped ordering?
- If that’s the case, you know
to whom you should reach out.
Finally, product recommendation.
If all you have is the same protein… meh.
You could decide to bundle
into bigger orders, or long-term orders.
We will see how that will affect your data.
8. • Customer orders package
• Pays for it (or fails to)
• Deliveries can fail
• Might cancel & reimburse, or
Re-deliver the same order
• International business
8 So, what do we have:
Customers, orders.
The customer should pay for it.
- Payments may or may not work.
- Fulfilment, that’s you, so that should be fine.
- Deliveries might fail.
If a delivery fails,
- Some customers will want to be reimbursed;
- Some will want a new delivery.
And let’s say you have international customers.
So: what does it take, on the code side?
9. Customer: id, delivery_address, email, …
Order: id, customer_id, quantity, status, …
Payment: id, order_id, currency, status, …
Delivery: id, order_id, address, status, …
Currency: id, currency_code, fx_gbp, last_update, …
Price: quantity, price_gbp
Transaction: order_id, timestamp, old_status, new_status, …
Order statuses:
• Waiting payment
• Fulfilling
• On route
• Delivered
• Delivery failed
• Cancelled
9 You should all have a schema in your head.
A very simple schema: Order, customer
>> You probably want to normalise and
have payment attempts and
deliveries attempts into their own tables.
>> For payment, you will need
some reference tables: price, exchange rate.
>> The thing is: orders and deliveries
can go through a lot of statuses.
You probably want to track that too.
Having mutable tables is dangerous.
>> So let’s keep track of all transactions,
at least the financially relevant ones.
Note: even a simple case has fun questions:
- where do you store the address:
customer or order?
How do you handle address changes, multiple addresses? Postcodes?
Let’s ignore all that and focus on
what we need for analytics and data science.
11. 1. MECE, cancellation,
duplicates &
recommendations
11 ===15 min==
How could that possibly break?
First example!
12. Are Italian customers that much more valuable?

transaction_id | order_id | country | transaction_type | amount
1              | 1        | UK      | Create           | 10
2              | 2        | Italy   | Create           | 20
3              | 2        | Italy   | Cancel           | 20
4              | 3        | UK      | Create           | 14

country | # customers | LTV
UK      | 2           | 12
Italy   | 1           | 40
12 You have the transaction table,
and you aggregated it by country.
Here’s a very simple version:
how much each of three customers
in two countries contributes.
I’ll let you look at the details.
Is there anything shocking?
Is the single Italian customer really
worth three and a half times more
than the average British one?
(No: a cancelled order was counted twice!)
13. Solution
Check that Total revenue =
Sum of revenue per user, country
13 How do you avoid that kind of mistake?
Audit your intermediary tables.
Take your total revenue per country, per user
and compare it to your total revenue overall.
That should also reveal
more subtle edge cases:
like paid orders never delivered,
partners who left without being paid, etc.
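That audit can be sketched in a few lines of plain Python. The figures mirror the slide’s transaction table; the fix is netting cancellations, then reconciling the aggregate against the global total:

```python
# Minimal audit sketch (figures from the slide): cancellations must be netted
# out, and the per-country sum must reconcile with the overall total.
transactions = [
    {"order_id": 1, "country": "UK",    "type": "Create", "amount": 10},
    {"order_id": 2, "country": "Italy", "type": "Create", "amount": 20},
    {"order_id": 2, "country": "Italy", "type": "Cancel", "amount": 20},
    {"order_id": 3, "country": "UK",    "type": "Create", "amount": 14},
]

def net_revenue_by_country(rows):
    totals = {}
    for row in rows:
        sign = 1 if row["type"] == "Create" else -1  # a Cancel reverses the charge
        totals[row["country"]] = totals.get(row["country"], 0) + sign * row["amount"]
    return totals

by_country = net_revenue_by_country(transactions)
total = sum(r["amount"] if r["type"] == "Create" else -r["amount"]
            for r in transactions)

# The audit itself: the aggregated view must sum back to the global figure.
assert sum(by_country.values()) == total
```

With the netting applied, Italy’s cancelled order contributes 0 instead of the misleading 40, and both views agree on the same total.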
14. 2. FX Conversion & the
wealth of Indonesia
14 Second really simple example,
very real this one.
15. Should we drop the bank on growth in Indonesia?

country   | # customers | revenue/customer (USD)
EURO      | 24,541      | 505
US        | 21,588      | 495
UK        | 8,665       | 299
Canada    | 1,547       | 877
Indonesia | 2,452       | 7,682,540,030
India     | 9,574       | 533,333
15 Here is an estimation of how much revenue
we got from different countries.
Anything suspicious?
(The foreign exchange ratio got inverted)
16. Solution
Check that Total revenue =
Sum of revenue per user, country
16 Same as previously: “This is stupid.”
That’s the point.
Errors that end up breaking data science are
not sophisticated most of the time.
Same solution: audit your data.
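The same audit idea catches an inverted exchange rate. A hedged sketch: the rates, amounts, and plausibility band below are made-up assumptions, but an inverted IDR rate instantly blows past any sane bound:

```python
# Illustrative FX sanity check: rates and figures are made up, not real quotes.
fx_to_usd = {"IDR": 1 / 14000, "INR": 1 / 70, "GBP": 1.3}

def to_usd(amount, currency, rates):
    return amount * rates[currency]

# An inverted rate (14000 instead of 1/14000) makes Indonesia look absurdly rich.
inverted = to_usd(1_000_000, "IDR", {"IDR": 14000})
correct = to_usd(1_000_000, "IDR", fx_to_usd)

def plausible(rev_per_customer, low=1, high=100_000):
    # Bounds are an assumption: no single customer plausibly spends billions.
    return low <= rev_per_customer <= high

assert not plausible(inverted)  # the audit flags the inversion
assert plausible(correct)
```

A bound check like this would have flagged the 7.6-billion-dollar Indonesian customer before any model saw it.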
17. 3. Missing category,
defaults & improper
likelihood
17 Third error, a little more nuanced.
18. Denormalised customer: id, delivery_address, traffic_attribution, currency, email, nb_total_orders, nb_cancellations, first_delivery_ts, latest_delivery_ts, …, lifetime_value, should_reactivate

LTV per country: country_id, country_name, # customers, total_nb_orders, avg order_amount, avg lifetime_value
18 You want to aggregate a metric per country.
You just need to assign
a country to an address.
That’s easy, right?
But: what is a country? Is ‘Wales’ a country?
There should be internal services doing that,
but they might not have the right intent:
Tax, currency, language, traffic attribution,
logistics, culture, business analysis?
- Åland pays in Euros, but VAT is not Finnish.
- French Polynesia has different everything
but, the currency is pegged to the Euro.
- The Czech Republic tried to rebrand as Czechia.
- What about Northern Ireland?
For all official business it is in the UK,
but logistically, it’s on the island of Ireland.
Having dedicated services that serve
business-relevant groups or regions really helps.
19. Solution
Categorisation as a service with
intent: tax, business insight, etc.
19 Same for car types:
we made a mistake when
serving the Recs Algorithm, because:
- one service said Crossovers were ‘SUVs’,
- another service said they were ‘Other cars’.
Traffic attribution is a set of categories,
but those categories might not be clear.
- Some search users type our brand name.
Is that part of the attribution group
“AdWords” or a separate “Brand” one?
- What about reactivating users via AdWords?
- Let’s not talk about mobile.
All those distinctions are good ideas,
but if you change any of it, tell everyone.
Or better: build & share services
to handle that well for the whole company,
and update those services.
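One way to sketch “categorisation as a service with intent”: the same place maps to different groups depending on why you ask, so the intent is an explicit parameter. The mappings below are illustrative assumptions, not authoritative geography:

```python
# Sketch of categorisation with explicit intent. The point: no consumer
# silently inherits another team's definition of "country".
REGION_BY_INTENT = {
    "tax":       {"Northern Ireland": "UK",      "Åland": "Åland"},
    "logistics": {"Northern Ireland": "Ireland", "Åland": "Finland"},
    "currency":  {"Northern Ireland": "GBP",     "Åland": "EUR"},
}

def categorise(place, intent):
    """One shared service; callers must state which grouping they need."""
    return REGION_BY_INTENT[intent][place]

# The same place, three different (all correct) answers:
assert categorise("Northern Ireland", "tax") == "UK"
assert categorise("Northern Ireland", "logistics") == "Ireland"
assert categorise("Åland", "currency") == "EUR"
```

Because the grouping lives in one service, a change (say, Czechia) is made once and every consumer sees it, instead of each analyst maintaining a private, diverging mapping.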
20. 4. Not delivering
leads to high retention.
Maybe not high LTV.
20 ===12 min==
What is the best way
to make sure that someone re-orders?
Let’s say we try to predict re-ordering,
using feature engineering and random forest.
21. Best predictor of retention?

Denormalised customer: id, delivery_address, traffic_attribution, …, latest_delivery, …, period_start, period_end, reorders_period

latest_failed_delivery - period_start < 3 hours
(Just after a failed delivery)
21 Basically,
- feature engineering means:
try every possible combination of
any variable that you have at given time t
- random forest roughly means:
look for the one or several features that
correspond the most to the target;
in this case: is there an order at time t,
or rather during the study period?
The software will find quite rapidly:
- the difference in time between
the last failed delivery and t, the period start
is a really good signal.
- Said simply: people re-order
just after a failed delivery.
It works great. It’s a great predictor.
So, should you mess up all deliveries
to increase your lifetime value? (No)
22. Solution
Flag same day re-order
Create meta-entity order_intent
22 Do not naively use
the metrics that you are given
from the engineering schema.
Use the metrics that
match the customer’s experience.
If they reorder after a failed order,
that’s the same intent.
Represent that as a single meta-entity.
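A sketch of that order_intent grouping: collapse a failed delivery and its prompt re-order into one meta-entity, so the retention model can’t learn to fail deliveries. The data, field names, and 24-hour window are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative orders: customer "a" re-orders two hours after a failed delivery.
orders = [
    {"id": 1, "customer": "a", "ts": datetime(2018, 9, 1, 9),  "status": "delivery_failed"},
    {"id": 2, "customer": "a", "ts": datetime(2018, 9, 1, 11), "status": "delivered"},
    {"id": 3, "customer": "b", "ts": datetime(2018, 9, 2, 9),  "status": "delivered"},
]

def group_intents(orders, window=timedelta(hours=24)):
    """Group a re-order that follows a failed delivery into the same intent."""
    intents = []
    for order in sorted(orders, key=lambda o: (o["customer"], o["ts"])):
        last = intents[-1] if intents else None
        if (last
                and last[-1]["customer"] == order["customer"]
                and last[-1]["status"] == "delivery_failed"
                and order["ts"] - last[-1]["ts"] <= window):
            last.append(order)  # same intent: a retry after a failure
        else:
            intents.append([order])
    return intents

intents = group_intents(orders)
assert len(intents) == 2  # three raw orders, but only two real purchase intents
```

Counting intents instead of raw orders removes the spurious “failed delivery predicts re-ordering” signal at the source.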
23. 5. Bundling:
Reproduce actual
interaction structure
23 Ok, those were four very naive situations.
Let’s have, for our last example
with an e-commerce website,
something a little more sophisticated.
You looked at frequency and quantity,
and you’ve decided to
recommend a special bulk offer
to your clients.
26. Solution
Detect & understand
schema changes
26 If you end up with
meta-entities containing multiple items:
1. Congratulations: they are better representations.
2. You want to redefine key product metrics:
- Average duration, value?
- Retention
The best way to do that is
to detect, and communicate widely,
any significant schema change.
27. Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
27 ===7 min==
How are we doing so far?
We are more than half-way done.
Let’s talk about video games.
Specifically casual video games like FarmVille!
28. FarmGame
Very casual gaming
28 The oversimplified view of casual games is: they are essentially Click-a-cow:
- if you click on things, you get 1 “gold” point;
- with gold you buy beautiful “objects”.
That sounds trivial,
but it is enough to make it compelling.
You can add two types of pressure:
- Social pressure: to unlock special objects,
you need gestures from other players;
- Time pressure: certain buildings are only available through time-limited
“missions”.
One early data science project is
to set time limits just hard enough
for people to find missions exciting.
29. 6. Really fast farmers:
Flag outliers to
preserve game logic
29 We want to know
how difficult certain missions were.
We know how much gold they require, but:
- players play at different rhythms,
- there are several currencies: gold, crystals,
- suboptimal behaviour, cosmetic changes…
We want to be sure:
how fast can they complete a mission,
given its complexity?
30. How fast are farmers?
• Time played / asset collected
• “Oddly fast farmers”, aka Witches
• ML: predict duration

[Chart: time to complete (hours of play, calendar time) against difficulty of the mission (actions, gold coins, etc.)]
30 You take example data and run a regression.
A regression is
the simplest model there is:
find the line that goes closest to all the points.
>> The problem is:
Some players find some missions too difficult.
They want the shiny Golden Cow, but
it’s easier to hack the game engine to
get all the resources instantly.
(Yes, bored pensioners learned CSS,
race conditions and dependency injections,
just to get a shiny cow in a game.)
That regression, with those outliers, is odd.
Try explaining to the game designer that
the most difficult missions are finished faster.
>> If you remove those outliers,
You get a more reasonable trend.
Any common examples of regressions?
- (Prices, price sensitivity)
- Do you filter for outliers?
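Flagging the “witches” before fitting can be sketched like this. The points and the minimum hours-per-difficulty threshold are made-up assumptions; the regression is a plain least-squares slope through the origin:

```python
# (difficulty, hours to complete) — the last point is a hacked, instant run.
data = [(10, 5), (20, 11), (30, 14), (40, 21), (50, 0.1)]

def flag_outliers(points, min_hours_per_unit=0.1):
    # Assumption: a legitimate player needs at least some time per unit
    # of difficulty; anything faster is flagged as a "witch".
    return [(x, y) for x, y in points if y / x >= min_hours_per_unit]

def fit_slope(points):
    # Least-squares line through the origin: slope = sum(xy) / sum(x^2).
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

clean = flag_outliers(data)
assert len(clean) == 4                     # the hacked run is removed
assert fit_slope(clean) > fit_slope(data)  # cheaters were dragging the slope down
```

Without the flag, the fitted trend says harder missions finish faster; with it, the trend a game designer would expect comes back.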
31. 7. Really slow farmers:
Ordinal metrics to
time effort properly
31 Let’s assume that you manage to
fix your dependency injections in your game.
And you want to tell:
Given how difficult a mission is,
how many players are
going to finish a mission in time.
32. [Chart: fast player, slow player, and mean play trajectories, from mission start to a possible new mission start]
32 It’s a very similar problem,
but you want to look into
how spread out your players’ speeds are.
You want to set the mission duration
so that most people can finish in time.
Then they put social pressure
on the slower players
forcing them to pay hard cash
to finish the mission in time
and get the shiny golden cow too.
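Setting the mission duration from an upper quantile, rather than the mean, can be sketched as follows. The completion times and the 90% target are illustrative assumptions:

```python
# Illustrative completion times in hours; one very slow player drags the mean.
completion_hours = [3, 4, 4, 5, 5, 6, 6, 7, 9, 30]

def quantile(values, q):
    # Nearest-rank quantile on the sorted values (simplest possible estimator).
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

# Mean is 7.9h, inflated by the 30h outlier; the 90th percentile is robust.
limit = quantile(completion_hours, 0.9)
assert limit == 9
# Check the design goal: at least 90% of players finish within the limit.
assert sum(h <= limit for h in completion_hours) >= 0.9 * len(completion_hours)
```

An ordinal metric like a percentile answers the actual product question (“can most people finish?”) where a mean would be pulled around by the slowest tail.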
35. 8. Time intervals &
modelling incentives:
What do you measure?
35 In the two previous examples,
we assumed that we knew
how long it takes for someone
to complete a mission,
but that’s not always easy.
Imagine you want to encourage people
to complete their mission in a timely fashion.
You want to understand what
motivates them to work faster and well.
36. [Diagram: mission timeline with milestones Assignment, Open, First action, Acceptance, Send review, Completion; measured intervals “Assignment to completion” and “Decision to known outcome”; player states Wait, Decide to accept, Work on mission, Review, Inactive, Read review result]
36 You don’t have a single Work interval.
And you and the player
don’t have the same information
at any given time.
If you want to represent people’s motivation,
you need to think about:
- when people make which decision;
- incentives:
if they can delay acceptance
and we reward fast work,
>> they will lie about when they start;
- and finally, measure accordingly.
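“Measure accordingly” might look like this: sum only the active work intervals, instead of taking assignment-to-completion wall-clock time. The event names and timestamps below are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative event log for one mission: the player pauses to do something else.
events = [
    (datetime(2018, 9, 1, 9, 0),   "open"),
    (datetime(2018, 9, 1, 9, 5),   "accept"),      # the decision point
    (datetime(2018, 9, 1, 9, 5),   "start_work"),
    (datetime(2018, 9, 1, 9, 35),  "pause"),       # off playing another game
    (datetime(2018, 9, 1, 14, 0),  "start_work"),
    (datetime(2018, 9, 1, 14, 20), "complete"),
]

def active_time(events):
    """Sum only the start_work -> pause/complete intervals."""
    total, started = timedelta(), None
    for ts, kind in events:
        if kind == "start_work":
            started = ts
        elif kind in ("pause", "complete") and started is not None:
            total += ts - started
            started = None
    return total

wall_clock = events[-1][0] - events[0][0]
assert active_time(events) == timedelta(minutes=50)  # 30 + 20 minutes of work
assert wall_clock > active_time(events)              # idle time is not effort
```

The 5h20m wall clock and the 50 minutes of actual work measure two different things; a model of motivation needs the latter.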
37. [Diagram: the same mission timeline for several players, each interleaving Decide to accept, Work on mission, Wait, Review, Read review result, and Inactive periods differently between Assignment, Open, First action, Acceptance, Send review, and Completion]
37
Don’t assume that,
if they are not responding,
this is because they are idle.
Often, they simply do something else,
like play another game.
38. 9. Forecasting
Crash or Over-activity
(missing & censored data)
38 ===3 min==
So far so good?
Our final example is very common.
Imagine you want to predict
how many players are playing
at the same time.
41. [Chart: concurrent players (0–40) per week, January through April, with a gap in the recorded series]
What if we stop recording?
41 Finally, the most common scenario:
your game is super popular,
but if you hit the server limit,
it crashes the server, it crashes the game.
You log nothing at all
until engineers bring it back up.
How do you make a forecast
based on that data?
How many servers
do you think this game needs?
Sometimes, the best answer to bad data
is to say No.
You want to say:
“All I can do with that is guess.
Once you have good data,
exhaustive enough data to do statistics,
then we can help.”
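A sketch of why the gap matters (the counts and outage hours are made up): zeros recorded during a crash are censoring, not demand, so at the very least they must be excluded before any averaging or forecasting:

```python
# Hourly concurrent-player counts; 0 during a known outage means
# "not recorded", not "nobody played".
counts = [30, 32, 35, 0, 0, 0, 34, 31]  # server crashed for three hours
outage_hours = {3, 4, 5}                 # known from the incident log

observed = [c for i, c in enumerate(counts) if i not in outage_hours]

naive_avg = sum(counts) / len(counts)        # biased low by fake zeros
honest_avg = sum(observed) / len(observed)   # only genuinely observed hours

assert naive_avg < honest_avg
```

Even the “honest” average still understates true demand: during the outage, demand was at least at the server limit. That is why, sometimes, the right answer is to refuse to forecast until the data is exhaustive.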
42. “This is stupid”
42 In summary:
to get data science to work,
there is a first step
that you can easily summarise by
a lot of “this is stupid” moments –
issues that you should have thought about.
Count properly,
classify properly,
measure properly.
After that, the statistics
are actually rather easy.
43. Questions?
ProPowder
1. Cancellation, duplicates &
MECE recommendations
2. FX Conversion & the wealth
of Indonesia
3. Missing category, defaults &
improper likelihood
4. Not delivering & retention
5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:
flag outliers to preserve logic
7. Time to achievements:
really slow buildup
8. Timing & incentives:
what do you measure?
9. Forecasting with missing
or censored data
43 Do you have any questions?
–––––
One question that I was asked last time was:
what ratio of all mistakes can
a system of audits, checks,
and anomaly detection,
with proper services to handle categories, catch?
I’ve worked at companies where,
with constant effort, it was the vast majority.
Those companies were not better staffed,
with more analysts or more senior engineers.
There are companies
that still make a lot of mistakes
but would never let an error go to waste.
If something, even minor, went wrong:
immediately, a retro, an improvement, more checks.
That is remarkable.
That’s what you really want to imitate.