Rubbish in
Rubbish out
Nine real examples of bad data collection

leading to bad machine learning models
Bertil Hatt
Data science
1 Innovation can be amazing.

So amazing that it’s often seen as magical.

But magic isn’t real.

Innovation is often expected
to leapfrog problems.
That can lead to very painful results.

In my experience,

- not having good analytics, or 

- a partial understanding of your product

can lead to bad surprises when innovating.

The explanation goes

Rubbish in, rubbish out.


But that sentence is often dropped

like an absolute truth, a counterpoint

as if it needs no detail, no explanation.
Garbage in, garbage out - 18 September 2018
Bad data is everywhere
2 This presentation is here to give

examples of what actually happens

when you try to apply machine learning

on top of problematic data. 

This presentation is the distillation of

a long, diverse experience

at many different kinds of institutions.

I’ve seen the problems

that I’m going to talk about,

in many companies.



I wanted to make this generic,

and not specific

to one circumstance or another.
Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations

2. FX Conversion & the wealth

of Indonesia

3. Missing category, defaults &
improper likelihood

4. Not delivering & retention

5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:

flag outliers to preserve logic

7. Time to achievements:

really slow buildup

8. Timing & incentives:

what do you measure? 

9. Forecasting with missing

or censored data

3 So I imagined

two very simple fictional companies.

much simpler than the companies

that I used to work for.

One is selling things on-line:

think of it as a generic e-commerce website.

I picked protein powder,

because that’s plain and boring.

The other is a minimalist game studio.

Nothing exotic: a FarmVille clone,

like literally dozens of them.

“This is stupid”
4 However,

even with very basic structure,

there are plenty of things that can go wrong. 

The reaction that I expect

for most of these problems is:

This is stupid.

You obviously shouldn’t do things that way.

I know.

My point today is that this is not hard.
“This is stupid”
• Most errors are stupid & easy to fix once you know them

• Prioritise measured impact, bad data goes undetected

• Bad data silently hurts your decisions & experiments
5 My point is that

all of this is easy to forget,

- either because you rely on junior people, 

- or just because you are tired.

Expect data to be bad:

it will always be in some way.

But bad data isn’t always inherently bad: 

- a lot of times, the data is exact,

just not well documented;

- it can even be well documented

but the analyst, the data scientist

have overlooked its complexity.

- Awareness is what matters;

communications, jokes, veteran stories.

This is my veteran story.
Sell protein powder
Naive e-commerce example
6 ===18 min==

Imagine the simplest possible

e-commerce website. 

You sell bags of protein powder.

Put them in a box, ship it. That’s it.

What kind of data science

can help you do that?
Possible uses

of data science
Customer lifetime spending

• LTV>CPA: Lifetime value to set cost-per-acquisition

• RFM: Recency, Frequency, aMount triggers reactivation

• Recommend product; No variety, so bundling size
7 The only information that you have is

how many times & when a customer orders.

That means a lot of options in marketing:

First a classic:

computing the lifetime value of your users.

- You can use that to estimate

the value of future customers. 

- Once you have that, you can

compare it to your cost-per-acquisition and

decide which channels to invest in.

Another good one is the rhythm of orders:

- Have regular customers stopped ordering? 

- If that’s the case, you know

to whom you should reach out.

Finally, product recommendation.

If all you have is the same protein… meh.

You could decide to bundle

into bigger orders, or long-term orders.

We will see how that will affect your data.
• Customer orders package

• Pays for it (or fails to)

• Deliveries can fail

• Might cancel & reimburse, or

Re-deliver the same order

• International business
8 So, what do we have:

Customers, orders.

The customer should pay for it.

- Payments may or may not work.

- Fulfilment, that’s you, so that should be fine.

- Deliveries might fail.

If a delivery fails,

- Some customers will want to be reimbursed;

- Some will want a new delivery.

And let’s say you have international customers.

So: what does it take, on the code side?
Customer: id, delivery_address, email, …
Order: id, customer_id, quantity, status, …
Payment: id, order_id, currency, status, …
Delivery: id, order_id, address, status, …
Order statuses: Waiting payment, Fulfilling, On route, Delivered, Delivery failed, Cancelled
Currency: id, currency_code, fx_gbp, last_update, …
Price quantity: quantity, price_gbp
Transaction: order_id, timestamp, old_status, new_status, …
9 You should all have a schema in your head.

A very simple schema: Order, customer

>> You probably want to normalise and

have payment attempts and

delivery attempts in their own tables.

>> For payment, you will need

some reference tables: price, exchange rate.

>> The thing is: orders and deliveries

can go through a lot of statuses. 

You probably want to track that too.

Having mutable tables is dangerous.

>> So let’s keep track of all transactions,

or at least the financially relevant ones.

Note: even a simple case has fun questions:

- where do you store the address:

customer or order?

How do you handle address changes, multiple addresses? Postcodes?

Let’s ignore all that and focus on

what we need for analytics and data science.
Customer: id, delivery_address, email, …
Transaction: id, customer_id, price, currency_code, final_status, …
Denormalised customer: id, delivery_address, country, currency, email, nb_total_orders, nb_cancellations, first_delivery_ts, latest_delivery_ts, …
Estimated per customer: lifetime_value, should_reactivate
10 Data folks probably should focus on two tables

- Immutable Customers, and

- All relevant Transactions in their final state.

Let’s say we can aggregate transactions

into a denormalised table of customers

with all relevant data.

That’s feature engineering.

We’ll come back to this

We want to make

estimations per customer (in green):

- How much are they going to spend,

how much profit for us after two years?

That’s lifetime value.

- Have they recently slowed their orders? Could they be persuaded to stay?

That’s the marketing tag.
1. MECE, cancellation,
duplicates &
recommendations
11 ===15 min==

How could that possibly break?

First example!
Are Italian customers that much more valuable?

transaction_id  order_id  country  transaction_type  amount
1               1         UK       Create            10
2               2         Italy    Create            20
3               2         Italy    Cancel            20
4               3         UK       Create            14

country  # customers  LTV
UK       2            12
Italy    1            40
12 You have the transaction table,

and you aggregated it by country,

Here’s a very simple version:

how much each of three customers

in two countries contributes.

I’ll let you look in detail.

Is there anything shocking?

Is the single Italian customer really

worth more than three times as much

as the average British one?

(No: a cancelled order was counted twice!)
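
The double count is easy to reproduce. Here is a minimal sketch using the four transactions from the slide; the `SIGN` mapping and the function name are assumptions made for illustration, and it treats one order as one customer, as the slide does.

```python
from collections import defaultdict

# The four transactions from the slide:
# (transaction_id, order_id, country, transaction_type, amount)
TRANSACTIONS = [
    (1, 1, "UK", "Create", 10),
    (2, 2, "Italy", "Create", 20),
    (3, 2, "Italy", "Cancel", 20),
    (4, 3, "UK", "Create", 14),
]

# Assumption: a cancellation reverses the amount of the order it cancels.
SIGN = {"Create": 1, "Cancel": -1}


def ltv_per_country(transactions, signed):
    """Average revenue per order, by country (one order per customer here)."""
    revenue = defaultdict(float)
    orders = defaultdict(set)
    for _tid, order_id, country, kind, amount in transactions:
        revenue[country] += amount * (SIGN[kind] if signed else 1)
        orders[country].add(order_id)
    return {country: revenue[country] / len(orders[country]) for country in revenue}


naive = ltv_per_country(TRANSACTIONS, signed=False)
netted = ltv_per_country(TRANSACTIONS, signed=True)
print(naive)   # {'UK': 12.0, 'Italy': 40.0}
print(netted)  # {'UK': 12.0, 'Italy': 0.0}
```

Summing every row gives the table on the slide; netting the cancellation against its order shows the Italian customer spent nothing.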
Solution
Check that Total revenue = 

Sum of revenue per user, country
13 How do you avoid having that kind of mistake?

Audit your intermediary tables.

Take your total revenue per country, per user

and compare it to your total revenue overall.

That should also reveal 

more subtle edge cases:

like paid orders never delivered,

partners who left without being paid, etc.
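
Such an audit can be a few lines long. This is only a sketch: `audit_revenue` and its `tolerance` parameter are hypothetical names, and the overall total is assumed to come from an independent source such as finance.

```python
import math


def audit_revenue(total_revenue, revenue_by_group, tolerance=0.01):
    """Return the gap between an aggregated table and the overall total.

    Raises ValueError when the two disagree by more than `tolerance`.
    """
    gap = sum(revenue_by_group.values()) - total_revenue
    if not math.isclose(gap, 0.0, abs_tol=tolerance):
        raise ValueError(f"aggregate is off by {gap:+.2f}")
    return gap


# Figures from the first example: 24 was actually collected (10 + 14),
# but the per-country table adds up to 2 * 12 + 1 * 40 = 64.
try:
    audit_revenue(24, {"UK": 24, "Italy": 40})
except ValueError as error:
    print(error)  # aggregate is off by +40.00
```

Run it after every intermediary table is rebuilt, so a mismatch stops the pipeline instead of reaching a model.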
2. FX Conversion & the
wealth of Indonesia
14 Second really simple example,

very real this one.
Should we break the bank

on growth in Indonesia?

Country    # Customers  Revenue/customer (USD)
EURO       24,541       505
US         21,588       495
UK         8,665        299
Canada     1,547        877
Indonesia  2,452        7,682,540,030
India      9,574        533,333
15 Here is an estimation of how much revenue

we got from different countries.

Anything suspicious?

(The foreign exchange ratio got inverted)
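
One way this happens: a rate stored as "local units per dollar" gets multiplied instead of divided. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the 14,000 rate is a round number chosen for the example rather than a quoted rate, and the factor-of-100 threshold is arbitrary.

```python
from statistics import median


def to_usd(amount_local, local_units_per_usd):
    """Convert a local-currency amount to USD: divide by units per dollar."""
    return amount_local / local_units_per_usd


def to_usd_inverted(amount_local, local_units_per_usd):
    """The bug: multiplying by the rate instead of dividing."""
    return amount_local * local_units_per_usd


def suspicious(revenue_per_customer, factor=100):
    """Flag groups whose value is `factor` times away from the median."""
    typical = median(revenue_per_customer.values())
    return sorted(
        name
        for name, value in revenue_per_customer.items()
        if value > typical * factor or value < typical / factor
    )


ILLUSTRATIVE_UNITS_PER_USD = 14_000  # round number for the example only
print(to_usd(7_000_000, ILLUSTRATIVE_UNITS_PER_USD))           # 500.0
print(to_usd_inverted(7_000_000, ILLUSTRATIVE_UNITS_PER_USD))  # 98000000000

# Revenue per customer exactly as shown on the slide.
slide = {"EURO": 505, "US": 495, "UK": 299, "Canada": 877,
         "Indonesia": 7_682_540_030, "India": 533_333}
print(suspicious(slide))  # ['India', 'Indonesia']
```

A crude plausibility check like `suspicious` is enough here, because an inverted rate is wrong by orders of magnitude.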
Solution
Check that Total revenue = 

Sum of revenue per user, country
16 Same as previously: This is stupid

That’s the point.

Errors that end up breaking data science are

not sophisticated most of the time.

Same solution: audit your data.
3. Missing category,
defaults & improper
likelihood
17 Third error, a little more nuanced.
Denormalised customer: id, delivery_address, traffic_attribution, currency, email, nb_total_orders, nb_cancellations, first_delivery_ts, latest_delivery_ts, …, lifetime_value, should_reactivate
LTV per Country: country_id, country_name, # customer, total_nb_orders, avg order_amount, avg lifetime_value
18 You want to aggregate a metric per country.

You just need to assign

a country to an address.

That’s easy, right?

But: what is a country? Is ‘Wales’ a country?



There should be internal services doing that,

but they might not have the right intent:

Tax, currency, language, traffic attribution,

logistics, culture, business analysis?

- Åland pays in Euros, but VAT is not Finnish.

- French Polynesia has different everything,

but its currency is pegged to the euro.

- The Czech Republic tried to rebrand as Czechia.

- What about Northern Ireland?

For all official business it is in the UK,

but logistically, it’s on the island of Ireland.

Having dedicated services that serve

business-relevant groups or regions really helps.
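
A categorisation service that takes the intent as an argument could look like the sketch below. Everything here is hypothetical (the `Intent` enum, `REGION_BY_INTENT`, `region_for`); only the two Northern Ireland rows come from the example above.

```python
from enum import Enum


class Intent(Enum):
    OFFICIAL = "official"    # tax, legal and other official business
    LOGISTICS = "logistics"  # where the parcel physically travels


# Hypothetical lookup table: only the Northern Ireland rows come from the talk.
REGION_BY_INTENT = {
    ("Northern Ireland", Intent.OFFICIAL): "United Kingdom",
    ("Northern Ireland", Intent.LOGISTICS): "Island of Ireland",
}


def region_for(place, intent):
    """Return the business region for `place`, given why the caller asks."""
    try:
        return REGION_BY_INTENT[(place, intent)]
    except KeyError:
        # Refuse to guess: a silent default is how missing categories spread.
        raise LookupError(
            f"no {intent.value} region defined for {place!r}"
        ) from None


print(region_for("Northern Ireland", Intent.OFFICIAL))   # United Kingdom
print(region_for("Northern Ireland", Intent.LOGISTICS))  # Island of Ireland
```

Failing loudly on an unknown place is a design choice: a default bucket would quietly absorb the missing category and skew every average built on it.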
Solution
Categorisation as a service with

intent: tax, business insight, etc.
19 Same for car types:

we made a mistake when

serving the Recs Algorithm, because:

- one service said Crossovers were ‘SUVs’,

- another service said they were ‘Other cars’.

Traffic attribution is a set of categories,

but those categories might not be clear.

- Some search users type our brand name.

Is that part of the attribution group

“AdWords” or a separate “Brand” one? 

- What about reactivating users via AdWords?

- Let’s not talk about mobile.

All those distinctions are good ideas,

but if you change any of it, tell everyone.
Or better: build & share services

to handle that well for the whole company,

and update those services.

4. Not delivering

leads to high retention.

Maybe not high LTV.
20 ===12 min==

What is the best way

to make sure that someone re-orders?

Let’s say we try to predict re-ordering,

using feature engineering and random forest.

Best predictor of retention?
Denormalised customer: id, delivery_address, traffic_attribution, …, latest_delivery, …, period_start, period_end, reorders_period

period_start - latest_failed_delivery < 3 hours

(Just after a failed delivery)
21 Basically,

- feature engineering means:

try every possible combination of

any variable that you have at given time t

- random forest roughly means:

look for the one or several features that

correspond the most to the target;

in this case, is there an order at time t,

or rather during the study period. 

The software will find quite rapidly:

- the difference in time between

the last failed delivery and t, the period start

is a really good signal.

- Said simply: people re-order

just after a failed delivery. 

It works great. It’s a great predictor.

So, should you mess up all deliveries

to increase your lifetime value? (No)
Solution
Flag same day re-order

Create meta-entity order_intent
22 Do not naively use

the metrics that you are given

from the engineering schema.

Use the metrics that

match customer’s experience.

If they reorder after a failed order,

that’s the same intent.

Represent that as a single meta-entity
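
Here is a sketch of what building that meta-entity could look like. The function name, the tuple layout and the 24-hour window are all assumptions; pick the window from your own data.

```python
from datetime import datetime, timedelta


def group_into_intents(orders, window=timedelta(hours=24)):
    """Group one customer's orders into intents.

    `orders` is a list of (placed_at, delivery_failed_at or None), sorted by
    placed_at. An order placed within `window` after the previous order's
    failed delivery is treated as a retry of the same intent.
    """
    intents = []
    previous_failure = None
    for placed_at, failed_at in orders:
        is_retry = (
            previous_failure is not None
            and timedelta(0) <= placed_at - previous_failure <= window
        )
        if is_retry:
            intents[-1].append(placed_at)
        else:
            intents.append([placed_at])
        previous_failure = failed_at
    return intents


orders = [
    (datetime(2018, 9, 1, 10, 0), datetime(2018, 9, 3, 9, 0)),  # delivery failed
    (datetime(2018, 9, 3, 10, 30), None),                       # re-order 90 minutes later
    (datetime(2018, 10, 5, 10, 0), None),                       # genuine new order
]
intents = group_into_intents(orders)
print(len(orders), "orders,", len(intents), "intents")  # 3 orders, 2 intents
```

Retention computed on intents no longer rewards failed deliveries: the retry is part of the first purchase, not a second one.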
5. Bundling:

Reproduce actual
interaction structure
23 Ok, those were four very naive situations.

Let’s have, for our last example

with an e-commerce website,

something a little more sophisticated.

You looked at frequency and quantity,

and you’ve decided to

recommend a special bulk offer

to your clients.
Customer: id, delivery_address, email, …
Bundle: id, customer_id, quantity, quantity_left, …
Order: id, bundle_id, sticker_price, actual_price, …
Payment: id, order_id, currency, status, …
Delivery: id, order_id, address, status, …
Transaction: order_id, timestamp, old_status, …
…
24 They can order three times five bags,

and receive the parcel every five weeks.

It works well for them,

it saves you delivery costs.

It’s great.

Your data people want to tell whether it works.

But you had to change your schema,

and add that bundle table.

Cool. So payments and delivery are…

Well, they pay once for the bundle,

so you start separating things.

Customer: id, delivery_address, email, …
Bundle: id, customer_id, quantity, quantity_left, …
Order: id, bundle_id, sticker_price, actual_price, …
Transaction: bundle_id, timestamp, old_status, …
Payment: id, bundle_id, …
Delivery: id, order_id, address, status, …
Status changes: delivery_id, timestamp, old_status, …
…
25 The transactions are dependent on payments.

But the status of the delivery

has to be handled by another table.

Or not — it’s confusing.

Can a customer cancel a failed delivery?

For the engineer who maintains the

denormalised customer table,

those changes break assumptions like:

one payment is one delivery.

You will see errors because of:

- Double-counting—like the first example,

- Odd RFM: large spends followed by

no activity, rather than regular spends.

This confusion can lead to

more bad data,

and more bad models.
Solution
Detect & understand

schema changes
26 If you end up with

meta-entities containing multiple items.

1. Congratulations: they are better representations

2. You want to redefine key product metrics:

- Average duration, value?

- Retention

The best way to do that is

to detect, and communicate widely,

any significant schema change.
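
Detection can start very small: snapshot the schema and diff it. The sketch below assumes you can export each table's column names; `schema_diff` is a hypothetical helper, and the snapshots only list the columns visible on the slides, so the elided ones are left out.

```python
def schema_diff(old, new):
    """Compare two {table: set_of_columns} snapshots of a schema."""
    changes = []
    for table in sorted(set(old) | set(new)):
        before, after = old.get(table), new.get(table)
        if before is None:
            changes.append(f"new table {table}")
        elif after is None:
            changes.append(f"dropped table {table}")
        else:
            for column in sorted(after - before):
                changes.append(f"{table}: added {column}")
            for column in sorted(before - after):
                changes.append(f"{table}: removed {column}")
    return changes


before_bundling = {
    "order": {"id", "customer_id", "quantity", "status"},
    "payment": {"id", "order_id", "currency", "status"},
}
after_bundling = {
    "bundle": {"id", "customer_id", "quantity", "quantity_left"},
    "order": {"id", "bundle_id", "sticker_price", "actual_price"},
    "payment": {"id", "bundle_id"},
}
for change in schema_diff(before_bundling, after_bundling):
    print(change)
```

A line such as `payment: removed order_id` is exactly the warning the owner of the denormalised customer table needs before "one payment is one delivery" stops being true.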
Two made-up examples
ProPowder
1. Cancellation, duplicates &
MECE recommendations

2. FX Conversion & the wealth

of Indonesia

3. Missing category, defaults &
improper likelihood

4. Not delivering & retention

5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:

flag outliers to preserve logic

7. Time to achievements:

really slow buildup

8. Timing & incentives:

what do you measure? 

9. Forecasting with missing

or censored data

27 ===7 min==

How are we doing so far?

We are more than halfway done.

Let’s talk about video games.

Specifically casual video games like FarmVille!
FarmGame
Very casual gaming
28 The oversimplified view of casual games is: they are essentially Click-a-cow:

- If you click on things, you get 1 “gold” point;

- with gold you buy beautiful “objects”.

That sounds trivial,

but it is enough to make it compelling.

You can add two types of pressure:

- Social pressure: to unlock special objects,

you need gestures from other players;

- Time pressure: certain buildings are only available through time-limited “missions”.

One early data science project is

to set time limits just hard enough

for people to find missions exciting.
6. Really fast farmers:

Flag outliers to

preserve game logic
29 We want to know

how difficult certain missions were.

We know how much gold they require, but:

- players play at different rhythms,

- there are several currencies: gold, crystals,

- suboptimal behaviour, cosmetic changes…

We want to be sure.

How fast can they complete a mission,

given its complexity.
How fast are farmers?
• Time played /

asset collected

• “Oddly fast farmers”

aka Witches

• ML predict duration
[Chart: time to complete (hours of play, calendar time) against difficulty of the mission (actions, gold coins, etc.)]
30 You need example data to run a regression.

A regression is

the simplest model there is:

find the line that goes closest to all the points.

>> The problem is:

Some players find some missions too difficult.

They want the shiny Golden Cow, but

it’s easier to hack the game engine to

get all the resources instantly.

(Yes, bored pensioners learned CSS,

race conditions and dependency injections,

just to get a shiny cow in a game.)

That regression, with those outliers, is odd.

Try explaining to the game designer that

the most difficult missions are finished faster.

>> If you remove those outliers,

You get a more reasonable trend.

Any common examples of regressions?

- (Prices, price sensitivity)

- Do you filter for outliers?
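
To see the effect, here is a small sketch with made-up numbers. The data points, the `is_witch` rule and its threshold are all invented for illustration; the point is that the flag is kept next to the data rather than the rows being silently deleted.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def is_witch(difficulty, hours, min_hours_per_point=0.5):
    """Flag completions that are implausibly fast for their difficulty."""
    return hours / difficulty < min_hours_per_point


# (difficulty, hours to complete): made-up data.
honest = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
witches = [(5.0, 0.1)] * 4 + [(4.0, 0.1)] * 2  # hacked the hardest missions
everyone = honest + witches

flagged = [(point, is_witch(*point)) for point in everyone]
clean = [point for point, witch in flagged if not witch]

slope_all, _ = fit_line(everyone)
slope_clean, _ = fit_line(clean)
print(round(slope_all, 2), round(slope_clean, 2))  # -0.36 2.0
```

With the witches included, the slope is negative: harder missions appear to be finished faster. Without them, each extra point of difficulty costs two hours.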
7. Really slow farmers:
Ordinal metrics to

time effort properly
31 Let’s assume that you manage to

fix your dependency injections in your game.

And you want to tell:

Given how difficult a mission is,

how many players are

going to finish a mission in time.
[Timeline: a fast player, a slow player and the mean player, from mission start to a possible new mission start]
32 It’s a very similar problem,

but you want to look into

how widely your players’ speed is spread.



You want to set the mission duration

so that most people can finish in time.

Then they put social pressure

on the slower players

forcing them to pay hard cash

to finish the mission in time

and get the shiny golden cow too.
[Timeline: fast player and slow player, with second and third attempts after mission start; new mission delayed]
33 But the thing is: 

The faster players

want to show that they are better,

and try to do the mission twice.

And slower players do too,

just because they can.
[Timeline: fast player and slow player, with second and third attempts after mission start and the new mission delayed, marking the measured mean time to complete and the actual time to complete]
34 If you count the time it took

to complete the mission from

when you launched

to the moment they finished,

You are going to overestimate

because of those successive missions.

So either make the mission one-time only,

Or measure how long it takes the first time.

Or how long between two successful attempts.

Otherwise, you’ll make bad models.

That question,

when to start and when to stop,

is actually not trivial.
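
A toy example shows how far apart the two measures can be. The attempt log and function names below are made up; times are in hours since the mission launched.

```python
from statistics import mean

# (player, attempt_start, attempt_end), in hours since the mission launched.
ATTEMPTS = [
    ("fast", 0, 2), ("fast", 2, 4), ("fast", 4, 6),
    ("slow", 0, 5), ("slow", 5, 10),
]


def mean_time_since_launch(attempts):
    """Naive measure: every completion is timed from the mission launch."""
    return mean(end for _player, _start, end in attempts)


def mean_first_attempt(attempts):
    """Only the first completed attempt of each player counts."""
    first = {}
    for player, start, end in sorted(attempts, key=lambda attempt: attempt[1]):
        first.setdefault(player, end - start)
    return mean(first.values())


print(mean_time_since_launch(ATTEMPTS))  # 5.4
print(mean_first_attempt(ATTEMPTS))      # 3.5
```

Same log, same players: counting repeat attempts from the launch adds almost two hours to the estimate.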
8. Time intervals &
modelling incentives:

What do you measure?
35 In the two previous examples,

we assumed that we knew

how long it takes for someone

to complete a mission,

but that’s not always easy.

Imagine you want to encourage people

to complete their mission in a timely fashion.

You want to understand what

motivates them to work faster and well.
[Timeline of a mission. Events: assignment, open, acceptance, first action, completion, send review. Phases: wait, decide to accept, work on mission, review, inactive, read review result. Intervals compared: assignment to completion, decision to known outcome, work.]
36 You don’t have a single Work interval.

And you and the player

don’t have the same information

at every point in time.

If you want to represent people’s motivation, you need to think about

- when people make which decision,

- (to think about) incentives:

if they can delay acceptance

and if we reward fast work

>> they will lie about when they start.

- And finally,

measure accordingly
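
As a sketch, keep every timestamp and compute several intervals, so you can see which one your incentive distorts. The `MissionLog` class and the definition of "work" as first action to completion are assumptions, as is the idea that a player can act before formally accepting.

```python
from dataclasses import dataclass


@dataclass
class MissionLog:
    """Timestamps in hours; the field names mirror the events on the slide."""
    assignment: float
    opened: float
    acceptance: float
    first_action: float
    completion: float

    def assignment_to_completion(self):
        return self.completion - self.assignment

    def work(self):
        # Time between the first recorded action and completion.
        return self.completion - self.first_action

    def declared_work(self):
        # What you would measure if you trusted the acceptance click.
        return self.completion - self.acceptance


# A player who starts early but delays acceptance to look fast.
log = MissionLog(assignment=0, opened=1, acceptance=20, first_action=2, completion=24)
print(log.declared_work())             # 4
print(log.work())                      # 22
print(log.assignment_to_completion())  # 24
```

If the reward is based on `declared_work`, this player looks five times faster than they really are.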

[Timeline of three overlapping missions, each with the same events (assignment, open, acceptance, first action, completion, send review), the same phases (wait, decide to accept, work on the mission, review, read review result) and inactive stretches]
37
Don’t assume that,

if they are not responding,

this is because they are idle.

Often, they simply do something else,

like play another game.
9. Forecasting

Crash or Over-activity
(missing & censored data)
38 ===3 min==

So far so good?

Our final example is very common.

Imagine you want to predict

how many players are playing

at the same time.
[Chart: players per hour, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 60]
39 And this is

the number of players per hour

in the last four months.

Because of a bug,

there’s a missing week.

Are you going to use this data as is?

Take out the part before the hole?

Because this is cyclical,

you need more than one period.

You have to first

re-build inferred data with a model,

to train another model

on top of that inferred data.
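
The simplest possible "first model" is to copy the same hour from the week before. That is a stand-in chosen for this sketch, not necessarily what you would use in production; `fill_gaps` is a hypothetical helper, and the series is assumed to be hourly with `None` for missing hours.

```python
HOURS_PER_WEEK = 7 * 24


def fill_gaps(series):
    """Fill missing hourly counts (None) with the value one week earlier.

    Returns (values, inferred), where inferred[i] is True for filled points,
    so a downstream model can down-weight or exclude them.
    """
    values = list(series)
    inferred = [False] * len(values)
    for i, value in enumerate(values):
        if value is None:
            if i < HOURS_PER_WEEK or values[i - HOURS_PER_WEEK] is None:
                raise ValueError(f"no earlier week to copy from at hour {i}")
            values[i] = values[i - HOURS_PER_WEEK]
            inferred[i] = True
    return values, inferred


# Three weeks of a made-up daily pattern, with a 50-hour hole in week two.
series = [hour % 24 for hour in range(3 * HOURS_PER_WEEK)]
series[200:250] = [None] * 50
values, inferred = fill_gaps(series)
print(sum(inferred))  # 50
```

Keeping the `inferred` flags matters: the second model should know which points were observed and which were guessed.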
[Chart: players over time, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 40]
What if we couldn’t predict?
40 That isn’t always possible.

Sometimes,

the data that you are missing

is just not replaceable.

Imagine this is

how many players you have at a given time:

Looks like your server is saturating at 37.

For how many players

should you scale your servers?

We can’t really tell precisely,

but more than 37.
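
You can at least detect the ceiling automatically. This is a rough heuristic, not a statistical test: `censoring_report` and its 20% threshold are assumptions, and the counts are made up around the 37 from the example.

```python
def censoring_report(counts, min_share=0.2):
    """Check whether a series looks capped at its maximum.

    Returns (ceiling, share_at_ceiling, looks_censored). When the series is
    censored, the ceiling is only a lower bound on real demand.
    """
    ceiling = max(counts)
    share = sum(1 for count in counts if count == ceiling) / len(counts)
    return ceiling, share, share >= min_share


capped = [20, 31, 37, 37, 37, 35, 37, 28, 37, 37]
print(censoring_report(capped))  # (37, 0.6, True)
```

When `looks_censored` is true, report the ceiling as "at least 37", never as the peak.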
[Chart: players over time, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 40]
What if we stop recording?
41 Finally, the most common scenario:

your game is super popular,

but if you hit the server limit,

it crashes the server, it crashes the game.

You log nothing at all,

until engineers put it back on.

How do you make a forecast

based on that data?

How many servers

do you think this game needs?

Sometimes, the best answer to bad data

is to say No.

You want to say:

“All I can do with that is guess.

Once you have good data,

exhaustive enough data to do statistics, 

then we can help.”
“This is stupid”
42 In summary,

To get data science to work,

there is a first step

that you can easily summarise by

a lot of “this is easy” moments –

issues that you should have thought about.

Count properly,

classify properly,

measure properly.

Statistics after that,

it’s actually rather easy.
Questions?
ProPowder
1. Cancellation, duplicates &
MECE recommendations

2. FX Conversion & the wealth

of Indonesia

3. Missing category, defaults &
improper likelihood

4. Not delivering & retention

5. Bundling: reproduce actual
interaction structure
FarmGame
6. Really fast farmers:

flag outliers to preserve logic

7. Time to achievements:

really slow buildup

8. Timing & incentives:

what do you measure? 

9. Forecasting with missing

or censored data

43 Do you have any questions?

–––––

One question that I was asked last time was:

What share of all mistakes can

a system of audits, checks

and anomaly detection,

with proper services to handle categories,

actually catch?

I’ve worked at companies where,

with constant effort, this was the vast majority.

Those companies are not better staffed,

with more analysts or more senior engineers.

They are companies

that still make a lot of mistakes,

but would never let an error go to waste.

If something, even minor, went wrong:

immediately, retro, improvement, more checks.

That is remarkable.

That’s what you really want to imitate.



代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 

Recently uploaded (20)

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 

Garbage in, garbage out

  • 1. Rubbish in, rubbish out: nine real examples of bad data collection leading to bad machine learning models. Bertil Hatt, Data science.
 Innovation can be amazing. So amazing that it’s often seen as magical. But magic isn’t real.
 Innovation is often expected to leapfrog problems. That can lead to very painful results.
 In my experience, not having good analytics, or having only a partial understanding of your product, can lead to bad surprises when innovating. The explanation goes: rubbish in, rubbish out.
 But that sentence is often dropped like an absolute truth, a counterpoint, as if it needs no detail, no explanation. Garbage in, garbage out - 18 September 2018
  • 2. Bad data is everywhere
 This presentation is here to give examples of what actually happens when you try to apply machine learning on top of problematic data. It is the distillation of a long, diverse experience at many different kinds of institutions. I’ve seen the problems that I’m going to talk about in many companies.
 I wanted to make this generic, and not specific to one circumstance or another.
  • 3. Two made-up examples
 ProPowder
 1. Cancellation, duplicates & MECE recommendations
 2. FX conversion & the wealth of Indonesia
 3. Missing category, defaults & improper likelihood
 4. Not delivering & retention
 5. Bundling: reproduce actual interaction structure
 FarmGame
 6. Really fast farmers: flag outliers to preserve logic
 7. Time to achievements: really slow buildup
 8. Timing & incentives: what do you measure?
 9. Forecasting with missing or censored data
 So I imagined two very simple fictional companies, much simpler than the companies that I used to work for. One is selling things online: think of it as a generic e-commerce website. I picked protein powder, because that’s plain and boring. The other is a minimalist game studio. Nothing exotic: a FarmVille clone, like literally dozens of them.
  • 4. “This is stupid”
 However, even with a very basic structure, there are plenty of things that can go wrong. The reaction that I expect for most of these problems is: This is stupid. You obviously shouldn’t do things that way. I know. My point today is that this is not hard.
  • 5. “This is stupid”
    • Most errors are stupid & easy to fix once you know them
    • Prioritise measured impact, bad data goes undetected
    • Bad data silently hurts your decisions & experiments
 My point is that all of this is easy to forget, either because you rely on junior people, or just because you are tired. Expect data to be bad: it always will be in some way. But bad data isn’t always inherently bad:
 - a lot of the time, the data is exact, just not well documented;
 - it can even be well documented, but the analyst or the data scientist has overlooked its complexity.
 Awareness is what matters: communications, jokes, veteran stories. This is my veteran story.
  • 6. Sell protein powder: naive e-commerce example
 ===18 min==
 Imagine the simplest possible e-commerce website. You sell bags of protein powder. Put them in a box, ship it. That’s it. What kind of data science can help you do that?
  • 7. Possible uses of data science: customer lifetime spending
    • LTV>CPA: Lifetime value to set cost-per-acquisition
    • RFM: Recency, Frequency, aMount triggers reactivation
    • Recommend product; no variety, so bundling size
 The only information that you have is how many times and when a customer orders. That still leaves a lot of options in marketing.
 First, a classic: computing the lifetime value of your users.
 - You can use that to estimate the value of future customers.
 - Once you have that, you can compare it to your cost-per-acquisition and decide which channels to invest in.
 Another good one is the rhythm of orders:
 - Have regular customers stopped ordering?
 - If that’s the case, you know to whom you should reach out.
 Finally, product recommendation. If all you have is the same protein… meh. You could decide to bundle into bigger orders, or long-term orders. We will see how that will affect your data.
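The "rhythm of orders" idea on slide 7 can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the function name `should_reactivate` borrows the column name from a later slide, and the `slack` multiplier of 2 is an arbitrary assumption.

```python
from datetime import date

def should_reactivate(order_dates, today, slack=2.0):
    """Flag a customer whose silence is much longer than their usual gap between orders."""
    if len(order_dates) < 2:
        return False  # not enough history to know their rhythm
    ordered = sorted(order_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    usual_gap = sum(gaps) / len(gaps)
    silence = (today - ordered[-1]).days
    # `slack` is a hypothetical tuning knob: how many usual gaps we tolerate.
    return silence > slack * usual_gap

# A hypothetical customer who ordered roughly monthly, then stopped.
regular = [date(2018, 1, 1), date(2018, 2, 1), date(2018, 3, 1)]
print(should_reactivate(regular, date(2018, 3, 20)))  # still within their rhythm
print(should_reactivate(regular, date(2018, 6, 1)))   # silent for about three usual gaps
```

The rule only uses order timestamps, which is all the information the naive shop has.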
  • 8.
    • Customer orders package
    • Pays for it (or fails to)
    • Deliveries can fail
    • Might cancel & reimburse, or re-deliver the same order
    • International business
 So, what do we have? Customers, orders. The customer should pay for it.
 - Payments may or may not work.
 - Fulfilment, that’s you, so that should be fine.
 - Deliveries might fail.
 If a delivery fails:
 - Some customers will want to be reimbursed;
 - Some will want a new delivery.
 And let’s say you have international customers. So: what does it take, on the code side?
  • 9. Schema
 Customer: id, delivery_address, email, …
 Order: id, customer_id, quantity, status, …
 Payment: id, order_id, currency, status, …
 Delivery: id, order_id, address, status, …
 Currency: id, currency_code, fx_gbp, last_update, …
 Price: quantity, price_gbp
 Transaction: order_id, timestamp, old_status, new_status, …
 Order statuses: Waiting payment, Fulfilling, On route, Delivered, Delivery failed, Cancelled
 You should all have a schema in your head. A very simple schema: order, customer.
 >> You probably want to normalise and put payment attempts and delivery attempts into their own tables.
 >> For payment, you will need some reference tables: price, exchange rate.
 >> The thing is: orders and deliveries can go through a lot of statuses. You probably want to track that too. Having mutable tables is dangerous.
 >> So let’s keep track of all transactions, or at least the financially relevant ones.
 Note: even a simple case has fun questions. Where do you store the address: customer or order? How do you handle address changes, multiple addresses? Postcodes? Let’s ignore all that and focus on what we need for analytics and data science.
  • 10. Tables for data science
 Customer: id, delivery_address, email, …
 Transaction: id, customer_id, price, currency_code, final_status, …
 Denormalised customer: id, delivery_address, country, currency, email, nb_total_orders, nb_cancellations, fist_delivery_ts, latest_delivery_ts, …, lifetime_value, should_reactivate
 Data folks should probably focus on two tables:
 - immutable Customers, and
 - all relevant Transactions in their final state.
 Let’s say we can aggregate transactions into a denormalised table of customers with all relevant data. That’s feature engineering. We’ll come back to this.
 We want to make estimations per customer (in green on the slide):
 - How much are they going to spend, how much profit for us after two years? That’s lifetime value.
 - Have they recently slowed their orders? Could they be persuaded to stay? That’s the marketing tag.
  • 11. 1. MECE, cancellation, duplicates & recommendations
 ===15 min==
 How could that possibly break? First example!
  • 12. Are Italian customers that much more valuable?
 transaction_id  order_id  country  transaction_type  amount
 1               1         UK       Create            10
 2               2         Italy    Create            20
 3               2         Italy    Cancel            20
 4               3         UK       Create            14

 country  # customers  LTV
 UK       2            12
 Italy    1            40
 You have the transaction table, and you aggregated it by country. Here’s a very simple version: how much each of three customers in two countries contributes. I’ll let you look in detail. Is there anything shocking? Is the single Italian customer really worth more than three times as much as the average British one? (No: a cancelled order was counted twice!)
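The arithmetic on slide 12 can be reproduced with a short Python sketch. The rows are the ones shown on the slide; the function `ltv_per_country` and its `signed` flag are hypothetical names introduced here to contrast the buggy aggregation with a corrected one.

```python
from collections import defaultdict

transactions = [
    # (transaction_id, order_id, country, transaction_type, amount)
    (1, 1, "UK", "Create", 10),
    (2, 2, "Italy", "Create", 20),
    (3, 2, "Italy", "Cancel", 20),
    (4, 3, "UK", "Create", 14),
]
customers_per_country = {"UK": 2, "Italy": 1}

def ltv_per_country(rows, signed):
    """Average revenue per customer; `signed=True` subtracts cancellations."""
    totals = defaultdict(float)
    for _tid, _oid, country, kind, amount in rows:
        if signed and kind == "Cancel":
            amount = -amount  # a cancellation reverses the original order
        totals[country] += amount
    return {c: totals[c] / n for c, n in customers_per_country.items()}

naive = ltv_per_country(transactions, signed=False)  # adds the cancel as revenue
fixed = ltv_per_country(transactions, signed=True)   # nets the cancel out
print(naive)
print(fixed)
```

The naive version reproduces the table on the slide (12 for the UK, 40 for Italy); once the cancellation is netted out, the Italian customer is worth nothing so far.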
  • 13. Solution: check that total revenue = sum of revenue per user, per country
 How do you avoid that kind of mistake? Audit your intermediary tables. Take your revenue per country and per user, sum it, and compare it to your total revenue overall. That should also reveal more subtle edge cases, like paid orders never delivered, partners who left without being paid, etc.
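A reconciliation audit like the one on slide 13 might look like the sketch below. It assumes the overall total comes from an independent source (for example a payment ledger); if both numbers come from the same buggy aggregation, they will agree and the check proves nothing. The function name and tolerance are assumptions.

```python
def audit_revenue(total_revenue, revenue_by_group, tolerance=0.01):
    """Raise if the per-group breakdown does not add back up to the total."""
    breakdown = sum(revenue_by_group.values())
    if abs(breakdown - total_revenue) > tolerance:
        raise ValueError(
            f"breakdown {breakdown} differs from total {total_revenue}"
        )
    return True

# Hypothetical ledger total for the three orders of the earlier example:
# 10 + 14 kept, the Italian order of 20 refunded.
ledger_total = 24
audit_revenue(ledger_total, {"UK": 24, "Italy": 0})       # passes
try:
    audit_revenue(ledger_total, {"UK": 24, "Italy": 40})  # double-counted cancel
except ValueError as error:
    print("audit failed:", error)
```

Running such a check after every build of an intermediary table is cheap, and it fails loudly instead of letting a wrong number reach a model.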
  • 14. 2. FX conversion & the wealth of Indonesia
 Second really simple example, and a very real one.
  • 15. Should we drop the bank on growth in Indonesia?
 Region     # Customers  Revenue/customer (USD)
 EURO       24,541       505
 US         21,588       495
 UK         8,665        299
 Canada     1,547        877
 Indonesia  2,452        7,682,540,030
 India      9,574        533,333
 Here is an estimate of how much revenue we got from different countries. Anything suspicious? (The foreign exchange rate got inverted.)
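The inverted exchange rate on slide 15 is easy to reproduce, and a crude sanity check catches it. In this Python sketch the rate of 15,000 rupiah per dollar and the basket size are illustrative assumptions, not real quotes; the revenue figures are the ones on the slide; the function names and the factor of 100 are hypothetical.

```python
def to_usd(amount_local, local_per_usd):
    """Convert to USD when the rate is quoted as local units per 1 USD."""
    return amount_local / local_per_usd

def to_usd_inverted(amount_local, local_per_usd):
    """The bug: multiplying by a rate that should divide."""
    return amount_local * local_per_usd

def suspicious_groups(revenue_per_customer, factor=100):
    """Groups whose value exceeds `factor` times the (upper) median group."""
    values = sorted(revenue_per_customer.values())
    median = values[len(values) // 2]
    return [g for g, v in revenue_per_customer.items() if v > factor * median]

# Hypothetical rate and basket, for illustration only.
IDR_PER_USD = 15_000
basket_idr = 7_500_000
good = to_usd(basket_idr, IDR_PER_USD)
bad = to_usd_inverted(basket_idr, IDR_PER_USD)
print(good, bad)

# Revenue per customer as shown on the slide.
slide = {"EURO": 505, "US": 495, "UK": 299, "Canada": 877,
         "Indonesia": 7_682_540_030, "India": 533_333}
flagged = suspicious_groups(slide)
print(flagged)
```

A check this blunt will not find small errors, but it does flag both Indonesia and India in the table above.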
  • 16. Solution: check that total revenue = sum of revenue per user, per country
 Same as previously: “This is stupid.” That’s the point. Most of the time, the errors that end up breaking data science are not sophisticated. Same solution: audit your data.
  • 17. 3. Missing category, defaults & improper likelihood
 Third error, a little more nuanced.
  • 18. LTV per country
 Denormalised customer: id, delivery_address, traffic_attribution, currency, email, nb_total_orders, nb_cancellations, fist_delivery_ts, latest_delivery_ts, …, lifetime_value, should_reactivate
 LTV per Country: country_id, country_name, # customer, total_nb_orders, avg order_amount, avg lifetime_value
 You want to aggregate a metric per country. You just need to assign a country to an address. That’s easy, right? But: what is a country? Is ‘Wales’ a country?
 There should be internal services doing that, but they might not have the right intent: tax, currency, language, traffic attribution, logistics, culture, business analysis?
 - Åland pays in euros, but VAT is not Finnish.
 - French Polynesia has different everything, but the currency is pegged to the euro.
 - The Czech Republic tried to rebrand as Czechia.
 - What about Northern Ireland? For all official business it is in the UK, but logistically, it’s on the island of Ireland.
 Having dedicated services that serve business-relevant groups or regions really helps.
  • 19. Solution: categorisation as a service with intent (tax, business insight, etc.)
 Same for car types: we made a mistake when serving the Recs Algorithm, because:
 - one service said crossovers were ‘SUVs’,
 - another service said they were ‘Other cars’.
 Traffic attribution is a set of categories, but those categories might not be clear.
 - Some search users type our brand name. Is that part of the attribution group “AdWords” or a separate “Brand” one?
 - What about reactivating users via AdWords?
 - Let’s not talk about mobile.
 All those distinctions are good ideas, but if you change any of them, tell everyone. Or better: build and share services that handle this well for the whole company, and keep those services up to date.
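One way to read "categorisation as a service with intent" is that a lookup never answers without being told why the caller is asking, and never falls back to a silent default. The Python sketch below is a toy illustration under that reading; the table, its entries and the `categorise` function are hypothetical.

```python
REGION_BY_INTENT = {
    # (place, intent) -> group; illustrative entries only
    ("Northern Ireland", "tax"): "UK",
    ("Northern Ireland", "logistics"): "Island of Ireland",
    ("Wales", "tax"): "UK",
    ("Wales", "logistics"): "Great Britain",
}

def categorise(place, intent):
    """Return the group for `place` under a stated `intent`; fail loudly if unknown."""
    try:
        return REGION_BY_INTENT[(place, intent)]
    except KeyError:
        # No default bucket: a missing category should be noticed, not absorbed.
        raise LookupError(f"no {intent!r} category defined for {place!r}") from None

print(categorise("Northern Ireland", "tax"))
print(categorise("Northern Ireland", "logistics"))
```

Because every caller goes through the same function, changing a grouping changes it for everyone at once, which is the point made on the slide.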
  • 20. 4. Not delivering leads to high retention. Maybe not high LTV.
 ===12 min==
 What is the best way to make sure that someone re-orders? Let’s say we try to predict re-ordering, using feature engineering and a random forest.
  • 21. Best predictor of retention?
 Denormalised customer: id, delivery_address, traffic_attribution, …, latest_delivery, …, period_start, period_end, reorders_period
 latest_failed_delivery - period_start < 3 hours (just after a failed delivery)
 Basically:
 - feature engineering means: try every possible combination of any variable that you have at a given time t;
 - random forest roughly means: look for the one or several features that correspond most closely to the target; in this case, is there an order at time t, or rather during the study period.
 The software will quite rapidly find that the difference in time between the last failed delivery and t, the period start, is a really good signal. Said simply: people re-order just after a failed delivery. It works great. It’s a great predictor. So, should you mess up all deliveries to increase your lifetime value? (No.)
  • 22. Solution: flag same-day re-orders, create a meta-entity order_intent
 Do not naively use the metrics that you are given from the engineering schema. Use the metrics that match the customer’s experience. If they reorder after a failed order, that’s the same intent. Represent that as a single meta-entity.
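A possible shape for the `order_intent` meta-entity of slide 22, as a Python sketch for a single customer. The field names (`placed_at`, `failed_at`) and the 24-hour window are assumptions made for the example; the slide only says "same day".

```python
from datetime import datetime, timedelta

def group_into_intents(orders, window=timedelta(hours=24)):
    """Merge an order placed shortly after the previous order's failed delivery
    into the same intent. Returns a list of lists of order ids."""
    intents = []
    last_failure = None  # when the previous order's delivery failed, if it did
    for order in sorted(orders, key=lambda o: o["placed_at"]):
        is_retry = (
            last_failure is not None
            and timedelta(0) <= order["placed_at"] - last_failure <= window
        )
        if is_retry:
            intents[-1].append(order["id"])  # same intent as the failed order
        else:
            intents.append([order["id"]])    # a genuinely new intent
        last_failure = order["failed_at"]
    return intents

# Hypothetical history: order 2 is placed 90 minutes after order 1 failed.
orders = [
    {"id": 1, "placed_at": datetime(2018, 9, 1, 10, 0), "failed_at": datetime(2018, 9, 3, 15, 0)},
    {"id": 2, "placed_at": datetime(2018, 9, 3, 16, 30), "failed_at": None},
    {"id": 3, "placed_at": datetime(2018, 10, 8, 9, 0), "failed_at": None},
]
intents = group_into_intents(orders)
print(intents)
```

Counting retention on intents rather than raw orders removes the "failed delivery predicts re-order" artefact, since the retry no longer looks like a second purchase.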
  • 23. 5. Bundling: reproduce actual interaction structure
 OK, those were four very naive situations. For our last e-commerce example, let’s have something a little more sophisticated. You looked at frequency and quantity, and you’ve decided to recommend a special bulk offer to your clients.
  • 24. Schema with bundles
 Customer: id, delivery_address, email, …
 Bundle: id, customer_id, quantity, quantity_left, …
 Order: id, bundle_id, sticker_price, actual_price, …
 Payment: id, order_id, currency, status, …
 Delivery: id, order_id, address, status, …
 Transaction: order_id, timestamp, old_status, …
 They can order three times five bags, and receive a parcel every five weeks. It works well for them, and it saves you delivery costs. It’s great. Your data people want to tell if it works. But you had to change your schema, and add that bundle table. Cool. So payments and delivery are… Well, they pay once for the bundle, so you start separating things.
  • 25. Schema with bundles, payments and deliveries separated
 Customer: id, delivery_address, email, …
 Bundle: id, customer_id, quantity, quantity_left, …
 Order: id, bundle_id, sticker_price, actual_price, …
 Transaction: bundle_id, timestamp, old_status, …
 Payment: id, bundle_id, …
 Delivery: id, order_id, address, status, …
 Status changes: delivery_id, timestamp, old_status, …
 The transactions are dependent on payments. But the status of the delivery has to be handled by another table. Or not: it’s confusing. Can a customer cancel a failed delivery?
 For the engineer who maintains the denormalised customer table, those changes break assumptions like “one payment is one delivery”. You will see errors because of:
 - double-counting, like the first example;
 - odd RFM: large spends followed by no activity, rather than regular spends.
 This confusion can lead to more bad data, and more bad models.
  • 26. Solution: detect & understand schema changes
 If you end up with meta-entities containing multiple items:
 1. Congratulations: they are better representations.
 2. You want to redefine key product metrics: average duration, value? Retention?
 The best way to do that is to inventory and widely communicate any significant schema change.
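Detecting a schema change can start as simply as comparing two snapshots of table and column names. The Python sketch below uses the tables from the slides before and after bundling (abridged); how the snapshots are obtained from a real database is left out, and `schema_diff` is a hypothetical helper.

```python
def schema_diff(old, new):
    """Compare two {table: set(columns)} snapshots and list what changed."""
    changes = []
    for table in sorted(set(old) | set(new)):
        before, after = old.get(table, set()), new.get(table, set())
        if table not in old:
            changes.append(f"new table {table}")
        elif table not in new:
            changes.append(f"dropped table {table}")
        for col in sorted(after - before):
            changes.append(f"{table}: added {col}")
        for col in sorted(before - after):
            changes.append(f"{table}: removed {col}")
    return changes

old = {
    "Order": {"id", "customer_id", "quantity", "status"},
    "Payment": {"id", "order_id", "currency", "status"},
}
new = {
    "Bundle": {"id", "customer_id", "quantity", "quantity_left"},
    "Order": {"id", "bundle_id", "sticker_price", "actual_price"},
    "Payment": {"id", "bundle_id"},
}
changes = schema_diff(old, new)
for line in changes:
    print(line)
```

A line such as "Payment: removed order_id" is exactly the kind of change that silently breaks a "one payment is one delivery" assumption downstream.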
  • 27. Two made-up examples
 ProPowder
 1. Cancellation, duplicates & MECE recommendations
 2. FX conversion & the wealth of Indonesia
 3. Missing category, defaults & improper likelihood
 4. Not delivering & retention
 5. Bundling: reproduce actual interaction structure
 FarmGame
 6. Really fast farmers: flag outliers to preserve logic
 7. Time to achievements: really slow buildup
 8. Timing & incentives: what do you measure?
 9. Forecasting with missing or censored data
 ===7 min==
 How are we doing so far? We are more than halfway done. Let’s talk about video games. Specifically, casual video games like FarmVille!
  • 28. FarmGame: very casual gaming
 The oversimplified view of casual games is that they are essentially Click-a-cow:
 - if you click on things, you get 1 “gold” point;
 - with gold you buy beautiful “objects”.
 That sounds trivial, but it is enough to make it compelling. You can add two types of pressure:
 - Social pressure: to unlock special objects, you need gestures from other players;
 - Time pressure: certain buildings are only available through time-limited “missions”.
 One early data science project is to set time limits just hard enough for people to find missions exciting.
  • 29. 6. Really fast farmers: flag outliers to preserve game logic
 We want to know how difficult certain missions were. We know how much gold they require, but:
 - players play at different rhythms,
 - there are several currencies: gold, crystals,
 - suboptimal behaviour, cosmetic changes…
 We want to be sure: how fast can they complete a mission, given its complexity?
  • 30. How fast are farmers?
    • Time played / asset collected
    • “Oddly fast farmers” aka Witches
    • ML predict duration
 [Scatter plot. x-axis: difficulty of the mission (actions, gold coins, etc.); y-axis: time to complete (hours of play, calendar time)]
 You need example data, and you run a regression. A regression is the simplest model there is: find the line that goes closest to all the points.
 >> The problem is: some players find some missions too difficult. They want the shiny Golden Cow, but it’s easier to hack the game engine to get all the resources instantly. (Yes, bored pensioners learned CSS, race conditions and dependency injections, just to get a shiny cow in a game.) That regression, with those outliers, is odd. Try explaining to the game designer that the most difficult missions are finished faster.
 >> If you remove those outliers, you get a more reasonable trend.
 Any common examples of regressions? (Prices, price sensitivity.) Do you filter for outliers?
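The effect of the "witches" on the regression of slide 30 can be seen with a least-squares fit written out by hand. All the numbers in this Python sketch are made up for illustration, and the threshold of 0.5 hours per unit of difficulty is an arbitrary assumption about what a human player could plausibly achieve.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def drop_witches(points, min_hours_per_difficulty=0.5):
    """Keep only players whose pace is humanly plausible (hypothetical threshold)."""
    return [(d, h) for d, h in points if h / d >= min_hours_per_difficulty]

# (difficulty, hours to complete), invented data
honest = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0), (5, 10.0)]
witches = [(9, 0.1), (10, 0.1)]  # hardest missions, finished almost instantly

slope_all, _ = fit_line(honest + witches)
slope_clean, _ = fit_line(drop_witches(honest + witches))
print(slope_all)    # negative: harder missions look faster
print(slope_clean)  # positive: harder missions take longer
```

Flagging the outliers rather than deleting them keeps the record available for the security team while protecting the difficulty model.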
  • 31. 7. Really slow farmers: ordinal metrics to time effort properly
 Let’s assume that you manage to fix the dependency injections in your game. Now you want to tell, given how difficult a mission is, how many players are going to finish it in time.
  • 32. [Timeline: fast player, slow player, mean player; mission start; new mission start?]
 It’s a very similar problem, but you want to look into how widely your players’ speed is spread.
 You want to set the mission duration so that most people can finish in time. Then they put social pressure on the slower players, forcing them to pay hard cash to finish the mission in time and get the shiny golden cow too.
  • 33. [Timeline: fast player with second and third attempts, slow player with a second attempt; mission start; new mission delayed]
 But the thing is: the faster players want to show that they are better, and try to do the mission twice. And slower players do too, just because they can.
  • 34. [Timeline: fast player with second and third attempts, slow player with a second attempt; mission start; new mission delayed; measured mean time to complete vs actual time to complete]
 If you count the time it took to complete the mission from when you launched it to the moment they finished, you are going to overestimate because of those successive attempts. So either make the mission one-time only, or measure how long it takes the first time, or how long passes between two successful attempts. Otherwise, you’ll make bad models. That question, when to start and when to stop, is actually not trivial.
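The two measurements suggested on slide 34 can be written down directly. In this Python sketch, times are hours since the mission launched and the completion log is invented; the function names are hypothetical.

```python
def time_to_first_completion(mission_start, completions):
    """Hours from launch to the first completion, or None if never completed."""
    if not completions:
        return None
    return min(completions) - mission_start

def attempt_durations(mission_start, completions):
    """Hours each successive successful attempt took, counted from the previous one."""
    durations = []
    previous = mission_start
    for finished in sorted(completions):
        durations.append(finished - previous)
        previous = finished
    return durations

# Hypothetical fast player who completed the same mission three times.
fast_player = [10, 22, 35]
naive = max(fast_player) - 0  # launch to last completion: overestimates
print(naive)
print(time_to_first_completion(0, fast_player))
print(attempt_durations(0, fast_player))
```

Here the naive measure says 35 hours, while every individual attempt took between 10 and 13.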
  • 35. 8. Time intervals & modelling incentives: what do you measure?
 In the two previous examples, we assumed that we knew how long it takes for someone to complete a mission, but that’s not always easy. Imagine you want to encourage people to complete their mission in a timely fashion. You want to understand what motivates them to work faster and well.
  • 36. [Timeline diagram. Events: assignment, open, acceptance, first action, completion, send review. Intervals: decide to accept, work on mission, review, wait, inactive, read review result. Candidate measures: assignment to completion, decision to known outcome, work]
 You don’t have a single Work interval. And you and the player do not have the same information at every point in time. If you want to represent people’s motivation, you need to:
 - think about when people make which decision;
 - think about incentives: if they can delay acceptance and if we reward fast work >> they will lie about when they start;
 - and finally, measure accordingly.
  • 37. [Timeline diagram, several players. Events: assignment, open, acceptance, first action, completion, send review. Intervals: decide to accept, work on the mission, review, wait, inactive, read review result]
 Don’t assume that, if they are not responding, it is because they are idle. Often, they are simply doing something else, like playing another game.
  • 38. 9. Forecasting: crash or over-activity (missing & censored data)
 ===3 min==
 So far so good? Our final example is very common. Imagine you want to predict how many players are playing at the same time.
  • 39. [Chart: players per hour, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 60]
 This is the number of players per hour over the last four months. Because of a bug, there’s a missing week. Are you going to use this data as is? Take out the part before the hole? Because this is cyclical, you need more than one period. You first have to rebuild the missing data with a model, then train another model on top of that inferred data.
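One very simple "model" for rebuilding the missing week of slide 39 is to fill each gap with the average of the observed values at the same position in the cycle. The Python sketch below does that on a toy series with a period of 3; for hourly data with a weekly cycle the period would be 168. This is a stand-in chosen for brevity, not the approach used in the talk.

```python
def fill_gaps(series, period):
    """Replace None with the mean of observed values at the same phase of the cycle."""
    by_phase = {}
    for i, value in enumerate(series):
        if value is not None:
            by_phase.setdefault(i % period, []).append(value)
    filled = []
    for i, value in enumerate(series):
        if value is None:
            same_phase = by_phase.get(i % period)
            # Stay None when that phase was never observed at all.
            value = sum(same_phase) / len(same_phase) if same_phase else None
        filled.append(value)
    return filled

# Toy cyclical series with a hole in the middle cycle.
observed = [10, 20, 30, 12, None, None, 14, 24, 34]
filled = fill_gaps(observed, period=3)
print(filled)
```

It is worth keeping a flag on the inferred points, so that the second model, or whoever audits it, can tell measured values from reconstructed ones.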
  • 40. What if we couldn’t predict?
 [Chart: players over time, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 40]
 That isn’t always possible. Sometimes, the data that you are missing is just not replaceable. Imagine this is how many players you have at a given time. It looks like your server is saturating at 37. For how many players should you scale your servers? We can’t really tell precisely, but more than 37.
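Before forecasting from a series like the one on slide 40, it helps to check how often the observations sit exactly at their maximum, which hints at a hard cap (censoring) rather than real demand. In this Python sketch the counts are invented around the cap of 37 mentioned on the slide, and `censored_share` is a hypothetical helper.

```python
def censored_share(counts, tolerance=0):
    """Return the observed maximum and the share of observations at (or near) it."""
    cap = max(counts)
    at_cap = sum(1 for c in counts if c >= cap - tolerance)
    return cap, at_cap / len(counts)

# Hypothetical concurrent-player counts that plateau at 37.
players = [12, 25, 37, 37, 37, 30, 37, 18]
cap, share = censored_share(players)
print(cap, share)
```

When half the observations equal the maximum, the maximum is a limit of the system, and true demand is only known to be "at least" that value.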
  • 41. What if we stop recording?
 [Chart: players over time, weekly ticks from 01 Jan to 30 Apr, y-axis from 0 to 40]
 Finally, the most common scenario: your game is super popular, but if you hit the server limit, it crashes the server, and it crashes the game. You log nothing at all until engineers put it back on. How do you make a forecast based on that data? How many servers do you think this game needs? Sometimes, the best answer to bad data is to say No. You want to say: “All I can do with that is guess. Once you have good data, data exhaustive enough to do statistics, then we can help.”
  • 42. “This is stupid”
 In summary: to get data science to work, there is a first step that you can easily summarise as a lot of “this is easy” moments, issues that you should have thought about. Count properly, classify properly, measure properly. After that, the statistics are actually rather easy.
  • 43. Questions?
 ProPowder
 1. Cancellation, duplicates & MECE recommendations
 2. FX conversion & the wealth of Indonesia
 3. Missing category, defaults & improper likelihood
 4. Not delivering & retention
 5. Bundling: reproduce actual interaction structure
 FarmGame
 6. Really fast farmers: flag outliers to preserve logic
 7. Time to achievements: really slow buildup
 8. Timing & incentives: what do you measure?
 9. Forecasting with missing or censored data
 Do you have any questions?
 One question that I was asked last time was: what share of all mistakes can a system of audits, checks and anomaly detection, with proper services to handle categories, actually catch?
 I’ve worked at companies where, with constant effort, this was the vast majority. Those companies are not better, with more analysts or more senior engineers. They are companies that still make a lot of mistakes, but that would never let an error go to waste. If something, even minor, went wrong: immediately, a retro, an improvement, more checks. That is remarkable. That’s what you really want to imitate.