Growth accounting &
Time-based data structures
How to get re-usable datasets (1)
Bertil Hatt
Data science @ RentalCars, Booking.com
Previously on MancML…
Three structures
1. Separating growth into In vs. Out
2. Maturity level of departures
3. Retention lozenge
Three stories
1. Unemployment US vs. France
2. How to fix a casual video game
3. Great startup vs. bonfire
No sophisticated models
How to structure data
1. Accounting for growth
Separating In vs. Out
Similar unemployment pattern
[Figure: two line charts, "Unemployment US" and "Unemployment France". Y-axis: Unemployment (M), 0 to 6. X-axis: quarters Q1 to Q4 over four years.]
Numbers are made up; for real ones, see Pierre Cahuc and André Zylberberg, Labor Economics, The MIT Press, 2004.
Very different issues
[Figure: two charts, "Unemployment US" and "Unemployment France". Series: Unemployment (M), Lost job, Found job. Y-axis: 0 to 6. X-axis: quarters Q1 to Q4 over four years.]
Numbers are made up; for real ones, see Pierre Cahuc and André Zylberberg, Labor Economics, The MIT Press, 2004.
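The point of the two slides above is a stock-and-flow identity: the unemployment level of a period equals the previous level, plus people who lost a job, minus people who found one. A minimal sketch, assuming made-up flows in millions and hypothetical names (`unemployment_series`, `high_churn`, `low_churn`); it does not say which country is which:

```python
def unemployment_series(start, lost_job, found_job):
    """Roll the stock forward: level[t] = level[t-1] + lost[t] - found[t]."""
    levels = [start]
    for lost, found in zip(lost_job, found_job):
        levels.append(levels[-1] + lost - found)
    return levels


# Made-up flows, in millions per quarter (illustration only).
high_churn = unemployment_series(5.0, lost_job=[4.0, 4.0, 4.0], found_job=[4.5, 4.5, 4.5])
low_churn = unemployment_series(5.0, lost_job=[0.5, 0.5, 0.5], found_job=[1.0, 1.0, 1.0])

# Same curve for the stock, very different flows underneath.
print(high_churn)
print(low_churn)
print(high_churn == low_churn)
```

Both series decline at the same pace, so a chart of the level alone cannot tell them apart; only the separate In and Out flows can.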
How to build a detailed reference table
Period (day, week, month) | User ID | Present or active this period | Present or active last period | Last active (period) | Status
2018-01-01 | 12345 | TRUE | NULL | NULL | New
2018-01-08 | 12345 | TRUE | TRUE | 2018-01-01 | Active
2018-01-15 | 12345 | FALSE | TRUE | 2018-01-08 | Lapsed
2018-01-22 | 12345 | FALSE | FALSE | 2018-01-08 | Lost
2018-01-29 | 12345 | TRUE | FALSE | 2018-01-08 | Re-activated
…
SELECT … AS period, id,
  CASE WHEN …,
  LAG(…) OVER w,
  MAX(…) OVER w,
  CASE WHEN …
GROUP BY period, id
WINDOW w AS …
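The status logic in the table above can be sketched in plain Python. This is an illustration under assumptions, not the query from the slide: the function name `build_status_rows` is hypothetical, and the thresholds (Lapsed after one missed period, Lost after two or more) are read off the sample rows and are arbitrary choices.

```python
from datetime import date, timedelta


def build_status_rows(user_id, active_periods, periods):
    """One row per period for one user, starting at their first active period."""
    active = set(active_periods)
    rows = []
    last_active = None  # most recent active period strictly before the current one
    prev_active = None  # activity flag of the previous period (None on the first row)
    for period in periods:
        is_active = period in active
        if last_active is None and not is_active:
            continue  # the user has not been acquired yet
        if is_active:
            if last_active is None:
                status = "New"
            elif prev_active:
                status = "Active"
            else:
                status = "Re-activated"
        else:
            status = "Lapsed" if prev_active else "Lost"
        rows.append((period, user_id, is_active, prev_active, last_active, status))
        if is_active:
            last_active = period
        prev_active = is_active
    return rows


# Same pattern as the sample user: active in weeks 1, 2 and 5.
weeks = [date(2018, 1, 1) + timedelta(weeks=i) for i in range(5)]
active_weeks = [weeks[0], weeks[1], weeks[4]]
rows = build_status_rows(12345, active_weeks, weeks)
for row in rows:
    print(row)
```

Every period after acquisition yields exactly one row and one status, which is what lets the counts add up later.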
How to build an aggregated reference table
Period (day, week, month) | Status | Count
2018-01-01 | New | 17 854
2018-01-08 | Active | 78 442
2018-01-15 | Lapsed | 12 325
2018-01-22 | Lost | 10 548
2018-01-29 | Re-activated | 2 428
Additional columns on the slide, shown without values: Last active, Source, Last action
SELECT … AS period, status, COUNT(*)
GROUP BY period, status
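A minimal sketch of the aggregation, plus the accounting check that makes the table trustworthy: active users this period should equal active users last period, plus New, plus Re-activated, minus Lapsed. The rows and helper names (`detailed`, `active_total`, `status_count`) are hypothetical, and the statuses follow the one-missed-period definition of Lapsed assumed from the sample table.

```python
from collections import Counter

# Hypothetical detailed rows: (period, user_id, status).
detailed = [
    ("2018-01-01", 1, "New"), ("2018-01-01", 2, "New"),
    ("2018-01-08", 1, "Active"), ("2018-01-08", 2, "Lapsed"), ("2018-01-08", 3, "New"),
    ("2018-01-15", 1, "Active"), ("2018-01-15", 2, "Lost"), ("2018-01-15", 3, "Active"),
]

# Equivalent of GROUP BY period, status with a row count.
counts = Counter((period, status) for period, _user, status in detailed)

ACTIVE_STATUSES = {"New", "Active", "Re-activated"}


def status_count(period, status):
    return counts.get((period, status), 0)


def active_total(period):
    return sum(n for (p, s), n in counts.items() if p == period and s in ACTIVE_STATUSES)


# Growth accounting: everything must add up exactly from one period to the next.
periods = sorted({p for p, _status in counts})
for prev, cur in zip(periods, periods[1:]):
    expected = (active_total(prev) + status_count(cur, "New")
                + status_count(cur, "Re-activated") - status_count(cur, "Lapsed"))
    assert expected == active_total(cur), (cur, expected, active_total(cur))

for key in sorted(counts):
    print(key, counts[key])
```

Adding more dimensions (source, last action) only means extending the grouping key; the same identity should hold within each slice.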
2. Maturity levels of departures
When are people leaving
Distinct user statuses allow better insight
[Figure: chart "Players", series Daily active, y-axis 0 to 6.]
[Figure: chart "Players funnel", series Lost and Active, y-axis 0% to 120%.]
Numbers are made-up; they look nothing like a project I worked on.
[Figure: chart "Players", series Daily active, New, Active and Lost, y-axis 0 to 6.]
3. Seniority triangle &
Retention lozenge
How to represent users’ experience
Cohort
• a group of people with a shared characteristic (Cambridge Eng. Dict.)
• a group of people who did something all during the same period (Me)
• Don’t focus exclusively on registration: first order, or third, re-activation, etc.
Triangle of user experience
[Diagram: triangle. X-axis: Time of the action. Y-axis: Time of the first action or registration. Labels: Cohort, Promotion, Now.]
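The speaker notes for this slide suggest a sanity check: the corner of the plane where an action happens before the user's first action or registration should be empty, so count what falls in it. A minimal sketch with hypothetical rows and names (`actions`, `impossible`):

```python
from datetime import date

# Hypothetical rows: (user_id, first_action_date, action_date).
actions = [
    (1, date(2018, 1, 1), date(2018, 1, 1)),
    (1, date(2018, 1, 1), date(2018, 1, 9)),
    (2, date(2018, 1, 8), date(2018, 1, 3)),  # action dated before the first action
]

# Anything in the "time travel" corner points at a data problem.
impossible = [row for row in actions if row[2] < row[1]]
print(len(impossible), "impossible action(s)")
```

A non-zero count usually means a definition or a join is off (for example a registration date that was overwritten), which is worth fixing before any cohort analysis.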
Retention lozenge
[Diagram: lozenge inside the triangle. X-axis: Time of the action. Y-axis: Time of the first action or registration. Labels: Too recently acquired; Now; 8 weeks; After 8 weeks.]
Avoid retention bias
[Figure: chart "Cohort retention", series Last week and 8th week, y-axis 0% to 25%.]
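The lozenge can be expressed as a filter on (cohort week, action week) pairs: drop cohorts acquired less than eight weeks ago, and drop activity that happened more than eight weeks after acquisition. A minimal sketch; the names (`in_lozenge`, `HORIZON`, `NOW`), the sample rows, and the definition of "retained" as being active exactly in the eighth week are all assumptions made for illustration.

```python
HORIZON = 8  # weeks; an arbitrary threshold, as the talk stresses
NOW = 16     # current week number in this made-up example


def in_lozenge(cohort_week, action_week, now_week=NOW, horizon=HORIZON):
    """Keep an action only if its cohort is old enough and the action is early enough."""
    age = action_week - cohort_week
    old_enough = cohort_week + horizon <= now_week  # not "too recently acquired"
    early_enough = 0 <= age <= horizon              # not "after 8 weeks"
    return old_enough and early_enough


# Hypothetical rows: (user_id, cohort_week, action_week).
actions = [
    ("a", 0, 0), ("a", 0, 3), ("a", 0, 8), ("a", 0, 15),
    ("b", 2, 2), ("b", 2, 4), ("b", 2, 14),  # back at age 12: known, but outside the lozenge
    ("c", 12, 12), ("c", 12, 13),            # cohort too recent at week 16
]

kept = [row for row in actions if in_lozenge(row[1], row[2])]
eligible_users = {user for user, _cohort, _week in kept}
retained = {user for user, cohort, week in kept if week - cohort == HORIZON}
rate = len(retained) / len(eligible_users)
print(sorted(eligible_users), sorted(retained), rate)
```

Counting user "b" as retained because of the week 14 visit would use information that younger cohorts cannot have yet, which is the bias the slide warns about.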
More considerations
• Arbitrary thresholds
• Simple, imperfect, memorable
• Communicate: catchy names
• Alex Schultz, VP Growth Facebook
• More time-like metrics
• Activity totals vs. Behaviour step
• Time spent vs. since registration
• Demographic age vs. seniority
• Experience on wider platform
• Friends’ experience levels
MancML Growth accounting

Editor's Notes

  • #2 This talk is probably part of a series. I started by suggesting a list of 12 questions to ask: how mature is a data organization? This is a small step towards being a bit more systematic in handling your data.
  • #3 A long way of saying I’m old and cranky. Small change since last time.
  • #4 Last presentation, I talked about 12 things that needed to be there to make the small core part of data science. Mainly, it’s a good ETL process and good habits around it. A lot of that is good engineering and addressing analyst frustrations. One of the most impactful parts is reusable data structures. This is something that fewer people in your organization are likely to ask for, because it’s a less glaring pain point. But enforcing consistent concepts is important.
  • #5 No sophisticated models: this is about how to structure data. I’m more than happy to talk about convoluted models and counter-intuitive corrections.
  • #6 I learned last week that Facebook credits itself with growth accounting. I learned this during my Master’s, before Facebook was founded. Well, one of my Master’s.
  • #7 Let’s have fun! Let’s talk about unemployment!
  • #8 You see, more jobs are created than destroyed when unemployment goes down. Same thing in both economies. But how you get there is not the same. What is essential is that all things add up exactly.
  • #9 The fact that a user is considered lapsed or lost after X or Y periods is rather arbitrary. Try to pick a number that separates well: few people transition from one to the other. Once you have your period and your status, you can…
  • #10 Once you have your period and your status, you can… You can also draw a transition graph. You can, and you should, add a lot of things to that GROUP BY: you should have detailed totals by as many dimensions as you think are relevant. Do you have any questions so far? Does this make sense to you? Do you see the applications to data science?
  • #11 Now we have a relevant distinction in our population. We can now train models trying to predict it! We can compare leavers with non-leaving customers with the same maturity or experience. Keep in mind: any departure is temporary. Any cut-off is arbitrary. That doesn’t matter. What matters is that everyone is accounted for. The key feature so far is how statuses add up to the active customers.
  • #12 Would you invest in a company with that kind of growth? Now, let’s apply the framework. Of course, you know about funnels and retention, and you would have caught that. But even the funnel can be different with that insight: big drops might be relevant, but not all obstacles are hostile. Good challenge. Do not confuse progressing with being retained.
  • #13 The wider idea in this talk is: time is a key dimension in any product experience. Cohorts are a great abstraction. You need to come up with more vocabulary around them. Two geometric arguments that explain a non-trivial concept.
  • #14 The key notion in those two parts, and a word that I have oddly not used so far, is the cohort. What is a cohort? What is the widest, simplest definition of a cohort?
  • #15 Let’s represent every action of your users on a two-dimensional plane. X, the abscissa, is the time of the action. Y, the ordinate, is the time that user registered or first did something. In this corner, you can’t have any action because that would involve time travel. Or some sort of weird CGI. Don’t go there. Or rather, do go there: count how many actions you have in there. That’s a great thing to check. If there are any, you know you have a problem. This graph is actually commonly used to represent retention, using colour maps.
  • #16 This triangle becomes very useful when you are studying retention. Same brown time travelers, still empty, hopefully. If you want to know if people are retained after, say, eight weeks, you should exclude recent joiners. And most people remember to do that. What people often overlook is to exclude activity *that we know about* after eight weeks of stay. That lozenge is what you should be looking at.
  • #17 Once again, you are probably better off looking at a detailed colour map. But if you are trying to model retention, this is important. Let's assume you want to know if people who joined on week 1 are st… And you should be making sure that the right number to model is easier to find. That’s the big lesson here: *make the right metric easier to find*.
  • #18 Once again, this is not directly related to machine learning, but getting those right is essential to building a relevant model. Two things I would like to just say a word about. First: all those thresholds are arbitrary. Second: the right unit might not be calendar time. Think about alternative ways of counting: time spent in your service vs. calendar time since registration. Or think beyond what you see: wider platform, relations.
  • #19 Do you have questions? Or would you rather have me give the floor to someone smarter than me?