Data Janitor 101
Daniel Molnar, Microsoft
Data Natives 2016
tl;dr
- KISS is the philosophy,
- take the long view, invest in durable knowledge,
- strive for fast and good enough,
- just because you can doesn't mean you should.
CAP #1
BUSINESS ANALYST
"... American MBA? ... if you don’t understand something it must be simple and only take five minutes."
Sean Murphy, PingThings
Don't
- unicorn my a**,
- hockey stick here for me,
- skip leg day.
Do
- make definitions,
- show direction,
- care about data quality,
- rule dashboards.
KPIs that matter
- DAU, WAU, MAU, LTV, churn,
- cohorts, segments, funnels,
- first hour, first day.
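
As a rough illustration of the activity KPIs above, here is a minimal sketch that counts daily, weekly and monthly active users from an event log. The event list, the function names and the 1-, 7- and 30-day windows are assumptions made for this example; the slides do not prescribe a definition, which is exactly why "make definitions" comes first.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, activity_date) pairs.
EVENTS = [
    ("ann", date(2016, 10, 1)),
    ("bob", date(2016, 10, 1)),
    ("ann", date(2016, 10, 2)),
    ("cat", date(2016, 10, 20)),
    ("bob", date(2016, 10, 26)),
    ("ann", date(2016, 10, 26)),
]


def active_users(events, end_day, window_days):
    """Distinct users with at least one event in the window ending on end_day."""
    start_day = end_day - timedelta(days=window_days - 1)
    return {user for user, day in events if start_day <= day <= end_day}


def activity_kpis(events, end_day):
    """Return DAU, WAU and MAU counts for end_day (assumed 1-, 7-, 30-day windows)."""
    return {
        "DAU": len(active_users(events, end_day, 1)),
        "WAU": len(active_users(events, end_day, 7)),
        "MAU": len(active_users(events, end_day, 30)),
    }


kpis = activity_kpis(EVENTS, date(2016, 10, 26))
print(kpis)
```

The same set-based helper can be reused for churn or cohort counts by comparing the active sets of two windows.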
Approach
- KPIs must hurt (aka no feelgood metrics),
- you are what you measure,
- you can run in one direction,
- is it actionable (the Friday 1700 test).
Toolset
- Excel,
- SQL,
- Metabase.
Heroes of the day
Joel Spolsky: You Suck at Excel
Dan McKinley: Data Driven Products Now!
CAP #2
DATA ENGINEER
"Don't reinvent the flat tyre."
Alan Kay
Don't
- just Apache it,
- build a Hadoop JENGA (10x-235x slower),
- real-time it,
- stream it,
- overengineer it.
Do
- embrace dirty reality (entity recognition makes a data engineer),
- ETL, events and DWH,
- data quality (know your leakage),
- testing (yes, you can even unit test data).
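
To make "unit test data" concrete, here is a minimal sketch of row-level checks that could run after an ETL step. The sample rows, the column names and the three rules (unique key, required field, non-negative amount) are hypothetical and chosen only for illustration.

```python
# Hypothetical rows as they might arrive from an ETL step.
ROWS = [
    {"user_id": "u1", "country": "DE", "revenue": 9.99},
    {"user_id": "u2", "country": "HU", "revenue": 0.0},
    {"user_id": "u2", "country": "HU", "revenue": 0.0},   # duplicate key
    {"user_id": "u3", "country": None, "revenue": -5.0},  # missing and invalid values
]


def check_rows(rows):
    """Return a list of human-readable data quality failures (empty means pass)."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        if row["user_id"] in seen:
            failures.append(f"row {i}: duplicate user_id {row['user_id']}")
        seen.add(row["user_id"])
        if row["country"] is None:
            failures.append(f"row {i}: country is missing")
        if row["revenue"] < 0:
            failures.append(f"row {i}: negative revenue {row['revenue']}")
    return failures


failures = check_rows(ROWS)
for failure in failures:
    print(failure)
```

Counting the failures per batch, rather than only failing hard, is one way to keep track of how much data leaks out of the pipeline.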
Approach
- avoid GIGO,
- pedal to the metal, skip the overhead,
- know that big RAM is eating big data,
- use open source, pragmatic, cloud service agnostic tools.
Toolset
- UNIX (bash, make),
- Python,
- SQL,
- ETL in batch (mETL, night-shift),
- event tracking (Hamustro, logsanitizer, RPi?),
- DWH = MPP SQL (Azure DWH, Redshift, Vertica...).
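
A batch ETL step with this toolset can be very small. The sketch below uses only the Python standard library, with an in-memory SQLite database standing in for the warehouse; it does not use mETL or night-shift, and the CSV content, table layout and function name are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file dropped by a nightly job.
RAW_CSV = """user_id,event,ts
u1,signup,2016-10-01
u2,signup,2016-10-01
u1,purchase,2016-10-02
,purchase,2016-10-02
"""


def run_batch(raw_csv, conn):
    """Extract rows from CSV, drop rows without a user_id, load the rest into SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, event TEXT, ts TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    clean = [(r["user_id"], r["event"], r["ts"]) for r in rows if r["user_id"]]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", clean)
    conn.commit()
    # Report how many rows were rejected so the leakage is known.
    return len(rows) - len(clean)


conn = sqlite3.connect(":memory:")  # stand-in for the real DWH connection
rejected = run_batch(RAW_CSV, conn)
per_event = conn.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
).fetchall()
print(rejected, per_event)
```

Once the data is loaded, the reporting side is plain SQL, which is the point of keeping the warehouse an SQL system.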
Heroes of the day
James Mickens: Computers are a Sadness, I am the Cure
Dan McKinley: Choose Boring Technology
David Beazley: Discovering Python
CAP #3
DATA SCIENTIST
"Friends don’t let friends calculate p-values (without fully understanding them)."
Scott Weingart
Don't
- expect CSVs and produce models whatever it takes,
- expect that you have to explore the laws of the Universe,
- forget about Occam's razor,
- A/B test (only if it REALLY REALLY makes sense).
Do
- user testing to define context (usertesting.com),
- talk to users via surveys,
- embed yourself in departments (personas),
- have common sense.
Approach
- you mostly tell what not to do,
- it's hard, but still the only way,
- persist when you find nothing or only trivialities,
- kill the lurking causation.
A/B
- think twice about TCO,
- the world isn’t identically distributed,
- random variation will cheat you in small samples,
- most A/B test results are illusory,
- small data -> go Bayesian = less certainty.
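
The small-sample point can be illustrated with a simple Bayesian comparison. The sketch below estimates the probability that variant B converts better than variant A by sampling from Beta posteriors. The uniform Beta(1, 1) prior, the conversion counts and the function name are assumptions made for this example, not something taken from the talk.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Same observed rates (10% vs 15%), very different sample sizes.
small = prob_b_beats_a(2, 20, 3, 20)
large = prob_b_beats_a(200, 2000, 300, 2000)
print(f"small sample: {small:.2f}, large sample: {large:.2f}")
```

With 20 users per arm the same observed lift leaves the question largely open, while with 2000 users per arm the posterior is close to certain. That is the "less certainty" a Bayesian reading gives you on small data.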
Toolset
- SQL,
- Wizard,
- Python,
- R (only to anger CS peeps).
Heroes of the day
Evan Miller: Wizard Statistical Analyzer
Chris Stucchio talks and posts on testing
CAP #4
MACHINE LEARNING
Don't
- need a PhD,
- develop new unique matrix algos, please,
- need more than Excel,
- give false hope.
Do
- deploy good enough fast,
- copy Kaggle (ensembles, random forest, XGBoost),
- feature engineer,
- build core data/feature (augment and enhance).
Approach
- the Mailchimp way (offline built model redeployed each quarter),
- hybrid approaches (domain expert, vanilla ML),
- you are a machine instructor,
- TensorFlow (logic to clients, handle models).
Toolset
- Excel,
- Wizard,
- BigML,
- Python.
Heroes of the day
John Foreman: Data Smart
Jeroen Janssen: Data Science at the Command Line
CAP #5
HEAD OF DATA
"In god we trust, everybody else bring data to the table."
W. Edwards Deming
Don't
- believe the hype,
- trust anyone, only benchmarks,
- let the black box take over,
- expect hiring to be easy.
Do
- maintain data mythology,
- keep the view backwards straight,
- expect emotions,
- see the future.
Approach
- train to be the bearer of bad news,
- laugh at endless growth without saturation,
- handle the cargo cult (inverse causality).
Marketing
- Google Analytics (sampling, off by 20%, no user granularity, no raw data, 150k per year),
- CPA, FB CPA, mobile CPA, conversion, attribution,
- Net Promoter Score.
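
Two of the marketing metrics above are simple enough to compute by hand, which helps when checking a vendor dashboard. The sketch below uses the standard definitions (promoters rate 9-10, detractors 0-6 on a 0-10 scale; CPA is spend divided by acquisitions). The sample ratings, campaign numbers and function names are hypothetical.

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def cost_per_acquisition(spend, acquisitions):
    """CPA = total spend divided by the number of acquired customers."""
    if acquisitions == 0:
        raise ValueError("no acquisitions, CPA is undefined")
    return spend / acquisitions


# Hypothetical survey answers and campaign numbers.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
nps = net_promoter_score(ratings)
cpa = cost_per_acquisition(spend=1200.0, acquisitions=48)
print(nps, cpa)
```

Running the same function per channel (for example Facebook or mobile) gives the channel-level CPA figures listed on the slide.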
Heroes of the day
Dan Lyons: Disrupted
Venkatesh Rao: The Gervais Principle
Thank you!
@soobrosa
visuals: @xkcd, @DorsaAmir, ˙Cаvin 〄,
thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist,
Kalexanderson

"The Data Janitor 101", Daniel Molnar, Senior Data Scientist at Microsoft