Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
1
Data Janitor 101
Daniel Molnar, Microsoft
Data Natives 2016
2
tl;dr
4 KISS is the philosophy,
4 take the long view, invest in durable knowledge,
4 strive for fast and good enough,
4 ju...
CAP #1
BUSINESS ANALYST
4
"... American MBA? ... if
you don’t understand
something it must be simple
and only take five
minutes."
1
Sean Murphy, Pin...
Don't
4 unicorn my a**,
4 hockey stick here for me,
4 skip leg day.
6
Do
4 make definitions,
4 show direction,
4 care about data quality,
4 rule dashboards.
7
KPIs that matter
4 DAU, WAU, MAU, LTV, churn,
4 cohorts, segments, funnels,
4 first hour, first day.
8
Approach
4 KPIs must hurt (aka no feelgood metrics),
4 you are what you measure,
4 you can run in one direction,
4 is it a...
Toolset
4 Excel,
4 SQL,
4 Metabase.
10
Heroes of the day
Joel Spolsky: You Suck at Excel
Dan McKinley: Data Driven Products Now!
11
CAP #2
DATA ENGINEER
12
"Don't reinvent
the flat tyre."
1
Alan Kay
13
Don't
4 just Apache it,
4 build a Hadoop JENGA (10x-235x slow),
4 real-time it,
4 stream it,
4 overengineer it.
14
Do
4 embrace dirty reality
(entity recognition makes a data engineer),
4 ETL, events and DWH,
4 data quality (know your le...
Approach
4 avoid GIGO,
4 pedal to the metal, skip the overhead,
4 know that big RAM is eating big data,
4 use open source,...
Toolset
4 UNIX (bash, make),
4 Python,
4 SQL,
4 ETL in batch (mETL, night-shift)
4 event tracking (Hamustro, logsanitizer,...
Heroes of the day
James Mickens: Computers are a Sadness, I am the Cure
Dan McKinley: Choose Boring Technology
David Beazl...
CAP #3
DATA SCIENTIST
19
"Friends don’t let friends
calculate p-values
(without fully
understanding them)."
1
Scott Weingart
20
Don't
4 expect CSVs and produce models whatever it takes,
4 expect that you have to explore the laws of Universe,
4 forget...
Do
4 user testing to define context (usertesting.com),
4 talk to users via surveys,
4 embed yourself in departments (perso...
Approach
4 you mostly tell what not to do,
4 it's hard, but still the only way,
4 persist when not finding anything or tri...
A/B
4 think twice about TCO,
4 the world isn’t identically distributed,
4 random variation will cheat you in small samples...
Toolset
4 SQL,
4 Wizard,
4 Python,
4 R (only to anger CS peeps).
25
Heroes of the day
Evan Miller: Wizard Statistical Analyzer
Chris Stucchio talks and posts on testing
26
Machine Learning
CAP #4
27
Don't
4 need a PhD,
4 develop new unique matrix algos, please,
4 need more than Excel,
4 give false hope.
28
Do
4 deploy good enough fast,
4 copy Kaggle (ensembles, random forest, XGBoost),
4 feature engineer,
4 build core data/fea...
Approach
4 the Mailchimp way
(offline built model redeployed each quarter),
4 hybrid approaches (domain expert, vanilla ML...
Toolset
4 Excel,
4 Wizard,
4 BigML,
4 Python.
31
Heroes of the day
John Foreman: Data Smart
Jeroen Janssen: Data Science at the Command Line
32
CAP #5
HEAD OF DATA
33
"In god we trust
everybody else bring
data to the table."
1
W. Edwards Deming
34
Don't
4 believe the hype,
4 trust no-one, just benchmarks,
4 let black box take over,
4 expect hiring to be easy.
35
Do
4 maintain data mythology,
4 keep the view backwards straight,
4 expect emotions,
4 see the future.
36
Approach
4 train to be the bearer of the bad news,
4 laugh at endless growth without saturation,
4 handle the cargo cult (...
Marketing
4 Google Analytics (sampling, off by 20%, no user
granularity, no raw, 150k per year),
4 CPA, FB CPA, mobile CPA...
Heroes of the day
Dan Lyons: Disrupted
Venkatesh Rao: The Gervais Principle
39
Thank you!
@soobrosa
visuals: @xkcd, @DorsaAmir, ˙Cаvin 〄,
thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist,
Kalex...
Upcoming SlideShare
Loading in …5
×

Data Janitor 101

692 views

Published on

If you’re trying to solve data related problems with no or limited resources, be them time, money or skills don’t go no further. This talk points mostly to decades old technology, free operating systems and cheap hardware if possible, but if it makes sense to spend a hundred bucks instead of tearing your hair, I’ll say so.

This talk is opinionated. This talk is for the underdog. Talks on theory and awkward technology descriptions are abundant, deep learning is the craze, but when it comes to the daily grind of data people around the world the silence is deep. I'm trying to share tips and tricks and fully baked recipes - even a full data stack with open source code relying on good ol' seventies tech for durability.

I'm sure that you've all made your related decision the best you could for reasons you have, but if you can influence decisions still here are my two cents. This talk does not contain made up sample code and false promises of fancy technology. I talk about stuff we use in production. Period. It’s gonna be fine, but your mileage may vary.

Published in: Data & Analytics
  • Be the first to comment

Data Janitor 101

  1. 1. 1
  2. 2. Data Janitor 101 Daniel Molnar, Microsoft Data Natives 2016 2
  3. 3. tl;dr 4 KISS is the philosophy, 4 take the long view, invest in durable knowledge, 4 strive for fast and good enough, 4 just because you can doesn't mean you should. 3
  4. 4. CAP #1 BUSINESS ANALYST 4
  5. 5. "... American MBA? ... if you don’t understand something it must be simple and only take five minutes." 1 Sean Murphy, PingThings 5
  6. 6. Don't 4 unicorn my a**, 4 hockey stick here for me, 4 skip leg day. 6
  7. 7. Do 4 make definitions, 4 show direction, 4 care about data quality, 4 rule dashboards. 7
  8. 8. KPIs that matter 4 DAU, WAU, MAU, LTV, churn, 4 cohorts, segments, funnels, 4 first hour, first day. 8
  9. 9. Approach 4 KPIs must hurt (aka no feelgood metrics), 4 you are what you measure, 4 you can run in one direction, 4 is it actionable (the Friday 1700 test). 9
  10. 10. Toolset 4 Excel, 4 SQL, 4 Metabase. 10
  11. 11. Heroes of the day Joel Spolsky: You Suck at Excel Dan McKinley: Data Driven Products Now! 11
  12. 12. CAP #2 DATA ENGINEER 12
  13. 13. "Don't reinvent the flat tyre." 1 Alan Kay 13
  14. 14. Don't 4 just Apache it, 4 build a Hadoop JENGA (10x-235x slow), 4 real-time it, 4 stream it, 4 overengineer it. 14
  15. 15. Do 4 embrace dirty reality (entity recognition makes a data engineer), 4 ETL, events and DWH, 4 data quality (know your leakage), 4 testing (yes, you can even unit test data). 15
  16. 16. Approach 4 avoid GIGO, 4 pedal to the metal, skip the overhead, 4 know that big RAM is eating big data, 4 use open source, pragmatic, cloud service agnostic tools. 16
  17. 17. Toolset 4 UNIX (bash, make), 4 Python, 4 SQL, 4 ETL in batch (mETL, night-shift) 4 event tracking (Hamustro, logsanitizer, RPi?), 4 DWH = MPP SQL (Azure DWH, Redshift, Vertica...). 17
  18. 18. Heroes of the day James Mickens: Computers are a Sadness, I am the Cure Dan McKinley: Choose Boring Technology David Beazley: Discovering Python 18
  19. 19. CAP #3 DATA SCIENTIST 19
  20. 20. "Friends don’t let friends calculate p-values (without fully understanding them)." 1 Scott Weingart 20
  21. 21. Don't 4 expect CSVs and produce models whatever it takes, 4 expect that you have to explore the laws of Universe, 4 forget about Occam's razor, 4 A/B test (only if it REALLY REALLY makes sense). 21
  22. 22. Do 4 user testing to define context (usertesting.com), 4 talk to users via surveys, 4 embed yourself in departments (personas), 4 have common sense. 22
  23. 23. Approach 4 you mostly tell what not to do, 4 it's hard, but still the only way, 4 persist when not finding anything or trivialities, 4 kill teh lurking causation. 23
  24. 24. A/B 4 think twice about TCO, 4 the world isn’t identically distributed, 4 random variation will cheat you in small samples, 4 most A/B test results are illusory, 4 small data -> go Bayesian = less certainty. 24
  25. 25. Toolset 4 SQL, 4 Wizard, 4 Python, 4 R (only to anger CS peeps). 25
  26. 26. Heroes of the day Evan Miller: Wizard Statistical Analyzer Chris Stucchio talks and posts on testing 26
  27. 27. Machine Learning CAP #4 27
  28. 28. Don't 4 need a PhD, 4 develop new unique matrix algos, please, 4 need more than Excel, 4 give false hope. 28
  29. 29. Do 4 deploy good enough fast, 4 copy Kaggle (ensembles, random forest, XGBoost), 4 feature engineer, 4 build core data/feature (augment and enhance). 29
  30. 30. Approach 4 the Mailchimp way (offline built model redeployed each quarter), 4 hybrid approaches (domain expert, vanilla ML), 4 you are a machine instructor, 4 Tensorflow (logic to clients, handle models). 30
  31. 31. Toolset 4 Excel, 4 Wizard, 4 BigML, 4 Python. 31
  32. 32. Heroes of the day John Foreman: Data Smart Jeroen Janssen: Data Science at the Command Line 32
  33. 33. CAP #5 HEAD OF DATA 33
  34. 34. "In god we trust everybody else bring data to the table." 1 W. Edwards Deming 34
  35. 35. Don't 4 believe the hype, 4 trust no-one, just benchmarks, 4 let black box take over, 4 expect hiring to be easy. 35
  36. 36. Do 4 maintain data mythology, 4 keep the view backwards straight, 4 expect emotions, 4 see the future. 36
  37. 37. Approach 4 train to be the bearer of the bad news, 4 laugh at endless growth without saturation, 4 handle the cargo cult (inverse causality). 37
  38. 38. Marketing 4 Google Analytics (sampling, off by 20%, no user granularity, no raw, 150k per year), 4 CPA, FB CPA, mobile CPA, conversion, attribution, 4 Net Promoter Score. 38
  39. 39. Heroes of the day Dan Lyons: Disrupted Venkatesh Rao: The Gervais Principle 39
  40. 40. Thank you! @soobrosa visuals: @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist, Kalexanderson 40

×