Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The Data Janitor Returns

601 views

Published on

This talk is for the underdog. If you're trying to solve data related problems with no or limited resources, be them time, money or skills don't go no further. This talk is opinionated and updated to GDPR, deep learning and all the hype .

Published in: Data & Analytics
  • DOWNLOAD. THIS BOOKS INTO AVAILABLE FORMAT ......................................................................................................................... ......................................................................................................................... DOWNLOAD. PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... DOWNLOAD. EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... DOWNLOAD. doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... DOWNLOAD. PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... DOWNLOAD. EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... DOWNLOAD. doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .............. Browse by Genre Available eBooks ......................................................................................................................... Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult,
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

The Data Janitor Returns

  1. 1. Daniel Molnar @ Oberlo/Shopify / Data Natives @ Berlin @ 2018-11-22
  2. 2. Where I'm coming from • senior data analy.cs engineer, • head of data and analy.cs, • senior applied and data scien.st, • data analyst, • or just data janitor.
  3. 3. Perspec've • rounded, not complete, • slow, old, stupid and lazy and
  4. 4. tl;dr
  5. 5. tl;dr • KISS is the philosophy, • take the long view, invest in durable knowledge, • strive for fast and good enough, • just because you can doesn't mean you should.
  6. 6. tl;dr (new) • KISS is the philosophy, • take the long view, invest in durable knowledge, • strive for fast and good enough, • just because you can doesn't mean you should, • figure what to worry about, • you are not Google.
  7. 7. it used to be a hype now this is a war nobody's your friend they want your money and data (preferably both locked in)
  8. 8. Things you worry about: • machine learning, • deep learning, • GDPR.
  9. 9. Things you should really worry about: • machine learning adblockers, • deep learning ELT, • GDPR, CRM (yes, CRM).
  10. 10. AGGREGATE & LABEL
  11. 11. Don't skip leg day.
  12. 12. Do make programma'c KPI defini'ons.
  13. 13. Look at the *** data
  14. 14. Toolset Python, (P)SQL, Metabase.
  15. 15. Usual suspect: NPS • one, simple number you can squint at, • sampling is skewed, • answer is unsure, • easy to hack step func:on1 , MONKEYPATCH: look at the change of the distro. 1 Eve Rajca aka @EveTheAnalyst
  16. 16. Predictably wrong? Google Analy4cs!
  17. 17. Hero of the day Mar$n Loetzsch @mar$n_loetzsch -=- KPIs for e-commerce startups Data Science in Early Stage Startups: the Struggle to Create Value https://github.com/mara
  18. 18. LEARN & OPTIMIZE
  19. 19. Half of the *me when companies say they need "AI" what they really need is a SELECT clause with GROUP BY. You're welcome. — Mat Velloso @matvelloso (Technical Advisor to CTO at Microso9)
  20. 20. Don't do A/B tests 99% it will not worth doing it
  21. 21. ... conversion rate is 2% ... detec0ng a rela0ve change of 1% requires an experiment with 12 million users ... — Simon Jackson (Booking.com)
  22. 22. R?Shiny.
  23. 23. Usual suspects • Non-reproducable experiments and tests. • R hodpepodge in produc9on. • Beliefs hidden as implicits in models.
  24. 24. ML~AI~DEEP*
  25. 25. You don't have (enough) data. @karpathy
  26. 26. Make your own data points!
  27. 27. Deploy good enough fast?
  28. 28. Deep learn my *** Do you really need it? Tensorflow! ... ... so distributed deep learning can compress porn on the end device.
  29. 29. Hero of the day Szilard [Deeper than Deep Learning] @DataScienceLA -=- Be#er than Deep Learning: Gradient Boos4ng Machines (GBMs) https://github.com/ szilard/benchm-ml
  30. 30. Spark MLlibs GBM implementa3on is 10x slower, uses 10x more memory and is buggy/ lower accuracy. Total fucking garbage! — Szilard [Deeper than Deep Learning] @DataScienceLA
  31. 31. MOVE STORE EXPLORE TRANSFORM
  32. 32. Q: Why are there so many programmers from Eastern Europe? A: Slavic pessimism. Everything that can go wrong will go wrong. With such a mindset programming comes naturally. — Mar&n Sustrik @sustrik (Creator of ZeroMQ, nanomsg, libdill.)
  33. 33. over engineering @elmoswelt
  34. 34. you get an other machine if you can use one
  35. 35. Do embrace dirty reality.
  36. 36. Get cloud agnos.c! • AWS s'll leads the pack by far • Azure will sell anyway, and all will cry, • Google competes with the cheap and uncooked
  37. 37. ETL is #solved OMG • Airflow is an overengineered underperforming nightmare, • metl for source mappings in magnitude, • Mara for generic e-commerce, • night-shift for explicit minimalism.
  38. 38. Showdown
  39. 39. Hero of the day Mark Litwintschik @marklit82 Summary of the 1.1 Billion Taxi Rides Benchmarks (500 GB uncompressed CSV) https:// tech.marksblogg.com
  40. 40. Spark Setup Query Median QM per vCPU Cost/hour 11 x m3.xlarge + HDFS 14,91 0,34 27,5 1 x i3.8xlarge + HDFS 26,00 0,81 2,5 21 x m3.xlarge + HDFS 32,00 0,38 5,67 5 x m3.xlarge + S3 466,50 23,33 1,35 3 x Raspberry Pi 1738,00 144,83 HDFS. RPi = 1/6 VCPU ~100 EUR. Linear scaling.
  41. 41. Presto Setup Query Median QM per vCPU Cost/hour 50 x n1-standard-4 7,00 0,04 9.50 21 x m3.xlarge 11,50 0,14 5.67 10 x n1-standard-4 16,00 0,36 2.09 1 x i3.8xlarge + HDFS 15,00 0,47 2.50 5 x m3.xlarge + HDFS 51,50 0,26 1.35 50 x m3.xlarge + S3 43,50 0,22 13.50 Workhorse in favour. HDFS. 1 machine. Non-linear scaling.
  42. 42. Lazy Evalua*on Setup Query Median Cost/hour Redshi', 6 x ds2.8xlarge 1,91 40.80 BigQuery 2,00 Amazon Athena 6,30 Presto, 50 x n1-standard-4 7,00 9.50 Spark, 11 x m3.xlarge + HDFS 14,91 27.50 The human cost -- in both terms.
  43. 43. One Machine Setup Query Median QM per vCPU Cost/hour ClickHouse 4,21 1,05 Elas3csearch tuned 13,14 3,29 Presto, 1 x i3.8xlarge + HDFS 15,00 0.47 2.50 Spark, 1 x i3.8xlarge + HDFS 26,00 0,81 2.50 Ver3ca 32,80 8,20 Elas3csearch 48,89 12,22 PSQL 9.5 + cstore_fdw 205,00 51,25 Intel Core i5 4670K VS i3.8xlarge (32 VCPUs). Desktop example costs <600 EUR.
  44. 44. Do you use adblocking?
  45. 45. Do you use Google Analy+cs?
  46. 46. 9%of the events are lost to ~all third party trackers due to adblocking.
  47. 47. Sink > Sieve > Sort ELT aka SQL on flat files with the minimum amount of code wri:en.
  48. 48. BIRD OF PREY
  49. 49. Who are you? • Lip service provider. • Fake news producer. • Kingmaker. Are you the fool or the grey eminent?
  50. 50. Don't believe the hype. HR: good people leave.
  51. 51. Marke&ng
  52. 52. Will this ever get be-er? • adblocking, • CPA silver bullets are gone, • conversion & a8ribu9on are hard nuts, • FB and GO are not your friends (the 900% on videos), • but CRM is.
  53. 53. GDPR • road to hell is paved with good inten2ons, • it's about the process, matey, • mostly fair, • yes, you have to clean up your mess, • dunno, wouldn't buy programma2c shares2 . 2 Eve Rajca aka @EveTheAnalyst and Jacques Ma9heij aka @jma9heij
  54. 54. Thank you! @soobrosa We're hiring! visuals: @mroga., @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist, Kalexanderson, Shopify Burst

×