Daniel Molnar @ Oberlo/Shopify / Data Natives @ Berlin @ 2018-11-22
Where I'm coming from
• senior data analytics engineer,
• head of data and analytics,
• senior applied and data scientist,
• data analyst,
• or just data janitor.
Perspective
• rounded, not complete,
• slow, old, stupid and lazy and
tl;dr (new)
• KISS is the philosophy,
• take the long view, invest in durable knowledge,
• strive for fast and good enough,
• just because you can doesn't mean you should,
• figure what to worry about,
• you are not Google.
it used to be a hype
now this is a war
nobody's your friend
they want your money and data (preferably both locked in)
Things you worry about:
• machine learning,
• deep learning,
• GDPR.
Things you should really worry about:
• machine learning adblockers,
• deep learning ELT,
• GDPR, CRM (yes, CRM).
AGGREGATE
& LABEL
Don't skip
leg day.
Do
make
programmatic KPI definitions.
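One way to read "programmatic KPI definitions" is to keep each KPI as a small, named, testable function rather than a formula buried in a dashboard. The sketch below is only an illustration: the `KPI` class, the row fields (`sessions`, `orders`, `revenue`) and the two example metrics are assumptions, not something the deck specifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical daily row: {"sessions": ..., "orders": ..., "revenue": ...}
Row = Dict[str, float]


@dataclass(frozen=True)
class KPI:
    """A KPI is a name, a human-readable definition, and the code that computes it."""
    name: str
    description: str
    compute: Callable[[List[Row]], float]


def _ratio(numerator: float, denominator: float) -> float:
    # Guard against empty periods instead of raising ZeroDivisionError.
    return numerator / denominator if denominator else 0.0


def conversion_rate(rows: List[Row]) -> float:
    return _ratio(sum(r["orders"] for r in rows), sum(r["sessions"] for r in rows))


def average_order_value(rows: List[Row]) -> float:
    return _ratio(sum(r["revenue"] for r in rows), sum(r["orders"] for r in rows))


# The single place where KPIs are defined; reports and tests both read from it.
KPIS = [
    KPI("conversion_rate", "orders / sessions", conversion_rate),
    KPI("average_order_value", "revenue / orders", average_order_value),
]


def report(rows: List[Row]) -> Dict[str, float]:
    return {kpi.name: kpi.compute(rows) for kpi in KPIS}


sample = [
    {"sessions": 100, "orders": 2, "revenue": 50.0},
    {"sessions": 100, "orders": 3, "revenue": 100.0},
]
result = report(sample)
```

Because the definitions live in code, a change to what "conversion" means shows up in version control and can be covered by a test.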
Look at the *** data
Toolset
Python,
(P)SQL,
Metabase.
Usual suspect: NPS
• one, simple number you can squint at,
• sampling is skewed,
• answer is unsure,
• easy to hack step function [1],
MONKEYPATCH: look at the change of the distro.
[1] Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
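A minimal sketch of the "look at the change of the distro" idea, assuming the usual NPS buckets (0-6 detractor, 7-8 passive, 9-10 promoter). The function names and the sample scores are made up for illustration. The two samples below have the same NPS, yet the shape of the answers moved a lot.

```python
from collections import Counter

BUCKETS = ("detractor", "passive", "promoter")


def bucket(score: int) -> str:
    # Standard NPS bucketing of a 0-10 answer.
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"


def distribution(scores):
    """Share of responses per bucket."""
    if not scores:
        raise ValueError("need at least one response")
    counts = Counter(bucket(s) for s in scores)
    return {b: counts[b] / len(scores) for b in BUCKETS}


def nps(scores) -> float:
    d = distribution(scores)
    return 100 * (d["promoter"] - d["detractor"])


def distribution_shift(before, after):
    """Change in bucket shares between two survey periods."""
    b, a = distribution(before), distribution(after)
    return {k: a[k] - b[k] for k in BUCKETS}


# Hypothetical samples: 3 promoters / 4 passives / 3 detractors,
# then 5 promoters / 0 passives / 5 detractors.
before = [10, 9, 9, 8, 8, 7, 7, 5, 4, 3]
after = [10, 10, 9, 9, 9, 6, 5, 4, 2, 0]

shift = distribution_shift(before, after)
```

Here the single number stays flat while the passives disappear into both extremes, which is exactly what squinting at one number hides.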
Google
Analytics?
Hero of the day
Martin Loetzsch
@martin_loetzsch
-=-
KPIs for e-commerce startups
Data Science in Early Stage
Startups: the Struggle to Create
Value
https://github.com/mara
LEARN &
OPTIMIZE
Half of the time when companies
say they need "AI" what they really
need is a SELECT clause with
GROUP BY. You're welcome.
— Mat Velloso @matvelloso (Technical Advisor to CTO at Microsoft)
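For scale, this is roughly what "a SELECT clause with GROUP BY" looks like in practice. The sketch uses Python's built-in `sqlite3` with an in-memory database; the `orders` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("DE", 20.0), ("DE", 30.0), ("US", 10.0)],  # hypothetical data
)

# Orders and revenue per country, biggest market first.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS orders, SUM(revenue) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC
    """
).fetchall()

for country, orders, revenue in rows:
    print(country, orders, revenue)
```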
Don't do A/B tests
99% of the time it will not be worth doing.
... conversion rate is 2% ... detecting
a relative change of 1% requires an
experiment with 12 million users ...
— Simon Jackson (Booking.com)
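A back-of-the-envelope check of that order of magnitude, using the textbook normal-approximation sample size for comparing two proportions. The 5% two-sided significance level and 80% power are assumptions of this sketch. The quote does not state its own settings, so the result is not expected to match the 12 million figure exactly, only to land in the same "millions of users" territory.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH arm of a two-arm test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 2% baseline conversion, 1% relative change (2.00% -> 2.02%).
per_arm = users_per_arm(baseline=0.02, relative_lift=0.01)
total = 2 * per_arm
print(f"{per_arm:,} users per arm, {total:,} in total")
```

Under these assumptions the requirement is several million users per arm. A 10% relative lift needs far fewer, which is why small effects on low conversion rates are the expensive case.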
R?Shiny.
Usual suspects
• Non-reproducible experiments and tests.
• R hodgepodge in production.
• Beliefs hidden as implicits in models.
ML~AI~DEEP*
You don't have (enough) data.
Make your own data points!
Deploy good enough fast?
Deep learn my ***
Do you really need it?
Tensorflow! ...
... so distributed deep learning
can compress porn on the end
device.
Hero of the day
Szilard [Deeper than Deep
Learning] @DataScienceLA
-=-
Better than Deep Learning:
Gradient Boosting Machines
(GBMs)
https://github.com/
szilard/benchm-ml
Spark MLlibs GBM implementation is 10x
slower, uses 10x more memory and is buggy/
lower accuracy. Total fucking garbage!
— Szilard [Deeper than Deep Learning] @DataScienceLA
MOVE
STORE
EXPLORE
TRANSFORM
Q: Why are there so many
programmers from Eastern Europe?
A: Slavic pessimism. Everything that
can go wrong will go wrong. With
such a mindset programming comes
naturally.
— Martin Sustrik @sustrik (Creator of ZeroMQ, nanomsg, libdill.)
over
engineering
you get another machine
if you can use
one
Do embrace
dirty reality.
Get cloud agnostic!
• AWS still leads the pack by far
• Azure will sell anyway, and all will cry,
• Google competes with the cheap and uncooked
ETL is #solved OMG
• Airflow is an overengineered underperforming nightmare,
• metl for source mappings in magnitude,
• Mara for generic e-commerce,
• night-shift for explicit minimalism.
Showdown
Hero of the day
Mark Litwintschik @marklit82
Summary of the 1.1 Billion Taxi
Rides Benchmarks (500 GB
uncompressed CSV)
https://
tech.marksblogg.com
Spark
Setup                    Query Median   QM per vCPU   Cost/hour
11 x m3.xlarge + HDFS    14.91          0.34          27.5
1 x i3.8xlarge + HDFS    26.00          0.81          2.5
21 x m3.xlarge + HDFS    32.00          0.38          5.67
5 x m3.xlarge + S3       466.50         23.33         1.35
3 x Raspberry Pi         1738.00        144.83        -
HDFS. RPi = 1/6 VCPU ~100 EUR. Linear scaling.
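The deck does not define "QM per vCPU". Dividing the query median by the total number of vCPUs in the setup reproduces the Spark figures for the EC2 rows, so that reading is sketched below. The 4 vCPUs per m3.xlarge come from AWS instance specifications and are an assumption here; the 32 vCPUs for i3.8xlarge are stated later in the deck.

```python
# vCPUs per instance type (m3.xlarge: assumed from AWS specs; i3.8xlarge: from the deck).
VCPUS_PER_NODE = {"m3.xlarge": 4, "i3.8xlarge": 32}

# (nodes, instance type, query median, QM per vCPU as printed on the slide)
spark_rows = [
    (11, "m3.xlarge", 14.91, 0.34),
    (1, "i3.8xlarge", 26.00, 0.81),
    (21, "m3.xlarge", 32.00, 0.38),
    (5, "m3.xlarge", 466.50, 23.33),
]


def qm_per_vcpu(nodes: int, instance: str, query_median: float) -> float:
    """Query median divided by the cluster's total vCPU count."""
    return query_median / (nodes * VCPUS_PER_NODE[instance])


derived = [qm_per_vcpu(n, inst, qm) for n, inst, qm, _ in spark_rows]

for (n, inst, qm, published), value in zip(spark_rows, derived):
    print(f"{n} x {inst}: derived {value:.3f}, slide says {published}")
```

Read this way, a lower number means less query time per vCPU.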
Presto
Setup                    Query Median   QM per vCPU   Cost/hour
50 x n1-standard-4       7.00           0.04          9.50
21 x m3.xlarge           11.50          0.14          5.67
10 x n1-standard-4       16.00          0.36          2.09
1 x i3.8xlarge + HDFS    15.00          0.47          2.50
5 x m3.xlarge + HDFS     51.50          0.26          1.35
50 x m3.xlarge + S3      43.50          0.22          13.50
Workhorse in favour. HDFS. 1 machine. Non-linear scaling.
Lazy Evaluation
Setup                            Query Median   Cost/hour
Redshift, 6 x ds2.8xlarge        1.91           40.80
BigQuery                         2.00           -
Amazon Athena                    6.30           -
Presto, 50 x n1-standard-4       7.00           9.50
Spark, 11 x m3.xlarge + HDFS     14.91          27.50
The human cost -- in both terms.
One Machine
Setup                             Query Median   QM per vCPU   Cost/hour
ClickHouse                        4.21           1.05          -
Elasticsearch tuned               13.14          3.29          -
Presto, 1 x i3.8xlarge + HDFS     15.00          0.47          2.50
Spark, 1 x i3.8xlarge + HDFS      26.00          0.81          2.50
Vertica                           32.80          8.20          -
Elasticsearch                     48.89          12.22         -
PSQL 9.5 + cstore_fdw             205.00         51.25         -
Intel Core i5 4670K VS i3.8xlarge (32 VCPUs). Desktop example costs <600 EUR.
Do you
use adblocking?
Do you use
Google Analytics?
9% of the events are lost to ~all third party trackers
due to adblocking.
Sink > Sieve > Sort
ELT aka SQL on flat files with the minimum amount of code written.
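A small sketch of "SQL on flat files" with little code, using only the Python standard library. Reading "sink" as landing the raw file untouched, "sieve" as filtering and "sort" as ordering the result is an interpretation, not the deck's definition; the CSV content and the table name `raw_events` are invented.

```python
import csv
import io
import sqlite3

# Stand-in for a flat file on disk (hypothetical content).
RAW_CSV = """event,country,amount
purchase,DE,20
pageview,DE,
purchase,US,10
purchase,DE,30
"""


def sink(conn: sqlite3.Connection, text: str) -> None:
    """Load the file as-is: every column is text, nothing is cleaned yet."""
    conn.execute("CREATE TABLE raw_events (event TEXT, country TEXT, amount TEXT)")
    rows = list(csv.DictReader(io.StringIO(text)))
    conn.executemany(
        "INSERT INTO raw_events VALUES (:event, :country, :amount)", rows
    )


def sieve_and_sort(conn: sqlite3.Connection):
    """Do the transformation in SQL, after loading (the T in ELT comes last)."""
    return conn.execute(
        """
        SELECT country, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_events
        WHERE event = 'purchase'
        GROUP BY country
        ORDER BY revenue DESC
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
sink(conn, RAW_CSV)
revenue_by_country = sieve_and_sort(conn)
```

Keeping the raw table untouched means a wrong transformation can be fixed by rerunning one query instead of re-extracting the data.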
BIRD
OF
PREY
Who are you?
• Lip service provider.
• Fake news producer.
• Kingmaker.
Are you the fool
or the grey eminence?
Don't believe the hype.
HR: good people leave.
Marketing
Will this ever get better?
• adblocking,
• CPA silver bullets are gone,
• conversion & attribution are hard nuts,
• FB and GO are not your friends (the 900% on videos),
• but CRM is.
GDPR
• road to hell is paved with good intentions,
• it's about the process, matey,
• mostly fair,
• yes, you have to clean up your mess,
• dunno, wouldn't buy programmatic shares [1].
[1] Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
Thank you!
@soobrosa
We're hiring!
visuals: @mrogati, @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk,
jonolist, Kalexanderson, Shopify Burst

The Data Janitor Returns | Daniel Molnar | DN18

  • 1. Daniel Molnar @ Oberlo/Shopify / Data Natives @ Berlin @ 2018-11-22
  • 2. Where I'm coming from • senior data analytics engineer, • head of data and analytics, • senior applied and data scientist, • data analyst, • or just data janitor.
  • 3. Perspective • rounded, not complete, • slow, old, stupid and lazy and
  • 4. tl;dr (new) • KISS is the philosophy, • take the long view, invest in durable knowledge, • strive for fast and good enough, • just because you can doesn't mean you should, • figure what to worry about, • you are not Google.
  • 5. It used to be a hype. Now this is a war. Nobody's your friend. They want your money and data (preferably both locked in).
  • 6. Things you worry about: • machine learning, • deep learning, • GDPR.
  • 7. Things you should really worry about: • machine learning adblockers, • deep learning ELT, • GDPR, CRM (yes, CRM).
  • 8.
  • 12. Look at the *** data
  • 14. Usual suspect: NPS • one, simple number you can squint at, • sampling is skewed, • answer is unsure, • easy to hack step function [1]. MONKEYPATCH: look at the change of the distro. [1] Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
  • 16. Hero of the day: Martin Loetzsch @martin_loetzsch -=- KPIs for e-commerce startups; Data Science in Early Stage Startups: the Struggle to Create Value; https://github.com/mara
  • 17.
  • 19. Half of the time when companies say they need "AI" what they really need is a SELECT clause with GROUP BY. You're welcome. -- Mat Velloso @matvelloso (Technical Advisor to CTO at Microsoft)
  • 20. Don't do A/B tests. 99% of the time it will not be worth doing.
  • 21. ... conversion rate is 2% ... detecting a relative change of 1% requires an experiment with 12 million users ... -- Simon Jackson (Booking.com)
  • 23. Usual suspects • Non-reproducible experiments and tests. • R hodgepodge in production. • Beliefs hidden as implicits in models.
  • 24.
  • 26. You don't have (enough) data.
  • 27. Make your own data points!
  • 29. Deep learn my *** Do you really need it? Tensorflow! ... ... so distributed deep learning can compress porn on the end device.
  • 30. Hero of the day: Szilard [Deeper than Deep Learning] @DataScienceLA -=- Better than Deep Learning: Gradient Boosting Machines (GBMs) https://github.com/szilard/benchm-ml
  • 31. Spark MLlib's GBM implementation is 10x slower, uses 10x more memory and is buggy/lower accuracy. Total fucking garbage! -- Szilard [Deeper than Deep Learning] @DataScienceLA
  • 32.
  • 33.
  • 35. Q: Why are there so many programmers from Eastern Europe? A: Slavic pessimism. Everything that can go wrong will go wrong. With such a mindset programming comes naturally. -- Martin Sustrik @sustrik (Creator of ZeroMQ, nanomsg, libdill.)
  • 36.
  • 38. you get another machine if you can use one
  • 40. Get cloud agnostic! • AWS still leads the pack by far, • Azure will sell anyway, and all will cry, • Google competes with the cheap and uncooked
  • 41. ETL is #solved OMG • Airflow is an overengineered underperforming nightmare, • metl for source mappings in magnitude, • Mara for generic e-commerce, • night-shift for explicit minimalism.
  • 43. Hero of the day: Mark Litwintschik @marklit82, Summary of the 1.1 Billion Taxi Rides Benchmarks (500 GB uncompressed CSV) https://tech.marksblogg.com
  • 44. Spark
        Setup                  | Query Median | QM per vCPU | Cost/hour
        11 x m3.xlarge + HDFS  | 14,91        | 0,34        | 27,5
        1 x i3.8xlarge + HDFS  | 26,00        | 0,81        | 2,5
        21 x m3.xlarge + HDFS  | 32,00        | 0,38        | 5,67
        5 x m3.xlarge + S3     | 466,50       | 23,33       | 1,35
        3 x Raspberry Pi       | 1738,00      | 144,83      |
        HDFS. RPi = 1/6 VCPU ~100 EUR. Linear scaling.
  • 45. Presto
        Setup                  | Query Median | QM per vCPU | Cost/hour
        50 x n1-standard-4     | 7,00         | 0,04        | 9.50
        21 x m3.xlarge         | 11,50        | 0,14        | 5.67
        10 x n1-standard-4     | 16,00        | 0,36        | 2.09
        1 x i3.8xlarge + HDFS  | 15,00        | 0,47        | 2.50
        5 x m3.xlarge + HDFS   | 51,50        | 0,26        | 1.35
        50 x m3.xlarge + S3    | 43,50        | 0,22        | 13.50
        Workhorse in favour. HDFS. 1 machine. Non-linear scaling.
  • 46. Lazy Evaluation
        Setup                          | Query Median | Cost/hour
        Redshift, 6 x ds2.8xlarge      | 1,91         | 40.80
        BigQuery                       | 2,00         |
        Amazon Athena                  | 6,30         |
        Presto, 50 x n1-standard-4     | 7,00         | 9.50
        Spark, 11 x m3.xlarge + HDFS   | 14,91        | 27.50
        The human cost -- in both terms.
  • 47. One Machine
        Setup                          | Query Median | QM per vCPU | Cost/hour
        ClickHouse                     | 4,21         | 1,05        |
        Elasticsearch tuned            | 13,14        | 3,29        |
        Presto, 1 x i3.8xlarge + HDFS  | 15,00        | 0.47        | 2.50
        Spark, 1 x i3.8xlarge + HDFS   | 26,00        | 0,81        | 2.50
        Vertica                        | 32,80        | 8,20        |
        Elasticsearch                  | 48,89        | 12,22       |
        PSQL 9.5 + cstore_fdw          | 205,00       | 51,25       |
        Intel Core i5 4670K VS i3.8xlarge (32 VCPUs). Desktop example costs <600 EUR.
  • 49. Do you use Google Analytics?
  • 50. 9% of the events are lost to ~all third-party trackers due to adblocking.
  • 51. Sink > Sieve > Sort: ELT aka SQL on flat files with the minimum amount of code written.
  • 52.
  • 54. Who are you? • Lip service provider. • Fake news producer. • Kingmaker. Are you the fool or the grey eminence?
  • 55. Don't believe the hype. HR: good people leave.
  • 57. Will this ever get better? • adblocking, • CPA silver bullets are gone, • conversion & attribution are hard nuts, • FB and GO are not your friends (the 900% on videos), • but CRM is.
  • 58. GDPR • road to hell is paved with good intentions, • it's about the process, matey, • mostly fair, • yes, you have to clean up your mess, • dunno, wouldn't buy programmatic shares [1]. [1] Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
  • 59. Thank you! @soobrosa We're hiring! visuals: @mrogati, @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist, Kalexanderson, Shopify Burst
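
Slide 14 suggests watching how the NPS score distribution changes instead of squinting at the single number. A minimal sketch of that idea follows. The NPS formula (share of 9-10 scores minus share of 0-6 scores) is the standard one; the use of total variation distance to measure the shift and the two survey waves are assumptions made for illustration, not something the talk prescribes.

```python
from collections import Counter


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def distribution(scores):
    """Share of answers for each score 0..10."""
    counts = Counter(scores)
    return [counts.get(s, 0) / len(scores) for s in range(11)]


def distribution_shift(before, after):
    """Total variation distance between two score distributions (0 to 1).

    Assumption: this metric is one possible way to quantify
    "the change of the distro"; the slide does not name a metric.
    """
    p, q = distribution(before), distribution(after)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical survey waves: identical NPS, very different shape.
before = [10] * 30 + [8] * 40 + [6] * 30
after = [9] * 30 + [7] * 40 + [0] * 30

print(nps(before), nps(after))            # same headline number
print(distribution_shift(before, after))  # but the distribution moved
```

Both waves sit on the same side of the 6/7 and 8/9 thresholds, so the step function hides the move from 6s to 0s; the distribution comparison does not.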
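
Slide 19 quotes the "SELECT clause with GROUP BY" line. As a sketch of what that looks like in practice, the snippet below runs one aggregate query with Python's built-in `sqlite3`. The `orders` table, its columns and its rows are hypothetical; SQLite stands in for whatever SQL database is at hand.

```python
import sqlite3

# Hypothetical order rows: (country, day, revenue)
orders = [
    ("DE", "2018-11-01", 40.0),
    ("DE", "2018-11-02", 60.0),
    ("HU", "2018-11-01", 25.0),
    ("US", "2018-11-01", 80.0),
    ("US", "2018-11-02", 10.0),
    ("US", "2018-11-03", 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, day TEXT, revenue REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

# The whole "AI": aggregate per group and sort by the result.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS n_orders, SUM(revenue) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC
    """
).fetchall()

for country, n_orders, revenue in rows:
    print(country, n_orders, revenue)
```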
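
Slide 21 quotes a figure of 12 million users to detect a 1% relative change on a 2% conversion rate. The sketch below shows where numbers of that size come from, using the usual normal-approximation sample size formula for comparing two proportions. The two-sided 5% significance level and 80% power are assumptions; the quote does not state its settings, so the result matches only in order of magnitude.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm of a two-arm A/B test.

    Normal approximation for a two-sided test on two proportions.
    alpha and power defaults are assumed, not taken from the talk.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = users_per_arm(baseline=0.02, relative_lift=0.01)
print(f"{n:,} users per arm, {2 * n:,} in total")

# A 10% relative lift is far cheaper to detect than a 1% one.
print(f"{users_per_arm(0.02, 0.10):,} users per arm for a 10% lift")
```

The required sample grows with the inverse square of the effect size, which is why small lifts on low conversion rates are out of reach for most shops.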
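
Slide 51 describes ELT as SQL on flat files with as little code as possible. One way to read "Sink > Sieve > Sort" is: load the raw file untouched, filter it in SQL, then aggregate and order it in SQL. That mapping is an interpretation, and the CSV content, table name and view name below are hypothetical. SQLite is used only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical flat file; in practice this would be a file on disk.
RAW_CSV = """event,user_id,ts
pageview,u1,2018-11-22T10:00:00
purchase,u1,2018-11-22T10:05:00
pageview,u2,2018-11-22T10:01:00
,u3,2018-11-22T10:02:00
pageview,u4,2018-11-22T10:03:00
purchase,u2,2018-11-22T10:09:00
"""

conn = sqlite3.connect(":memory:")

# Sink: load everything as text, no cleaning in application code.
conn.execute("CREATE TABLE raw_events (event TEXT, user_id TEXT, ts TEXT)")
raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn.executemany(
    "INSERT INTO raw_events VALUES (:event, :user_id, :ts)", raw_rows
)

# Sieve: filtering lives in SQL, as a view over the raw table.
conn.execute(
    "CREATE VIEW clean_events AS SELECT * FROM raw_events WHERE event <> ''"
)

# Sort: aggregate and order, again in SQL.
rows = conn.execute(
    """
    SELECT event, COUNT(*) AS n
    FROM clean_events
    GROUP BY event
    ORDER BY n DESC, event
    """
).fetchall()

print(rows)
```

Keeping the raw table untouched means a wrong filter can be fixed by redefining the view, without reloading the file.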