SlideShare a Scribd company logo
1 of 55
Download to read offline
From Lab to Factory
LessonslearneddevelopingDataProducts
Keynote PyCon Colombia
Feb 2017
peadarcoyle@googlemail.com
All opinions my own
Who am I?
Data Scientist at Elevate (Recruitment Startup) and Author
@springcoil
About 5 years in Machine Learning/Data Science
OSS contributor (mostly PyMC3)
Built Ad-hoc analysis, Data Products and Data Pipelines at
Why talk about this?
Data Moats
Roadmap
What is Data Science?
Type of Data Scientists
Deploying Data Science (tech/culture)
Success stories and recommendations
Bigger idea (Change)
What IS a Data Scientist?
There's a Data Science Spectrum
HT: Sean J. Taylor (Facebook Research Scientist)
Data Science to Value
Data doesn't do anything!
People need to make decisions (strategic/ tactical)
Or the product needs to alter someones decision making (Spotify
Discover Weekly, Amazon Recommendations, etc)
(Taken from Josh Wills talk - Lab to Factory 2014)
or https://hbr.org/2013/04/two-departments-for-data-succe
Expectations and reality
Data is the new oil...
But. Data isn't a product. It needs to be turned into one!
Data Science is hard
Data and technical problems
Skill set (emotional and tech)
Machine Learning Ops
Data Science projects are risky!
Many stakeholders think that data science is just an engineering problem,
but research is high risk and high reward
De-risking the project - how? Send me examples :)
http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-
fail/
https://github.com/ianozsvald/data_science_delivered
Success with Data Science
1. People
2. Ideas
3. Things
Source: USAF
R and D needs cultural support
Your culture needs to avoid the HiPPO (highest paid
persons opinion) to be data-informed or data-
driven.
Good cultures
Data democratization
Clear objectives and metrics against objectives
Financial metrics don't always win - sometimes brand wins
See by Martin Goodsonhttp://bit.ly/cultureanddata
Some data scientists
do experiments and
build prototypes
Production
(HT: The Yhat people - )www.yhathq.com
Deploying Data Products is hard
Monitoring and Alerting
Deployment
Real world data is tricky (training/ testing)
Interpretability (explaining fraud detection/ad tech)
Feature engineering
Models decay over time
What projects work?
Explain existing data (visualization!)
Automate repetitive/ slow processes
Augment data to make new data (Search engines, ML models)
Predict the future (do something more accurately than gut feel )
Simulate using statistics :) (Sports analytics models)
"You need data first" - Peadar Coyle
Copying and pasting PDF/PNG data
Getting data in some areas is hard!!
Some tools for web data extraction
Messy APIs without documentation :(
Visualise ALL THE THINGS!!
(Relay foods dataset - HT Greg Reda)
Consumer behaviour at a Fast Food Restaurant per year in the USA
Augmenting data and using API's
Machine Learning
Production data/ Real world
HT: Andrew Ng
Everyone ETLS
Only 1% of your time will be spent modelling
Data pipelines and your infrastructure matters -
Tools such as Spark, Luigi, Airflow or Dask are important
Monitoring of pipelines. Scalability. Machine provision
Eoin Brazil Talk
Success Story (1)
1. Ad Tech project (Media company) -
original process took 10 hours per week of analyst time.
2. Broke down process and produced simple predictive model
3. Data type challenges (unicode in Python 2.7) and business
expectations
4. Now takes less than 30 minutes each week.
Success Story (2)
Elevate is a Total Talent Management platform (tools for streamlining
recruitment)
Several models in production - microservices (recommenders, job type
prediction)
Data pipeline (Luigi) necessary for producing features
Lots of challenges about Docker, time, updated data.
Works: MyPy (Python 3.5), code review, testing -
Next: Moving to PySpark for ETL, more pair programming
http://hypothesis.works/
Failure
1. Lack of support from tech teams
2. Lack of clearly defined customer
3. No culture of DevOps
4. Management had no clear vision
Lessons learned from Lab to Factory
1. Monitoring - data products need evaluation in production
2. Lack of a shared language between software engineers and data
scientists. Pair programming/ Code review are some solutions.
3. To help data scientists and analysts succeed your business needs to be
prepared to invest in tooling. (Pipelines, DevOps, Reproducible)
4. Often you're working with other teams who use different languages -
so micro services can be a good idea (example C#.net app and Python
microservice)
How to deploy a model?
Palladium (Otto Group)
Azure
Flask Microservice (DIY) and Docker
Dev Ops
We need each other!
Use small data where possible!!
Small problems with clean data are more important - (Ian Ozsvald)
Amazon machine with many Xeons and 244GB of RAM is less than 3
euros per hour. - (Ian Ozsvald)
Blaze, Xray, Dask, Ibis, etc etc -
"The mean size of a cluster will remain 1" - Matt Rocklin
PyData Bikeshed
Focus on the real problem
Closing remarks
Dirty data stops projects
Change happens, we need to help businesses adapt or be agile
There are some good projects like Icy, Luigi, etc for transforming data
and improving data extraction
- Google paper
on rule for building ML systems
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Closing Remarks (2)
Stakeholder management is a challenge too
Send me your dirty data and data deployment stories :)
www.peadarcoyle.com
https://leanpub.com/interviewswithdatascientists
A New World
Simulate: Rugby games with MCMC
(PyMC3)
What is the Data Science process?
Obtain
Scrub
Explore
Model
Interpret
Communicate (or Deploy)
Some NLP on the Interviews!
What do Data Scientists talk about?
Based on my Interview series!Dataconomy
A famous 'data product' - Recommendation engines
Credits
http://cdn.yourarticlelibrary.com/wp-content/uploads/2013/12/080.jpg
https://www.flickr.com/photos/82066314@N06/7624218468/sizes/z/
From Lab to Factory: Creating value with data

More Related Content

What's hot

Data science services YLS
Data science services YLSData science services YLS
Data science services YLS
Dima Semchuk
 
Ilkay Altintas: Kepler
Ilkay Altintas: KeplerIlkay Altintas: Kepler
Ilkay Altintas: Kepler
David LeBauer
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
Domino Data Lab
 

What's hot (20)

H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache Spark
 
Data science services YLS
Data science services YLSData science services YLS
Data science services YLS
 
Ilkay Altintas: Kepler
Ilkay Altintas: KeplerIlkay Altintas: Kepler
Ilkay Altintas: Kepler
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
 
cyREST: Cytoscape as a Service
cyREST: Cytoscape as a ServicecyREST: Cytoscape as a Service
cyREST: Cytoscape as a Service
 
Hadoop
HadoopHadoop
Hadoop
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Towards the Cytoscape Cyberinfrastructure
Towards the Cytoscape CyberinfrastructureTowards the Cytoscape Cyberinfrastructure
Towards the Cytoscape Cyberinfrastructure
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic Perspectives
 
H2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsH2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine Prognostics
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 

Viewers also liked

Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
Deepak Narayanan
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CanSecWest
 

Viewers also liked (20)

Strux engels lezen week 26
Strux engels lezen week 26Strux engels lezen week 26
Strux engels lezen week 26
 
MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索
 
[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It Matters[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It Matters
 
Tips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPT
 
Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using Heron
 
Let’s grow
Let’s growLet’s grow
Let’s grow
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
 
The Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android AppThe Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android App
 
10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVault10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVault
 
Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017
 
スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条
 
Diagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderDiagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounder
 
Global Education and Skills Forum 2017 - Educating Global Citizens
Global Education and Skills Forum  2017 -  Educating Global CitizensGlobal Education and Skills Forum  2017 -  Educating Global Citizens
Global Education and Skills Forum 2017 - Educating Global Citizens
 
Can social media change health behavior?
Can social media change health behavior?Can social media change health behavior?
Can social media change health behavior?
 
一般的なチートの手法と対策について
一般的なチートの手法と対策について一般的なチートの手法と対策について
一般的なチートの手法と対策について
 
Cancer Care in a Post Truth World
Cancer Care in a Post Truth World Cancer Care in a Post Truth World
Cancer Care in a Post Truth World
 

Similar to From Lab to Factory: Creating value with data

Similar to From Lab to Factory: Creating value with data (20)

From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into value
 
Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
HadoopWorkshopJuly2014
HadoopWorkshopJuly2014HadoopWorkshopJuly2014
HadoopWorkshopJuly2014
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teams
 
On Big Data
On Big DataOn Big Data
On Big Data
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
Ornl IT
Ornl ITOrnl IT
Ornl IT
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
Neurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedNeurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons Learned
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdf
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 

More from Peadar Coyle

More from Peadar Coyle (8)

Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in Python
 
Variational Inference in Python
Variational Inference in PythonVariational Inference in Python
Variational Inference in Python
 
Consulting Skills for Data Scientists
Consulting Skills for Data ScientistsConsulting Skills for Data Scientists
Consulting Skills for Data Scientists
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
Big Data and Internet of Things for Managers
Big Data and Internet of Things for ManagersBig Data and Internet of Things for Managers
Big Data and Internet of Things for Managers
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
 
Probabilistic Programming in Python
Probabilistic Programming in PythonProbabilistic Programming in Python
Probabilistic Programming in Python
 
How can Data Science benefit your business?
How can Data Science benefit your business?How can Data Science benefit your business?
How can Data Science benefit your business?
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

From Lab to Factory: Creating value with data

  • 1. From Lab to Factory LessonslearneddevelopingDataProducts Keynote PyCon Colombia Feb 2017 peadarcoyle@googlemail.com All opinions my own
  • 2. Who am I? Data Scientist at Elevate (Recruitment Startup) and Author @springcoil About 5 years in Machine Learning/Data Science OSS contributor (mostly PyMC3) Built Ad-hoc analysis, Data Products and Data Pipelines at
  • 3.
  • 6. Roadmap What is Data Science? Type of Data Scientists Deploying Data Science (tech/culture) Success stories and recommendations Bigger idea (Change)
  • 7. What IS a Data Scientist?
  • 8. There's a Data Science Spectrum HT: Sean J. Taylor (Facebook Research Scientist)
  • 9. Data Science to Value Data doesn't do anything! People need to make decisions (strategic/ tactical) Or the product needs to alter someones decision making (Spotify Discover Weekly, Amazon Recommendations, etc)
  • 10. (Taken from Josh Wills talk - Lab to Factory 2014) or https://hbr.org/2013/04/two-departments-for-data-succe
  • 11.
  • 13. Data is the new oil... But. Data isn't a product. It needs to be turned into one!
  • 14. Data Science is hard Data and technical problems Skill set (emotional and tech) Machine Learning Ops
  • 15. Data Science projects are risky! Many stakeholders think that data science is just an engineering problem, but research is high risk and high reward De-risking the project - how? Send me examples :) http://www.martingoodson.com/ten-ways-your-data-project-is-going-to- fail/ https://github.com/ianozsvald/data_science_delivered
  • 16. Success with Data Science 1. People 2. Ideas 3. Things Source: USAF
  • 17.
  • 18. R and D needs cultural support
  • 19. Your culture needs to avoid the HiPPO (highest paid persons opinion) to be data-informed or data- driven.
  • 20. Good cultures Data democratization Clear objectives and metrics against objectives Financial metrics don't always win - sometimes brand wins See by Martin Goodsonhttp://bit.ly/cultureanddata
  • 21. Some data scientists do experiments and build prototypes
  • 23. (HT: The Yhat people - )www.yhathq.com
  • 24. Deploying Data Products is hard Monitoring and Alerting Deployment Real world data is tricky (training/ testing) Interpretability (explaining fraud detection/ad tech) Feature engineering Models decay over time
  • 25. What projects work? Explain existing data (visualization!) Automate repetitive/ slow processes Augment data to make new data (Search engines, ML models) Predict the future (do something more accurately than gut feel ) Simulate using statistics :) (Sports analytics models)
  • 26. "You need data first" - Peadar Coyle Copying and pasting PDF/PNG data Getting data in some areas is hard!! Some tools for web data extraction Messy APIs without documentation :(
  • 27. Visualise ALL THE THINGS!! (Relay foods dataset - HT Greg Reda) Consumer behaviour at a Fast Food Restaurant per year in the USA
  • 28.
  • 29. Augmenting data and using API's
  • 31. Production data/ Real world HT: Andrew Ng
  • 32. Everyone ETLS Only 1% of your time will be spent modelling Data pipelines and your infrastructure matters - Tools such as Spark, Luigi, Airflow or Dask are important Monitoring of pipelines. Scalability. Machine provision Eoin Brazil Talk
  • 33. Success Story (1) 1. Ad Tech project (Media company) - original process took 10 hours per week of analyst time. 2. Broke down process and produced simple predictive model 3. Data type challenges (unicode in Python 2.7) and business expectations 4. Now takes less than 30 minutes each week.
  • 34. Success Story (2) Elevate is a Total Talent Management platform (tools for streamlining recruitment) Several models in production - microservices (recommenders, job type prediction) Data pipeline (Luigi) necessary for producing features Lots of challenges about Docker, time, updated data. Works: MyPy (Python 3.5), code review, testing - Next: Moving to PySpark for ETL, more pair programming http://hypothesis.works/
  • 35. Failure 1. Lack of support from tech teams 2. Lack of clearly defined customer 3. No culture of DevOps 4. Management had no clear vision
  • 36.
  • 37. Lessons learned from Lab to Factory 1. Monitoring - data products need evaluation in production 2. Lack of a shared language between software engineers and data scientists. Pair programming/ Code review are some solutions. 3. To help data scientists and analysts succeed your business needs to be prepared to invest in tooling. (Pipelines, DevOps, Reproducible) 4. Often you're working with other teams who use different languages - so micro services can be a good idea (example C#.net app and Python microservice)
  • 38. How to deploy a model? Palladium (Otto Group) Azure Flask Microservice (DIY) and Docker
  • 40.
  • 41. We need each other!
  • 42. Use small data where possible!! Small problems with clean data are more important - (Ian Ozsvald) Amazon machine with many Xeons and 244GB of RAM is less than 3 euros per hour. - (Ian Ozsvald) Blaze, Xray, Dask, Ibis, etc etc - "The mean size of a cluster will remain 1" - Matt Rocklin PyData Bikeshed
  • 43. Focus on the real problem
  • 44. Closing remarks Dirty data stops projects Change happens, we need to help businesses adapt or be agile There are some good projects like Icy, Luigi, etc for transforming data and improving data extraction - Google paper on rule for building ML systems http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
  • 45. Closing Remarks (2) Stakeholder management is a challenge too Send me your dirty data and data deployment stories :) www.peadarcoyle.com https://leanpub.com/interviewswithdatascientists
  • 47.
  • 48. Simulate: Rugby games with MCMC (PyMC3)
  • 49. What is the Data Science process? Obtain Scrub Explore Model Interpret Communicate (or Deploy)
  • 50. Some NLP on the Interviews!
  • 51. What do Data Scientists talk about? Based on my Interview series!Dataconomy
  • 52. A famous 'data product' - Recommendation engines
  • 53.