Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

From Lab to Factory: Creating value with data

544 views

Published on

One of the biggest challenges in Data Science, is deploying Machine Learning models. There are cultural and technological challenges and I'll explain these and share some insights/ solutions.

Published in: Technology
  • Be the first to comment

From Lab to Factory: Creating value with data

  1. 1. From Lab to Factory LessonslearneddevelopingDataProducts Keynote PyCon Colombia Feb 2017 peadarcoyle@googlemail.com All opinions my own
  2. 2. Who am I? Data Scientist at Elevate (Recruitment Startup) and Author @springcoil About 5 years in Machine Learning/Data Science OSS contributor (mostly PyMC3) Built Ad-hoc analysis, Data Products and Data Pipelines at
  3. 3. Why talk about this?
  4. 4. Data Moats
  5. 5. Roadmap What is Data Science? Type of Data Scientists Deploying Data Science (tech/culture) Success stories and recommendations Bigger idea (Change)
  6. 6. What IS a Data Scientist?
  7. 7. There's a Data Science Spectrum HT: Sean J. Taylor (Facebook Research Scientist)
  8. 8. Data Science to Value Data doesn't do anything! People need to make decisions (strategic/ tactical) Or the product needs to alter someones decision making (Spotify Discover Weekly, Amazon Recommendations, etc)
  9. 9. (Taken from Josh Wills talk - Lab to Factory 2014) or https://hbr.org/2013/04/two-departments-for-data-succe
  10. 10. Expectations and reality
  11. 11. Data is the new oil... But. Data isn't a product. It needs to be turned into one!
  12. 12. Data Science is hard Data and technical problems Skill set (emotional and tech) Machine Learning Ops
  13. 13. Data Science projects are risky! Many stakeholders think that data science is just an engineering problem, but research is high risk and high reward De-risking the project - how? Send me examples :) http://www.martingoodson.com/ten-ways-your-data-project-is-going-to- fail/ https://github.com/ianozsvald/data_science_delivered
  14. 14. Success with Data Science 1. People 2. Ideas 3. Things Source: USAF
  15. 15. R and D needs cultural support
  16. 16. Your culture needs to avoid the HiPPO (highest paid persons opinion) to be data-informed or data- driven.
  17. 17. Good cultures Data democratization Clear objectives and metrics against objectives Financial metrics don't always win - sometimes brand wins See by Martin Goodsonhttp://bit.ly/cultureanddata
  18. 18. Some data scientists do experiments and build prototypes
  19. 19. Production
  20. 20. (HT: The Yhat people - )www.yhathq.com
  21. 21. Deploying Data Products is hard Monitoring and Alerting Deployment Real world data is tricky (training/ testing) Interpretability (explaining fraud detection/ad tech) Feature engineering Models decay over time
  22. 22. What projects work? Explain existing data (visualization!) Automate repetitive/ slow processes Augment data to make new data (Search engines, ML models) Predict the future (do something more accurately than gut feel ) Simulate using statistics :) (Sports analytics models)
  23. 23. "You need data first" - Peadar Coyle Copying and pasting PDF/PNG data Getting data in some areas is hard!! Some tools for web data extraction Messy APIs without documentation :(
  24. 24. Visualise ALL THE THINGS!! (Relay foods dataset - HT Greg Reda) Consumer behaviour at a Fast Food Restaurant per year in the USA
  25. 25. Augmenting data and using API's
  26. 26. Machine Learning
  27. 27. Production data/ Real world HT: Andrew Ng
  28. 28. Everyone ETLS Only 1% of your time will be spent modelling Data pipelines and your infrastructure matters - Tools such as Spark, Luigi, Airflow or Dask are important Monitoring of pipelines. Scalability. Machine provision Eoin Brazil Talk
  29. 29. Success Story (1) 1. Ad Tech project (Media company) - original process took 10 hours per week of analyst time. 2. Broke down process and produced simple predictive model 3. Data type challenges (unicode in Python 2.7) and business expectations 4. Now takes less than 30 minutes each week.
  30. 30. Success Story (2) Elevate is a Total Talent Management platform (tools for streamlining recruitment) Several models in production - microservices (recommenders, job type prediction) Data pipeline (Luigi) necessary for producing features Lots of challenges about Docker, time, updated data. Works: MyPy (Python 3.5), code review, testing - Next: Moving to PySpark for ETL, more pair programming http://hypothesis.works/
  31. 31. Failure 1. Lack of support from tech teams 2. Lack of clearly defined customer 3. No culture of DevOps 4. Management had no clear vision
  32. 32. Lessons learned from Lab to Factory 1. Monitoring - data products need evaluation in production 2. Lack of a shared language between software engineers and data scientists. Pair programming/ Code review are some solutions. 3. To help data scientists and analysts succeed your business needs to be prepared to invest in tooling. (Pipelines, DevOps, Reproducible) 4. Often you're working with other teams who use different languages - so micro services can be a good idea (example C#.net app and Python microservice)
  33. 33. How to deploy a model? Palladium (Otto Group) Azure Flask Microservice (DIY) and Docker
  34. 34. Dev Ops
  35. 35. We need each other!
  36. 36. Use small data where possible!! Small problems with clean data are more important - (Ian Ozsvald) Amazon machine with many Xeons and 244GB of RAM is less than 3 euros per hour. - (Ian Ozsvald) Blaze, Xray, Dask, Ibis, etc etc - "The mean size of a cluster will remain 1" - Matt Rocklin PyData Bikeshed
  37. 37. Focus on the real problem
  38. 38. Closing remarks Dirty data stops projects Change happens, we need to help businesses adapt or be agile There are some good projects like Icy, Luigi, etc for transforming data and improving data extraction - Google paper on rule for building ML systems http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
  39. 39. Closing Remarks (2) Stakeholder management is a challenge too Send me your dirty data and data deployment stories :) www.peadarcoyle.com https://leanpub.com/interviewswithdatascientists
  40. 40. A New World
  41. 41. Simulate: Rugby games with MCMC (PyMC3)
  42. 42. What is the Data Science process? Obtain Scrub Explore Model Interpret Communicate (or Deploy)
  43. 43. Some NLP on the Interviews!
  44. 44. What do Data Scientists talk about? Based on my Interview series!Dataconomy
  45. 45. A famous 'data product' - Recommendation engines
  46. 46. Credits http://cdn.yourarticlelibrary.com/wp-content/uploads/2013/12/080.jpg https://www.flickr.com/photos/82066314@N06/7624218468/sizes/z/

×