
Bridging the Gap: from Data Science to Production


A recent but quite common observation in industry is that although data science is widely adopted, many companies struggle to get it into production. Large teams of well-paid data scientists present one fancy model after another to their managers, but their proofs of concept never turn into anything business-relevant. Frustration grows on both sides, among managers and data scientists alike.

In my talk I elaborate on the many reasons why getting data science into production is such a hard nut to crack. I start with a taxonomy of data use cases that makes it easier to assess technical requirements. Building on that, my focus lies on overcoming the two-language problem: Python/R, loved by data scientists, vs. the enterprise-established Java/Scala. From my project experience I present three different solutions, namely 1) migrating to a single language, 2) reimplementation, and 3) use of a framework. The advantages and disadvantages of each approach are presented, and general advice based on the introduced taxonomy is given.

Additionally, my talk addresses organisational problems as well as problems in quality assurance and deployment. Best practices and further references are presented at a high level in order to cover all facets of bringing data science to production.

With my talk I hope to convey the message that breakdowns on the road from data science to production are the rule rather than the exception, so you are not alone. At the end of my talk, you will have a better understanding of why you and your team are struggling and what to do about it.



  1. Bridging the Gap: From Data Science to Production • Florian Wilhelm • EuroPython 2018 @ Edinburgh, 2018-07-25
  2. Special Interests • Mathematical Modelling • Recommendation Systems • Data Science in Production • Python Data Stack • Dr. Florian Wilhelm, Principal Data Scientist @ inovex • @FlorianWilhelm / FlorianWilhelm
  3. IT project house for digital transformation: ‣ Agile Development & Management ‣ Web · UI/UX · Replatforming · Microservices ‣ Mobile · Apps · Smart Devices · Robotics ‣ Big Data & Business Intelligence Platforms ‣ Data Science · Data Products · Search · Deep Learning ‣ Data Center Automation · DevOps · Cloud · Hosting ‣ Training & Coaching • Using technology to inspire our clients. And ourselves. • inovex offices in Karlsruhe · Cologne · Munich · Pforzheim · Hamburg · Stuttgart.
  4. Agenda • The many facets of Data Science to Production: Use-Case · Languages · Organisation · Quality Assurance · Deployment
  5. Use-Case: High-Level Perspective • What does your model pipeline look like? Data → Model f(...) → Results
  6. Use-Case: High-Level Perspective • What is your data source? Variants: Database (PostgreSQL, C*) · Distributed Filesystem (HDFS) · Stream (Kafka) · ... • How is your data accessed? What are the frequency and recency requirements? Batch, Near-Realtime, Realtime, Stream?
  7. Use-Case: High-Level Perspective • What is a model? A model includes: Preprocessing (cleansing, imputation, scaling) · Construction of derived features (EMAs) · Machine Learning Algorithm (Random Forest, ANN) · ... • Is the input of your model raw data or pregenerated features? Does your model have a state?
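The slide's notion of a model — preprocessing, derived features and the learning algorithm bundled into one object — can be sketched as a single deployable pipeline. The following is a minimal illustration assuming scikit-learn is available; the toy data and parameters are made up for demonstration:

```python
# A "model" that bundles imputation, scaling and the learning algorithm
# into one object, so raw data (not pregenerated features) is its input.
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # cleansing / imputation
    ("scale", StandardScaler()),                 # scaling
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Toy training data with a missing value that the pipeline handles itself.
X_train = [[1.0, 2.0], [2.0, float("nan")], [8.0, 9.0], [9.0, 8.0]]
y_train = [0, 0, 1, 1]
model.fit(X_train, y_train)
```

Training and inference both go through the one pipeline object, which keeps preprocessing identical between development and production.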
  8. Use-Case: High-Level Perspective • How is your result stored? Variants: Database (PostgreSQL, C*) · Distributed Filesystem (HDFS) · Stream (Kafka) · On demand (REST API) · ... • What are the frequency and recency requirements? Batch, Near-Realtime, Realtime, Stream?
  9. Use-Case: High-Level Perspective • Our challenge: Data → Interface → Model → Interface → Results, deployed into Production
  10. Use-Case Evaluation • Characteristics of a data use-case (example options per dimension): Delivery: WebService, Stream, Database; Problem Class: Classification, Regression, Recommendation, Explainability?; Volume & Velocity: 10 GB weekly, 1 GB daily, 10k events/s; Inference/Prediction: Batch, Near-Realtime, Realtime, Stream; Technical Conditions: Java stack + Python, On-Premise, AWS Cloud • Note down your specific requirements before thinking about an architecture. There is no one size fits all!
  11. Use-Case: High-Level Perspective • Right from the start: State the requirements of your data use-case · Identify and check data sources · Define interfaces with other teams/departments · Test the whole data flow and infrastructure early on with a dummy model
  12. The many facets of Data Science to Production: Use-Case · Languages · Organisation · Quality Assurance · Deployment
  13. It's an iterative process • Quality assurance for smooth iterations • CRISP-DM cycle: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment, revolving around the Data
  14. Clean Code • Clean code is code that is easy to understand and easy to change. • Resources: Software Design Patterns · SOLID Principles · The Pragmatic Programmer · The Software Craftsman
  15. Continuous Integration • Version, package and manage your artefacts • Provide tests (unit, system, ...) • Automate as much as possible • Embrace processes
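The "provide tests" point can be as lightweight as a pytest-style unit test for a single preprocessing step. The helper `fill_missing` below is a hypothetical example, not from the talk:

```python
# test_preprocessing.py — a minimal unit test for a preprocessing helper.

def fill_missing(values, fill=0.0):
    """Replace None entries so downstream models never see missing values."""
    return [fill if v is None else v for v in values]

def test_fill_missing_replaces_none():
    assert fill_missing([1.0, None, 3.0]) == [1.0, 0.0, 3.0]

def test_fill_missing_keeps_complete_rows():
    assert fill_missing([1.0, 2.0]) == [1.0, 2.0]
```

Running such tests on every commit in CI is what catches silent data-handling regressions before they reach production.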
  16. Monitoring • KPIs and stats: KPIs (CTR, conversions) · Number of requests · Timeouts, delays · Total number of predictions · Runtimes · ... • All monitoring needs to be linked to the currently running version of your model!
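Linking every monitored stat to the running model version can be done with structured log records. A minimal sketch, where the field names and the version constant are illustrative assumptions:

```python
# Structured prediction logging so every stat traces back to a model version.
import json
import time

MODEL_VERSION = "1.4.2"  # hypothetical, injected at deploy time

def log_prediction(request_id, prediction, started_at):
    record = {
        "model_version": MODEL_VERSION,  # links the stat to the model
        "request_id": request_id,
        "prediction": prediction,
        "runtime_ms": round((time.time() - started_at) * 1000, 2),
    }
    print(json.dumps(record))  # in production: ship to a log aggregator
    return record

rec = log_prediction("req-42", 0.87, time.time())
```

With the version embedded in each record, dashboards for request counts, runtimes and KPIs can be filtered per model version after a rollout or rollback.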
  17. Monitoring • Site Reliability Engineering: How Google Runs Production Systems (@Google)
  18. Monitoring Model Stats • Monitor the results of the model's predictions • Example: Response Distribution Analysis: a) working model, b) confused model (Lucas Javier Bernardi, Diagnosing Machine Learning Models)
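For a binary classifier, Response Distribution Analysis can be approximated numerically: a working model pushes scores towards 0 and 1, while a confused model piles them up around 0.5. A rough sketch, where the band thresholds and sample scores are assumptions for illustration:

```python
# Flag a "confused" model by the share of scores in the undecided middle band.
def confusion_share(scores, low=0.4, high=0.6):
    """Fraction of prediction scores falling into the middle band."""
    middle = [s for s in scores if low <= s <= high]
    return len(middle) / len(scores)

working = [0.02, 0.05, 0.9, 0.97, 0.99, 0.03]   # confident, bimodal scores
confused = [0.45, 0.5, 0.52, 0.48, 0.55, 0.6]   # piled up around 0.5

print(confusion_share(working), confusion_share(confused))
```

Tracking such a statistic over time, per model version, gives an early warning before offline metrics or KPIs degrade visibly.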
  19. A/B Tests • Feedback for your model • Always compare your "improved" model to the current baseline • Allows comparing two models not only on offline metrics but also on online metrics and KPIs • Also possible to adjust hyperparameters with online feedback, e.g. multi-armed bandits
  20. A/B Tests • Technical requirements: Versioning of your models to allow linking them to test groups · Deploying and serving several models at the same time independently (needed for fast rollback anyway!) · Tracking the results of a given model up to the point of facing the customer
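Linking users to test groups is often done with deterministic hashing, so a user always sees the same model variant without any lookup table. A minimal sketch; the variant names and salt are hypothetical:

```python
# Deterministic A/B assignment: the same user always gets the same variant.
import hashlib

def assign_variant(user_id, variants=("model_v1", "model_v2"), salt="ab-test-1"):
    """Map a user id to a model variant via a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("alice"), assign_variant("bob"))
```

Changing the salt reshuffles all users into new buckets, which is how a fresh experiment is started without resetting any state.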
  21. The many facets of Data Science to Production: Use-Case · Languages · Organisation · Quality Assurance · Deployment
  22. Organisation of Teams • The Wall of Confusion: Developers (Code · Tests · Releases · Version Control · Continuous Integration · Features) throw "Release v1.2.3" over the wall to Operations (Packaging · Deployment · Lifecycle · Configuration · Security · Monitoring) (Sebastian Neubauer, There should be one obvious way to bring Python into production)
  23. Different Cultures/Thinking • Wrong approach! • An especially dangerous separation for data products/features • Speed and time to market are important, thus "not my job" thinking hurts • "I made a great model" vs. "We made a great data product"
  24. Organisation of Teams • Overcoming the Wall of Confusion: Continuous Delivery, DevOps
  25. Heterogeneous Teams • How to bring data scientists into DevOps? • Pure teams of data scientists often struggle to get anything into production • As a minimum complement, software and data engineers are needed (2-3 engineers per data scientist) • Optionally a data product manager as well as a UI/UX expert, if necessary
  26. Organisation around Features • Responsibility with vertical teams: Fully autonomous teams · End-to-end responsibility for a feature · Works well with agile methods like Scrum · Faster delivery and less politics
  27. The many facets of Data Science to Production: Use-Case · Languages · Organisation · Quality Assurance · Deployment
  28. Programming Languages • The Two-Language Problem • Industry: Java stack common, also Scala · Strongly typed · Emphasis on robustness and edge cases · Industrial standards for deployment • Science: Often Python and R · Dynamically typed, since it is easier to get the job done · Emphasis on fancy methods and results · "Runs on my machine"
  29. Language Problem • Solution: Select one to rule them all! • Having a single language reduces the complexity of deployment • But it takes implementation effort, since one ecosystem is abandoned entirely
  30. Language Problem • Solution: Python in production • Especially easy for batch prediction use-cases • If a web service is needed, Flask is a viable option • Scale horizontally during prediction and use a big bare-metal node for training the model • Tap into the Hadoop world by using PySpark, PyHive etc. • Consider isolated containers using Docker
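Serving a model behind a small Flask web service can be sketched in a few lines. The route, payload shape and the stand-in model below are illustrative assumptions, not from the talk:

```python
# A minimal Flask scoring service; the model object only needs `predict`.
from flask import Flask, jsonify, request

app = Flask(__name__)

class DummyModel:
    """Stand-in for a real model loaded once at startup."""
    def predict(self, features):
        return [sum(row) for row in features]  # placeholder logic

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    result = model.predict(payload["features"])
    return jsonify(predictions=result)

# In production, run behind a WSGI server instead of the dev server, e.g.:
#   gunicorn app:app
```

Because the model is loaded once at startup, the per-request cost is only the prediction itself; horizontal scaling then means running more instances of this process.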
  31. Language Problem • Solution: PoCs in Python/R, rewrite in Java for production • Lots of effort and slow • Iterations and new features are hard to implement • Reproducing bugs is cumbersome • Pro: everyone gets what they want • Worst-case scenario
  32. Language Problem • Solution: Exchangeable model formats • Works great in theory • Limited functionality and flexibility • No guarantee that the same model description will be interpreted the same by two different implementations • Preprocessing / feature generation not included
  33. Language Problem • Solution: Frameworks • Various language bindings allow developing in Python/R and running on the Java stack • Be aware whether the framework also covers feature generation • Ease of use at the cost of flexibility
  34. Two-Language Problem • Three concepts of dealing with it: 1) Single Language, 2) Reimplement, 3) Frameworks
  35. The many facets of Data Science to Production: Use-Case · Languages · Organisation · Quality Assurance · Deployment
  36. Deployment • Deployment heavily depends on the chosen approach! • Still, some software engineering principles apply, like Continuous Integration or even Continuous Delivery
  37. Technical Debt in ML Pipelines • Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems
  38. Deployment • General principles: Versioning & packaging, defined processes, quality management · Keep the development and production environments as similar as possible · Automation is a must, avoid human error! · Isolated and controllable environments are a great idea, e.g. Docker
  39. Best of Google's Rules for ML Engineering • Rule #1: Don't be afraid to launch a product without machine learning. • Rule #2: First, design and implement metrics. • Rule #4: Keep the first model simple and get the infrastructure right. • Rule #5: Test the infrastructure independently from the machine learning. • Rule #9: Detect problems before exporting models. • Rule #11: Give feature columns owners and documentation. • Rule #13: Choose a simple, observable and attributable metric for your first objective. • Rule #14: Starting with an interpretable model makes debugging easier. • Rule #16: Plan to launch and iterate. • Rule #24: Measure the delta between models. • Rule #27: Try to quantify observed undesirable behavior ("measure first, optimize second"). • Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible. • Most of the problems are engineering problems!
  40. Example: Continuous Integration with devpi
  41. Example: Python Package/Distribution with PyScaffold • Easy and sane Python packaging • Proper versioning of every commit • Git integration, e.g. pre-commit • Declarative configuration with setup.cfg • Follows community standards • Many extensions available • $> pip install pyscaffold • $> putup my_project
  42. Key Learnings • Data Science to Production: Dependent on your use-case, there is no one-size-fits-all! • Think early on about QA • DevOps culture & team responsibility • Choose a framework or a single language to overcome the Two-Language Problem • Embrace processes & automation • Production is NOT an afterthought!
  43. Thank you! • Florian Wilhelm, Principal Data Scientist, inovex GmbH, Schanzenstraße 6-20, Kupferhütte 1.13, 51063 Cologne, Germany