
Machine learning model to production

How to bring a machine learning script into a production environment

Published in: Data & Analytics


  1. prototype -> production. Make your ML app rock
  2. Agenda • Problems with the current workflow • Interactive exploration to enterprise API • Data science platforms • My recommendation
  3. About me @geoHeil • Data Scientist at T-Mobile Austria • Business Informatics at Vienna University of Technology • Built a predictive startup (predictr.eu) • Data science projects at university
  4. Ed, 41: professional developer, cares about testing, CI, stability. John, 28: PhD, cool kid, wants to build an awesome app
  5. Simple? Goal: a smart application improves business processes. John's smart app, Ed's business process
  6. Simple? Goal: a smart application improves business processes. Ed's business process
  7. ML modes: similarity of environments? Exploration • flexibility • easy to use • reusability. Production • performance • scalability • monitoring • API. Interaction required to improve the business process
  8. From https://www.youtube.com/watch?v=R-6nAwLyWCI: flexibility vs. performance
  9. Stack-up. Problems • Moving to production means redeveloping from scratch. Solutions • Notebooks as API
  10. Prototype problem at the current project: easy move to the JVM? Consultant: R. Me: Python. Production: JVM with native C dependencies
  11. Stack-up. Problems • Moving to production means redeveloping from scratch • Enterprise operations handle JVM only. Solutions • Notebooks as API • Redevelop from scratch
  12. Prototype problem at the current project: easy move to the JVM? Consultant: R. Me: Python. Production: JVM with native C dependencies
  13. Data exchange possibilities (API): pickle (Python only), Hadoop file formats (Avro/Parquet), Thrift, Protobuf, message queue, REST
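Of the exchange options above, pickle is the quickest but also the most limiting: only another Python process can read the result, which is exactly why the portable formats (Avro/Parquet, Protobuf, REST) are listed next to it. A minimal sketch of the pickle round trip, using a plain dict as a stand-in for a fitted model:

```python
import pickle

# Stand-in for a fitted model (e.g. a dict of coefficients);
# pickle works the same way for scikit-learn estimators.
model = {"intercept": 0.5, "coef": [1.2, -0.7]}

# Serialize to bytes (pickle.dump would write to a file instead) ...
blob = pickle.dumps(model)

# ... and restore it in another Python process.
restored = pickle.loads(blob)
assert restored == model
```

The round trip is lossless, but the bytes are a Python-only wire format, so a JVM backend cannot consume them directly.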
  14. Stack-up. Problems • Moving to production means redeveloping from scratch • Enterprise operations handle JVM only. Solutions • Notebooks as API • Use analytics via an API
  15. "Big data starts at 20 GB." Want to use a fancy Hadoop cluster? We can buy a server with 6 TB of RAM
  16. 3 types of big data: 1. fits in memory (6 TB of RAM …) 2. raw data too large for memory, but aggregated data works well 3. too big => the ML needs to be big as well
  17. Stack-up. Problems • Moving to production means redeveloping from scratch • Enterprise operations handle JVM only • Inflexible big data tools. Solutions • Notebooks as API • Use analytics via an API • Your data is not "really big" and still fits in memory
  18. "Security is not my job." Disagree / infoSec
  19. Stack-up. Problems • Moving to production means redeveloping from scratch • Enterprise operations handle JVM only • Inflexible big data tools • Security not taken care of. Solutions • Notebooks as API • Use analytics via an API • Your data is not "really big" and still fits in memory -> keep using Python / R / notebooks • Kerberized Hadoop cluster :(
  20. Exploration to enterprise API
  21. Small data & R prototype: separation of concerns.
  22. Startup data science: predicting cash flows • Custom backend (JVM) • Data science via an API (OpenCPU / R) • Partly in the backend (Renjin)
  23. Other possibilities • JNI (Java Native Interface) :( • JNA (Java Native Access) • Rkafka (did not have an MQ in the infrastructure) • Custom service (REST call) to a JNA-enabled server (too costly)
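The OpenCPU route above keeps R out of the JVM backend entirely: the backend just issues HTTP calls. OpenCPU exposes R package functions at `/ocpu/library/<pkg>/R/<function>`; a sketch, where the host name is a placeholder:

```python
# Sketch of calling an R function through OpenCPU's HTTP API.
# The server host below is a placeholder for illustration.

def opencpu_url(host: str, package: str, function: str) -> str:
    """Build the OpenCPU endpoint for invoking an R function via POST."""
    return f"http://{host}/ocpu/library/{package}/R/{function}"

url = opencpu_url("r-server:8004", "stats", "rnorm")
# A real call would then be e.g.: requests.post(url, data={"n": 10})
# url == "http://r-server:8004/ocpu/library/stats/R/rnorm"
```

Any language that speaks HTTP can consume the R model this way, which is the separation of concerns the previous slides argue for.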
  24. Music streaming: anomaly detection, big data
  25. Source: https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
  26. Project facts • We were working from an MS-SQL backup (600 GB) • Spark + Parquet compressed it to 3 GB • No cluster during development of the project, only laptops + a 60 GB RAM server • Most of the time was spent in garbage collection (15 seconds on a real cluster, 17 minutes on a laptop)
  27. Data science stack • Type 2 big data (aggregation allows for local in-memory processing in Python/R) • Spark as a (REST) API: POST jobserver:port/jars/myjob, POST jobserver:port/contexts/context_for_myapp, POST "paramKey = paramValue" jobserver:port/myjob?appName=myjob&classPath=path.to.main&context=context_for_myapp • Aggregated data fed to R via a REST API
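The three spark-jobserver calls on the slide (upload jar, create context, submit job) are plain HTTP. A hedged sketch of building the job-submission URL; host, port, app and class names are placeholders taken from the slide, and on a real installation jobs are typically submitted against the server's `/jobs` endpoint:

```python
from urllib.parse import urlencode

# Placeholder host/port from the slide's "jobserver:port".
BASE = "http://jobserver:8090"

def job_url(app_name: str, class_path: str, context: str) -> str:
    """URL for submitting a job to spark-jobserver."""
    query = urlencode({"appName": app_name,
                       "classPath": class_path,
                       "context": context})
    return f"{BASE}/jobs?{query}"

url = job_url("myjob", "path.to.main", "context_for_myapp")
# The slide's three steps then become, e.g. with the requests library:
#   requests.post(f"{BASE}/jars/myjob", data=open("myjob.jar", "rb"))
#   requests.post(f"{BASE}/contexts/context_for_myapp")
#   requests.post(url, data='paramKey = "paramValue"')
```

Keeping the Spark aggregation behind this REST facade is what lets the R/OpenCPU layer stay independent of the cluster.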
  28. Frontend, backend, data science: SQL aggregation / spark-jobserver, Spark cluster :) laptop, R via OpenCPU, Spark aggregation & R as API, REST call, API incompatibilities :(
  29. Data science platform: can the architecture be simplified?
  30. Cloud solutions • Notebook as API: Databricks workflows / Domino Data Lab • Google, Microsoft, Amazon • Several data science platform startups: BigML, Dataiku, ... (+) cluster deploys in one click (+) some integrate notebooks well (-) control over data?
  31. What is missing? Custom models, control over data, testing, CI, A-B testing, retraining
  32. Several solutions, same problem
  33. Let's try lean: back to the Spark architecture overview …
  34. Missing: API layer / model deployment
  35. Hydrospheredata/mist: notebook, CI -> e2e
  36. CI & testing +1, notebook e2e +1. But again: a lot of moving parts, highly experimental
  37. Seldon: e2e ML platform for the enterprise
  38. Seldon architecture: K8s for high availability, hot model deployments, A-B testing, holdout group, containerized microservices conforming to Seldon's REST API. Overall very good, but: outdated Python 2.x, Kubernetes mandatory
  39. In an ideal world: what I dream of …
  40. Wish list • Flexibility to experiment (notebooks) on big enough hardware • Make these easily available as an API in a pre-production environment to get quick business feedback • A-B testing, holdout group, containers • More of a "developer" mindset (testing, CI, security) for data scientists
  41. Reality is different: how I will move forward with my current project
  42. Write a JVM-based custom backend which operations and existing developers can maintain. Apparently this is a better fit than a turnkey platform solution.
  43. How to integrate Spark? Spark deployment modes revisited ...
  44. Spark deployment scenarios • Batch / bulk prediction in the cluster -> job-scheduling overhead • Long-running Spark application? (SJS, pipeline persistence -> local Spark context) • Predictive service without Spark • PMML? jpmml/sklearn2pmml • Scoring without Spark -> MLeap and SPARK-16365
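The "predictive service without Spark" scenario above can be as small as exporting the fitted parameters from the training pipeline and re-implementing scoring as a plain function. A minimal sketch; the coefficients and names here are illustrative assumptions, not the project's actual model:

```python
import math

# Illustrative: coefficients exported from an offline training run
# (e.g. extracted from a fitted Spark or scikit-learn logistic regression).
INTERCEPT = -1.5
WEIGHTS = [0.8, 2.1, -0.4]

def predict_proba(features: list) -> float:
    """Score one row with a logistic-regression model; no Spark needed."""
    z = INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

# Such a function can sit behind any plain HTTP endpoint, keeping the
# serving path free of cluster dependencies and job-scheduling overhead.
p = predict_proba([1.0, 0.5, 2.0])
assert 0.0 < p < 1.0
```

This is the low-latency end of the trade-off: no Spark context to keep alive, at the cost of re-expressing the model outside the training stack (which is the gap PMML and MLeap try to close).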
  45. What is your approach? Thanks. @geoHeil
  46. PMML - Openscoring • Based on PMML (Predictive Model Markup Language) (+) stay in the Java/XML world (enterprise operations :) ) (+) quick predictions (+) mature (-) not all models are suitable for PMML / some algorithms not implemented (-) XML
  47. PMML + retraining: oryx.io
  48. prediction.IO
  49. H2O Steam: e2e platform, build + deploy interoperability, enterprise permissions, based on H2O Flow
  50. pipeline.io: notebook -> prediction, e2e. "Extend ML pipelines to serve production users"
  51. How do the tools stack up regarding security? https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
  52. Python (what I learnt later on) • Can easily be deployed on its own (if ops can handle this) • Py4J / PySpark / spylon?
  53. Science in Python, production in Java: spylon (video) • Bring the code via a custom UDF to the data in PySpark • Model = fitted scikit-learn model • Requires the model to be parallelizable
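The "bring the code to the data" pattern above ships the fitted model to each executor and applies it per partition, which is why the model must be parallelizable. A sketch with a stand-in model; in PySpark the same function would be passed to `rdd.mapPartitions` (or wrapped in a UDF):

```python
class FittedModel:
    """Stand-in for a fitted scikit-learn estimator."""
    def predict(self, rows):
        # Dummy scoring rule for illustration only.
        return [sum(row) for row in rows]

def score_partition(model, partition):
    """Score one partition in bulk, amortizing per-call overhead."""
    rows = list(partition)
    return model.predict(rows)

# Locally this is just a function call; on a cluster each executor
# would run it on its own slice of the data.
model = FittedModel()
scores = score_partition(model, iter([[1, 2], [3, 4]]))
assert scores == [3, 7]
```

Batching per partition matters because calling `predict` once per row would pay the Python/JVM crossing cost on every record.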
  54. Others • Jupyter notebook to REST API (IBM interactive dashboards, http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/) • Apache Toree (interactive Spark as a notebook)
