Machine learning model to production

How to bring a machine learning script into a production environment

  1. prototype -> production. Make your ML app rock.
  2. Agenda • Problems with the current workflow • Interactive exploration to enterprise API • Data science platforms • My recommendation
  3. About me: @geoHeil • Data scientist at T-Mobile Austria • Business Informatics at Vienna University of Technology • Built a predictive startup (predictr.eu) • Data science projects at university
  4. Ed, 41: professional developer. Cares about testing, CI, stability. John, 28: PhD cool kid. Wants to build an awesome app.
  5. Simple? Goal: a smart application improves business processes. John's smart app, Ed's business process.
  6. Simple? Goal: a smart application improves business processes. Ed's business process.
  7. ML modes: similarity of environments? Exploration: • flexibility • easy to use • reusability. Production: • performance • scalability • monitoring • API. Interaction is required to improve the business process.
  8. flexibility vs. performance (from https://www.youtube.com/watch?v=R-6nAwLyWCI)
  9. Stackup. Problems: • Move to production means redevelopment from scratch. Solutions: • Notebooks as API
  10. Prototype problem at the current project. Easy move to the JVM? Consultant: R. Me: Python. Production: JVM with native C dependencies.
  11. Stackup. Problems: • Move to production means redevelopment from scratch • Enterprise operations handle the JVM only. Solutions: • Notebooks as API • Redevelop from scratch
  12. Prototype problem at the current project. Easy move to the JVM? Consultant: R. Me: Python. Production: JVM with native C dependencies.
  13. Data exchange possibilities (API): Pickle (Python only), Hadoop file formats (Avro/Parquet), Thrift, Protobuf, message queue, REST
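The pickle caveat on slide 13 can be made concrete. A minimal sketch with a toy payload (the field names are illustrative): pickle bytes are opaque to a JVM consumer, while a language-neutral format such as JSON (or Avro/Parquet/Protobuf for larger data) can be parsed on both sides of the boundary.

```python
import json
import pickle

# A toy "model output" to hand from the Python prototype to a JVM backend.
payload = {"customer_id": 42, "churn_score": 0.87}

# Pickle is convenient but Python-only: these bytes mean nothing to a JVM service.
blob = pickle.dumps(payload)

# A language-neutral wire format works on both sides of the language boundary.
wire = json.dumps(payload)
restored = json.loads(wire)
assert restored == payload
```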
  14. Stackup. Problems: • Move to production means redevelopment from scratch • Enterprise operations handle the JVM only. Solutions: • Notebooks as API • Use analytics via an API
  15. "Big data starts at 20 GB. I want to use the fancy Hadoop cluster." "We can buy a server with 6 TB of RAM."
  16. 3 types of big data: 1. Fits in memory (6 TB of RAM …) 2. Raw data too large for memory, but aggregated data works well 3. Too big => the ML needs to be big as well
  17. Stackup. Problems: • Move to production means redevelopment from scratch • Enterprise operations handle the JVM only • Inflexible big data tools. Solutions: • Notebooks as API • Use analytics via an API • Your data is not "really big" and still fits in memory
  18. "Security is not my job." Disagree / infosec.
  19. Stackup. Problems: • Move to production means redevelopment from scratch • Enterprise operations handle the JVM only • Inflexible big data tools • Security not taken care of. Solutions: • Notebooks as API • Use analytics via an API • Your data is not "really big" and still fits in memory -> keep using Python / R / notebooks • Kerberized Hadoop cluster :(
  20. Exploration to enterprise API
  21. Small data & R prototype. Separation of concerns.
  22. Startup data science – predicting cash flows • Custom backend (JVM) • Data science via an API (OpenCPU / R) • Partly in the backend (Renjin)
  23. Other possibilities • JNI (Java Native Interface) :( • JNA (Java Native Access) • Rkafka (we did not have a message queue in the infrastructure) • Custom service (REST call) to a JNA-enabled server (too costly)
  24. Music streaming: anomaly detection on big data
  25. Source: https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
  26. Project facts • We were using an MS SQL backup (600 GB) • Spark + Parquet compressed it to 3 GB • No cluster during development of the project, only laptops + a 60 GB RAM server • Most of the time was spent in garbage collection (15 seconds on a real cluster, 17 minutes on a laptop)
  27. Data science stack • Type 2 big data (aggregation allows for local in-memory processing in Python/R) • Spark as a (REST) API: POST /jars/app_name jobserver:port/jars/myjob; POST jobserver:port/contexts/context_for_myapp; POST "paramKey = paramValue" jobserver:port/jobs?appName=myjob&classPath=path.to.main&context=context_for_myapp • Aggregated data fed to R via a REST API
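The three job-server calls on slide 27 can be sketched in Python as plain URL construction, following the spark-jobserver REST convention (jar upload, context creation, job submission). Host, port, and class path here are placeholders; the actual HTTP POSTs are left out.

```python
from urllib.parse import urlencode

# Hypothetical host/port; substitute your spark-jobserver instance.
jobserver = "http://jobserver:8090"

# 1. Upload the fat jar containing the compiled Spark job.
upload_url = f"{jobserver}/jars/myjob"

# 2. Create a long-lived context so repeated requests skip Spark startup cost.
context_url = f"{jobserver}/contexts/context_for_myapp"

# 3. Run the job against that context (input parameters go in the POST body).
params = urlencode({
    "appName": "myjob",
    "classPath": "path.to.Main",  # placeholder main class
    "context": "context_for_myapp",
})
run_url = f"{jobserver}/jobs?{params}"
```

Keeping the context alive between calls is what turns Spark into a usable low-latency (REST) API rather than a batch scheduler.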
  28. Frontend / backend / data science: SQL aggregation via spark-jobserver on the Spark cluster; R via OpenCPU on a laptop :) Spark aggregation & R as API via REST calls; API incompatibilities :(
  29. Data science platform: can the architecture be simplified?
  30. Cloud solutions • Notebook as API: Databricks workflows / Domino Data Lab • Google, Microsoft, Amazon • Several data science platform startups: BigML, Dataiku, ... (+) cluster deployment in one click (+) some integrate notebooks well (-) control over data?
  31. What is missing? Custom models, control over data, testing, CI, A/B testing, retraining
  32. Several solutions – same problem
  33. Let's try lean. Back to the Spark architecture overview …
  34. Missing: API layer / model deployment
  35. Hydrospheredata/mist: notebook, CI -> e2e
  36. CI & testing +1. Notebook e2e +1. But again: a lot of moving parts. Highly experimental.
  37. Seldon – an e2e ML platform for the enterprise
  38. Seldon architecture: K8s for high availability, hot model deployments, A/B testing, holdout group, containerized microservices conforming to Seldon's REST API. Overall very good, but: outdated Python 2.x, Kubernetes mandatory.
  39. In an ideal world. What I dream of …
  40. Wish list • Flexibility to experiment (notebooks) on big enough hardware • Make these easily available as an API in a pre-production environment to gain quick business feedback • A/B testing, holdout group, containers • More of a "developer" mindset (testing, CI, security) for data scientists
  41. Reality is different. How I will move forward with my current project.
  42. Write a JVM-based custom backend which operations and the existing developers can maintain. Apparently this is a better fit than a turnkey platform solution.
  43. How to integrate Spark? Spark deployment modes revisited ...
  44. Spark deployment scenarios • Batch / bulk prediction in the cluster -> job-scheduling overhead • Long-running Spark application? (SJS, pipeline persistence -> local Spark context) • Predictive service without Spark • PMML? jpmml/sklearn2pmml • Scoring without Spark -> MLeap and SPARK-16365
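The last bullet of slide 44, scoring without Spark, boils down to exporting the fitted parameters and applying them in plain code at request time. A sketch with made-up logistic-regression coefficients (the weights and feature names are illustrative; in practice they would come from a PMML or MLeap export of the fitted pipeline):

```python
import math

# Hypothetical coefficients exported from a fitted model; once exported,
# scoring needs no Spark context at request time.
weights = {"intercept": -1.2, "tenure": 0.03, "complaints": 0.9}

def score(features: dict) -> float:
    """Apply a logistic-regression-style score outside the cluster."""
    z = weights["intercept"] + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

p = score({"tenure": 24, "complaints": 1})
```

This is the trade-off the slide points at: batch prediction pays job-scheduling overhead, a long-running Spark application pays operational complexity, while a sparkless scorer like this pays only the cost of keeping the export step in sync with retraining.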
  45. What is your approach? Thanks. @geoHeil
  46. PMML – Openscoring • Based on PMML (Predictive Model Markup Language) (+) stay in the Java/XML world (enterprise operations :)) (+) quick predictions (+) mature (-) not all models are suitable for PMML / some algorithms are not implemented (-) XML
  47. PMML + retraining: oryx.io
  48. prediction.IO
  49. H2O Steam: e2e platform. Build + deploy, interoperability, enterprise permissions. Based on H2O Flow.
  50. pipeline.io: notebook -> prediction, e2e. "Extend ML pipelines to serve production users"
  51. How do the tools stack up regarding security? https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
  52. Python (what I learnt later on) • Can easily be deployed on its own (if ops can handle this) • Py4J / PySpark / spylon?
  53. Science in Python, production in Java – spylon (video) • Bring the code to the data via a custom UDF in PySpark • Model = a fitted scikit-learn model • Requires the model to be parallelizable
  54. Others • Jupyter notebook to REST API (IBM interactive dashboards: http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/) • Apache Toree (interactive Spark as a notebook)
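The "notebook to REST API" idea on slide 54 reduces to putting a scorer behind an HTTP handler. A stdlib-only sketch, not the IBM approach linked above: the `predict` stub and the port are placeholders for a fitted model and real deployment config.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features: dict) -> dict:
    """Stub scorer; a real service would load a fitted model here."""
    return {"score": 0.5 + 0.01 * features.get("tenure", 0)}

class PredictHandler(BaseHTTPRequestHandler):
    """Accepts a JSON feature dict via POST and returns a JSON score."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Placeholder port; blocks serving requests until interrupted.
    HTTPServer(("localhost", 8000), PredictHandler).serve_forever()
```

Even a sketch like this makes the wish-list concerns (testing, CI, A/B routing, security) concrete: they all attach to this HTTP boundary, which is exactly what the platforms above try to productize.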
