Machine learning model to production

Georg Heiler, data scientist @ T-Mobile Austria
prototype -> production
Make your ML app rock
Agenda
• Problems with current workflow
• Interactive exploration to enterprise API
• Data Science Platforms
• My recommendation
About me @geoHeil
• Data Scientist at T-Mobile Austria
• Business Informatics at Vienna University of Technology
• Built predictive startup (predictr.eu)
• Data science projects at university
Ed, 41
Professional developer
Cares about Testing, CI,
stability
John, 28
PhD, cool kid
Wants to build
awesome app
Simple?
Goal: smart application improves business processes
John’s
Smart app
Ed’s
Business
process
ML modes: similarity of environments?
Exploration
• Flexibility
• Easy to use
• reusability
Production
• Performance
• Scalability
• Monitoring
• API
Interaction required to improve business process
ML modes: flexibility vs. performance
from https://www.youtube.com/watch?v=R-6nAwLyWCI
Stackup
Problems
• Move to production means
redevelopment from scratch
Solutions
• Notebooks as API
Prototype problem at current project
Easy move to the JVM?
Consultant
R
Me
Python
Production
JVM
native C dependencies
Stackup
Problems
• Move to production means
redevelopment from scratch
• Enterprise operations handle JVM
only
Solutions
• Notebooks as API
• Re develop from scratch
Data exchange possibilities (API)
Pickle – python only
Hadoop file formats (avro/parquet)
Thrift, protobuf
Message queue
REST
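A minimal sketch of the trade-off in the list above, using only the Python standard library. The parameter dictionary and the `predict` helper are hypothetical and not from the talk. Pickle produces bytes that only another Python process can read, while a JSON payload could be parsed by a JVM backend just as easily.

```python
import json
import pickle

# Hypothetical fitted linear model: y = intercept + sum(coef_i * x_i).
params = {"intercept": 1.0, "coefficients": [2.0, 0.5]}


def predict(p, features):
    """Score one observation with the given parameters."""
    return p["intercept"] + sum(c * x for c, x in zip(p["coefficients"], features))


# Pickle: convenient, but the bytes are only meaningful to Python.
pickled = pickle.dumps(params)

# JSON: language-neutral text that a non-Python consumer can read as well.
as_json = json.dumps(params)

from_pickle = pickle.loads(pickled)
from_json = json.loads(as_json)

print(as_json)
print(predict(from_json, [3.0, 4.0]))
```

Avro, Parquet, Thrift, and protobuf play the same role as JSON here, with a schema and a more compact encoding.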
Stackup
Problems
• Move to production means
redevelopment from scratch
• Enterprise operations handle JVM
only
Solutions
• Notebooks as API
• Use analytics via an API
Big data starts at
20GB. Want to use
fancy hadoop cluster
We can buy a
server with 6 TB
RAM
3 types of big data
1. Fits in memory (6 TB of RAM …)
2. Raw data too large for memory, but aggregated data works
well
3. Too big => ml needs to be big as well
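Type 2 can be illustrated with a small sketch: the raw rows are read one at a time and only the aggregate is kept in memory. The column names and the in-memory CSV are made up for illustration; in a real project the rows would come from a large file or from a Spark or SQL aggregation.

```python
import csv
import io
from collections import defaultdict


def aggregate_stream(rows, key, value):
    """Sum `value` per `key` while consuming rows one at a time."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


# Stand-in for raw data that is too large to load at once.
raw = io.StringIO("customer,amount\na,10\nb,5\na,2.5\n")
totals = aggregate_stream(csv.DictReader(raw), key="customer", value="amount")
print(totals)
```

Memory use grows with the number of distinct keys rather than with the number of raw rows, which is why the aggregated result can then be handled locally in Python or R.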
Stackup
Problems
• Move to production means
redevelopment from scratch
• Enterprise operations handle JVM
only
• Inflexible big data tools
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not “really big” and
still fits in memory
Security is
not my job
Disagree /
infoSec
Stackup
Problems
• Move to production means
redevelopment from scratch
• Enterprise operations handle JVM
only
• Inflexible big data tools
• Security not taken care of
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not “really big” and
still fits in memory -> keep using
python / R / notebooks
• Kerberized hadoop cluster :(
Exploration to
Enterprise API
small data & R prototype
Separation of concerns.
Startup data science – predicting cash flows
• Custom backend (JVM)
• Data science via an API (OpenCPU / R)
• Partly in backend (Renjin)
Other possibilities
• JNI (Java Native Interface) :(
• JNA (Java Native Access)
• Rkafka (did not have an MQ in infrastructure)
• Custom service (REST call) to JNA-enabled server (too
costly)
Music streaming
Anomaly detection big data
Source
https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
project facts
• We were using an MS SQL backup (600 GB)
• Spark + Parquet compressed it to 3 GB
• No cluster during development of the project, only laptops
+ 60 GB RAM server
• Most of the time spent in garbage collection (15 sec on
real cluster, 17 minutes on laptop)
Data science stack
• Type 2 big data (aggregation allows for local in memory
processing in python/R)
• Spark as (REST) API
POST jobserver:port/jars/myjob
POST jobserver:port/contexts/context_for_myapp
POST "paramKey = paramValue"
jobserver:port/jobs?appName=myjob&classPath=path.to.main&context=context_for_myapp
• Aggregated data fed to R via REST-API
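The three job-server calls can be sketched in Python as follows. This only builds the requests and does not send them. The base URL (host and port 8090) is an assumption, the jar payload is a placeholder, and the `/jobs` path follows the spark-jobserver README rather than anything confirmed in the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://jobserver:8090"  # hypothetical host and port


def upload_jar(app_name, jar_bytes):
    """POST the application jar under the given app name."""
    return Request(f"{BASE}/jars/{app_name}", data=jar_bytes, method="POST")


def create_context(context):
    """POST an empty body to create a named, long-running Spark context."""
    return Request(f"{BASE}/contexts/{context}", data=b"", method="POST")


def submit_job(app_name, class_path, context, config):
    """POST the job configuration; app, main class and context go in the query."""
    query = urlencode({"appName": app_name, "classPath": class_path, "context": context})
    return Request(f"{BASE}/jobs?{query}", data=config.encode("utf-8"), method="POST")


calls = [
    upload_jar("myjob", b"<jar bytes>"),  # placeholder payload, not a real jar
    create_context("context_for_myapp"),
    submit_job("myjob", "path.to.main", "context_for_myapp", "paramKey = paramValue"),
]
for call in calls:
    print(call.get_method(), call.full_url)
```

Passing each request to `urllib.request.urlopen` would perform the call against a running job server.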
Spark aggregation & R as API
[Diagram: frontend, backend and data-science layers connected by REST calls. SQL aggregation / Spark job server runs on a Spark cluster or a laptop :). R is served via OpenCPU. API incompatibilities :(]
Data science platform
Can the architecture be simplified?
Cloud solutions
• Notebook as API: Databricks workflows / Domino data lab
• Google, Microsoft, Amazon
• Several data science platform startups bigml, dataiku,
...
(+) cluster deploy on click
(+) some integrate notebooks well
(-) control over data?
What is missing?
Custom models, Control over data,
Testing, CI, AB testing, retraining
Several solutions – same problem
Let's try lean
Back to spark architecture overview …
Missing API layer / model deployment
Hydrospheredata/mist notebook, CI -> e2e
CI & testing +1
Notebook e2e +1
But again: a lot of
moving parts
Highly experimental
Seldon – e2e ML platform for enterprise
Seldon architecture
K8s for high availability
Hot model deployments
A-B testing
Holdout group
Containerized micro
services conforming to
seldon’s REST API
Overall very good
But: outdated Python
2.xx
Kubernetes
mandatory
In an ideal world
What I dream of …
Wish list
• Flexibility to experiment (notebooks) on big enough
hardware
• Make these easily available as an API in a pre-production
environment to gain quick business feedback
• A-B testing, holdout group, containers
• More “developer” mindset (Testing, CI, security) for data
scientists
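The A-B testing and holdout-group item can be sketched with a deterministic, hash-based assignment. The function name, the percentages, and the user-id format are assumptions for illustration and do not describe how any of the platforms above implement it.

```python
import hashlib


def assign_group(user_id, holdout_pct=10, b_pct=45):
    """Map a user id to 'holdout', 'B', or 'A' in a stable way."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < holdout_pct:
        return "holdout"  # receives no model output, used as baseline
    if bucket < holdout_pct + b_pct:
        return "B"  # candidate model
    return "A"  # current model


counts = {"A": 0, "B": 0, "holdout": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}")] += 1
print(counts)
```

Because the group depends only on the hash of the id, the same user lands in the same group on every request and on every server, without any shared state.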
Reality is different.
How I will move forward with my current
project
Write a JVM-based custom backend which operations and existing developers
can maintain. Apparently this is a better fit than a platform turnkey solution.
How to integrate spark?
Spark deployment modes revisited ...
Spark deployment scenarios
• Batch / bulk prediction in cluster -> job scheduling
overhead
• Long-running Spark application? (SJS, pipeline persistence
-> local Spark context)
• Predictive service without spark
• PMML? jpmml/sklearn2pmml
• scoring without spark -> mleap and SPARK-16365
What is your approach?
Thanks. @geoHeil
PMML - Openscoring
• Based on PMML (Predictive Model Markup Language)
(+) stay in java/xml world (enterprise operations :) )
(+) quick predictions
(+) mature
(-) not all models suitable for PMML / some algorithms not
implemented
(-) xml
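To show the idea behind PMML, here is a toy sketch that writes a linear regression as a PMML-style XML document and scores it again. It is not schema-valid PMML (the required Header, DataDictionary and MiningSchema elements are left out), and `to_pmml` and `score` are hypothetical helpers, not the Openscoring or JPMML API. A real setup would export with a library such as sklearn2pmml and evaluate on the JVM.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_3"  # PMML 4.3 namespace


def to_pmml(intercept, coefficients):
    """Serialize a linear regression as a minimal PMML-style document."""
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}PMML", {"version": "4.3"})
    model = ET.SubElement(root, f"{{{NS}}}RegressionModel", {"functionName": "regression"})
    table = ET.SubElement(model, f"{{{NS}}}RegressionTable", {"intercept": str(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(
            table,
            f"{{{NS}}}NumericPredictor",
            {"name": name, "coefficient": str(coef)},
        )
    return ET.tostring(root, encoding="unicode")


def score(pmml_text, features):
    """Toy evaluator: supports a single RegressionTable with numeric predictors."""
    table = ET.fromstring(pmml_text).find(f".//{{{NS}}}RegressionTable")
    result = float(table.get("intercept"))
    for predictor in table.findall(f"{{{NS}}}NumericPredictor"):
        result += float(predictor.get("coefficient")) * features[predictor.get("name")]
    return result


pmml = to_pmml(1.0, {"x1": 2.0, "x2": 0.5})
print(pmml)
print(score(pmml, {"x1": 3.0, "x2": 4.0}))
```

The model travels as plain XML, so the training side and the scoring side do not need to share a language or runtime.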
PMML + retraining oryx.io
prediction.IO
h2o steam
E2e platform
Build + deploy
interoperability
Enterprise
permissions
Based on h2o-flow
pipeline.io notebook ->
prediction, e2e
“Extend ml pipelines to
serve production users“
How do tools stack up regarding security?
https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
Python (what I learnt later on)
• Can easily be deployed on its own (if ops can handle this)
• Python4j/ pyspark/ spylon?
Science in Python, production in java – spylon, Video
• Bring code via custom UDF to data in pySpark
• Model = fitted sk-learn model
• Requires model to be parallelizable
others
• Jupyter notebook to REST API (IBM interactive dashboard
http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/)
• Apache toree (interactive spark as notebook)
Machine learning model to production

  • 1. prototype -> production Make your ML app rock
  • 2. Agenda • Problems with current workflow • Interactive exploration to enterprise API • Data Science Platforms • My recommendation
  • 3. About me @geoHeil • Data Scientist at T-Mobile Austria • Business Informatics at Vienna University of Technology • Built predictive startup (predictr.eu) • Data science projects at university
  • 4. Ed, 41 Professional developer Cares about Testing, CI, stability John, 28 Phd. cool kid Wants to build awesome app
  • 5. Simple? Goal: smart application improves business processes John’s Smart app Ed’s Business process
  • 6. Simple? Goal: smart application improves business processes Ed’s Business process
  • 7. ML modes: similarity of environments? Exploration • Flexibility • Easy to use • reusability Production • Performance • Scalability • Monitoring • API Interaction required to improve business process ML modes
  • 9. Stackup Problems • Move to production means redevelopment from scratch Solutions • Notebooks as API
  • 10. Prototype problem at current project Easy move to the JVM? Consultant R Me Python Production JVM native C dependencies
  • 11. Stackup Problems • Move to production means redevelopment from scratch • Enterprise operations handle JVM only Solutions • Notebooks as API • Re develop from scratch
  • 12. Prototype problem at current project Easy move to the JVM? Consultant R Me Python Production JVM native C dependencies
  • 13. Data exchange possibilities (API) Pickle – python only Hadoop file formats (avro/parquet) Thrift, protobuf Message queue REST
  • 14. Stackup Problems • Move to production means redevelopment from scratch • Enterprise operations handle JVM only Solutions • Notebooks as API • Use analytics via an API
  • 15. Big data starts at 20GB. Want to use fancy hadoop cluster We can buy a server with 6 TB RAM
  • 16. 3 types of big data 1. Fits in memory (6 TB of RAM …) 2. Raw data too large for memory, but aggregated data works well 3. Too big => ml needs to be big as well
  • 17. Stackup Problems • Move to production means redevelopment from scratch • Enterprise operations handle JVM only • Enterprise operations handle JVM only • Inflexible big data tools Solutions • Notebooks as API • Use analytics via an API • Your data is not “really big” and still fits in memory
  • 18. Security is not my job Disagree / infoSec
  • 19. Stackup Problems • Move to production means redevelopment from scratch • Enterprise operations handle JVM only • Inflexible big data tools • Security not taken care of Solutions • Notebooks as API • Use analytics via an API • Your data is not “really big” and still fits in memory -> keep using Python / R / notebooks • Kerberized hadoop cluster :(
  • 21. small data & R prototype Separation of concerns.
  • 22. Startup data science – predicting cash flows • Custom backend (JVM) • Data science via an API (OpenCPU / R) • Partly in backend (Renjin)
  • 23. Other possibilities • JNI (Java Native Interface) :( • JNA (Java Native Access) • Rkafka (did not have an MQ in infrastructure) • Custom service (REST call) to JNA enabled server (too costly)
  • 29. Project facts • We were using an MS SQL backup (600 GB) • Spark + parquet compressed it to 3 GB • No cluster during development of the project, only laptops + 60 GB RAM server • Most of the time spent in garbage collection (15 sec on real cluster, 17 minutes on laptop)
  • 30. Data science stack • Type 2 big data (aggregation allows for local in memory processing in python/R) • Spark as (REST) API POST /jars/app_name jobserver:port/jars/myjob POST jobserver:port/contexts/context_for_myapp POST "paramKey = paramValue" jobserver:port/myjob?appName=myjob&classPath=path.to.main&context=context_for_myapp • Aggregated data fed to R via REST-API
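
The three POST calls on this slide can be sketched as URL builders. This is only an illustration under assumptions: the host, the port 8090 and the helper names are hypothetical, and the default job path `/jobs` is what I recall from the spark-jobserver documentation, whereas the slide writes `/myjob`, so it is left as a parameter. No request is actually sent.

```python
from urllib.parse import urlencode


def jar_upload_url(base_url, app_name):
    # POST the application jar under a chosen app name.
    return f"{base_url}/jars/{app_name}"


def context_url(base_url, context_name):
    # POST to create a long-running Spark context.
    return f"{base_url}/contexts/{context_name}"


def job_url(base_url, app_name, class_path, context, job_path="/jobs"):
    # POST to start a job; job_path is an assumption (the slide uses /myjob).
    query = urlencode(
        {"appName": app_name, "classPath": class_path, "context": context}
    )
    return f"{base_url}{job_path}?{query}"


def job_config(params):
    # Request body in the "paramKey = paramValue" form shown on the slide.
    return "\n".join(f"{key} = {value}" for key, value in params.items())


base = "http://jobserver:8090"  # hypothetical host and port
url = job_url(base, "myjob", "path.to.main", "context_for_myapp")
body = job_config({"paramKey": "paramValue"})
```

An HTTP client of choice would then POST `body` to `url`; the aggregated result comes back in the response and can be passed on to R.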
  • 31. Frontend Backend Data-science SQL aggregation / spark job-server Spark cluster Laptop :) R via opencpu Spark aggregation & R as API REST call API incompatibilities :(
  • 32. Data science platform Can the architecture be simplified?
  • 33. Cloud solutions • Notebook as API: Databricks workflows / Domino data lab • Google, Microsoft, Amazon • Several data science platform startups bigml, dataiku, ... (+) cluster deploy on click (+) some integrate notebooks well (-) control over data?
  • 34. What is missing? Custom models, Control over data, Testing, CI, AB testing, retraining
  • 35. Several solutions – same problem
  • 36. Let's try lean Back to spark architecture overview …
  • 37. Missing API layer / model deployment
  • 39. CI & testing +1 Notebook e2e +1 But again: a lot of moving parts Highly experimental
  • 40. Seldon – e2e ML platform for enterprise
  • 41. Seldon architecture K8s for high availability Hot model deployments A-B testing Holdout group Containerized micro services conforming to seldon’s REST API Overall very good But: outdated python 2.xx Kubernetes mandatory
  • 42. In an ideal world What I dream of …
  • 43. Wish list • Flexibility to experiment (notebooks) on big enough hardware • Make these easily available as an API in a pre-production environment to gain quick business feedback • A-B testing, holdout group, containers • More “developer” mindset (Testing, CI, security) for data scientists
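
One common way to get A-B testing with a holdout group is deterministic, hash-based assignment. The sketch below is not from the talk and not how Seldon implements it; the function name `assign_group` and the default shares are assumptions.

```python
import hashlib


def assign_group(user_id, holdout_share=0.1, variant_b_share=0.5):
    """Map a user id to "holdout", "A" or "B" in a repeatable way.

    The same id always lands in the same group, so no assignment table
    has to be stored.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < holdout_share:
        return "holdout"  # never sees any model output
    # Rescale the remaining range to [0, 1) and split it between A and B.
    remaining = (bucket - holdout_share) / (1.0 - holdout_share)
    return "B" if remaining < variant_b_share else "A"


groups = [assign_group(user_id) for user_id in range(1000)]
```

Users in the holdout group receive no predictions at all, which gives a baseline for measuring whether either model variant improves the business process.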
  • 44. Reality is different. How I will move forward with my current project
  • 45. Write a JVM-based custom backend which operations and existing developers can maintain. Apparently this is a better fit than a platform turnkey solution.
  • 46. How to integrate spark? Spark deployment modes revisited ...
  • 47. Spark deployment scenarios • Batch / bulk prediction in cluster -> job scheduling overhead • Long running spark application? (SJS, pipeline persistence -> local spark context) • Predictive service without spark • PMML? jpmml/sklearn2pmml • scoring without spark -> mleap and SPARK-16365
  • 48. What is your approach? Thanks. @geoHeil
  • 49. PMML - Openscoring • Based on PMML (Predictive Model Markup Language) (+) stay in java/xml world (enterprise operations :)) (+) quick predictions (+) mature (-) not all models suitable for PMML / some algorithms not implemented (-) xml
  • 50. PMML + retraining oryx.io
  • 52. h2o steam E2e platform Build + deploy interoperability Enterprise permissions Based on h2o-flow
  • 54. pipeline.io notebook -> prediction, e2e “Extend ml pipelines to serve production users”
  • 55. How do tools stack up regarding security? https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
  • 56. Python (what I learnt later on) • Can easily be deployed on its own (if ops can handle this) • Python4j / pyspark / spylon?
  • 57. Science in Python, production in java – spylon, Video • Bring code via custom UDF to data in pySpark • Model = fitted sk-learn model • Requires model to be parallelizable
  • 58. others • Jupyter notebook to REST API (IBM interactive dashboard http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/) • Apache toree (interactive spark as notebook)

Editor's Notes

  1. Hi Georg. Talk about how to not have a smart prototype script rot in the corner. First talk ;) Question: Who has played with machine learning? Who is familiar with R / Python? Who is using big data technology in production? Who is driving business decisions with ML?
  2. Discussion about how you deploy models
  3. Apache Toree, Jupyter notebooks as REST api (IBM)
  4. Notebooks can execute JVM code as well