SlideShare a Scribd company logo
DBG / June 5, 2018 / © 2018 IBM Corporation
Productionizing

Spark ML Pipelines
with PFA
Nick Pentreath
Principal Engineer
@MLnick
DBG / June 5, 2018 / © 2018 IBM Corporation
About
@MLnick on Twitter & Github
Principal Engineer, IBM
CODAIT - Center for Open-Source Data & AI
Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
DBG / June 5, 2018 / © 2018 IBM Corporation
Center for Open Source Data and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
DBG / June 5, 2018 / © 2018 IBM Corporation
Agenda
The Machine Learning Workflow
Challenges of ML Deployment
Portable Format for Analytics
PFA for Spark ML
Performance Comparisons
Summary and Future Directions
DBG / June 5, 2018 / © 2018 IBM Corporation
The Machine Learning
Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
Perception
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
In reality the workflow spans teams …
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
… and tools …
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
… and is a small (but critical!) piece of
the puzzle
The Machine Learning Workflow
*Source: Hidden Technical Debt in Machine Learning Systems
DBG / June 5, 2018 / © 2018 IBM Corporation
Machine Learning
Deployment
DBG / June 5, 2018 / © 2018 IBM Corporation
Challenges
• Proliferation of formats
• Lack of standardization => custom solutions
• Existing standards have limitations
Machine Learning Deployment
• Need to manage and bridge many different:
• Languages - Python, R, Notebooks, Scala / Java / C
• Frameworks – too many to count!
• Dependencies
• Versions
• Performance characteristics highly variable
across these dimensions
• Friction between teams
• Data scientists & researchers – latest & greatest
• Production – stability, control, performance
• Business – metrics, business impact, working product!
DBG / June 5, 2018 / © 2018 IBM Corporation
Spark solves many problems …
Machine Learning Deployment
DBG / June 5, 2018 / © 2018 IBM Corporation
… but introduces additional challenges
• Currently, in order to use trained models
outside of Spark, users must:
• Write custom readers for Spark’s native format; or
• Create their own custom format; or
• Export to a standard format (limited support =>
custom solution)
• Scoring outside of Spark also requires custom
translation layer between Spark and another
ML library
Everything is custom!
Machine Learning Deployment
• Tight coupling to Spark runtime
• Introduces complex dependencies
• Managing version & compatibility issues
• Scoring models in Spark is slow
• Overhead of DataFrames, especially query planning
• Overhead of task scheduling, even locally
• Optimized for batch scoring (includes streaming
“micro-batch” settings)
• Spark is not suitable for real-time scoring (<
few 100ms latency)
DBG / June 5, 2018 / © 2018 IBM Corporation
The Portable Format for
Analytics
DBG / June 5, 2018 / © 2018 IBM Corporation
Overview
• PFA is being championed by the Data Mining
Group (IBM is a founding member)
• DMG previously created PMML (Predictive
Model Markup Language), arguably the only
viable open standard currently
• PMML has many limitations
• PFA was created specifically to address these
shortcomings
• PFA consists of:
• JSON serialization format
• AVRO schemas for data types
• Encodes functions (actions) that are applied to inputs
to create outputs with a set of built-in functions and
language constructs (e.g. control-flow, conditionals)
• Essentially a mini functional math language + schema
specification
• Type and function system means PFA can be
fully & statically verified on load and run by
any compliant execution engine
• => portability across languages, frameworks,
run times and versions
Portable Format for Analytics
DBG / June 5, 2018 / © 2018 IBM Corporation
A Simple Example
Portable Format for Analytics
• Example – multi-class logistic regression
• Specify input and output types using Avro
schemas
• Specify the action to perform (typically on
input)
DBG / June 5, 2018 / © 2018 IBM Corporation
Managing State
Portable Format for Analytics
• Data storage specified by cells
• A cell is a named value acting as a global variable
• Typically used to store state (such as model
coefficients, vocabulary mappings, etc)
• Types specified with Avro schemas
• Cell values are mutable within an action, but
immutable between action executions of a given PFA
document
• Persistent storage specified by pools
• Closer in concept to a database
• Pools values are mutable across action executions
DBG / June 5, 2018 / © 2018 IBM Corporation
Other Features
Portable Format for Analytics
• Special forms
• Control structure – conditionals & loops
• Creating and manipulating local variables
• User-defined functions including lambdas
• Casts
• Null checks
• (Very) basic try-catch, user-defined errors and logs
• Comprehensive built-in function library
• Math, strings, arrays, maps, stats, linear algebra
• Built-in support for some common models - decision
tree, clustering, linear models
DBG / June 5, 2018 / © 2018 IBM Corporation
Aardpfark
PFA and Spark ML
• PFA export for Spark ML pipelines
• aardpfark-core: Scala DSL for creating PFA
documents
• avro4s to generate schemas from case classes;
json4s to serialize PFA document to JSON
• aardpfark-sparkml: uses DSL to export Spark ML
components and pipelines to PFA
DBG / June 5, 2018 / © 2018 IBM Corporation
Aardpfark - Challenges
PFA and Spark ML
• Spark ML Model has no schema knowledge
• E.g. Binarizer can operate on numeric or vector
columns
• Need to use Avro union types for standalone PFA
components and handle all cases in the action logic
• Combining components into a pipeline
• Trying to match Spark’s DataFrame-based input/output
behavior (typically appending columns)
• Each component is wrapped as a user-defined function
in the PFA document
• Current approach mimics passing a Row (i.e. Avro
record) from function to function, adding fields
• Missing features in PFA
• Generic vector support (mixed dense/sparse)
DBG / June 5, 2018 / © 2018 IBM Corporation
Related Open Standards
DBG / June 5, 2018 / © 2018 IBM Corporation
PMML
Related Open Standards
• Data Mining Group (DMG)
• Model interchange format in XML with
operators
• Widely used and supported; open standard
• Spark support lacking natively but 3rd party
projects available: jpmml-sparkml
• Comprehensive support for Spark ML components
(perhaps surprisingly!)
• Watch SPARK-11237
• Other exporters include scikit-learn, R,
XGBoost and LightGBM
• Shortcomings
• Cannot represent arbitrary programs / analytic
applications
• Flexibility comes from custom plugins => lose
benefits of standardization
DBG / June 5, 2018 / © 2018 IBM Corporation
MLeap
Related Open Standards
• Created by Combust.ML, a startup focused on
ML model serving
• Model interchange format in JSON / Protobuf
• Components implemented in Scala code
• Good performance
• Initially focused on Spark ML. Offers almost
complete support for Spark ML components
• Recently added some sklearn; working on
TensorFlow
• Shortcomings
• “Open” format, but not a “standard”
• No concept of well-defined operators / functions
• Effectively forces a tight coupling between
versions of model producer / consumer
• Must implement custom components in Scala
• Impossible to statically verify serialized model
DBG / June 5, 2018 / © 2018 IBM Corporation
Open Neural Network Exchange (ONNX)
• Championed by Facebook & Microsoft
• Protobuf serialization format
• Describes computation graph (including
operators)
• In this way the serialized graph is “self-describing”
similarly to PFA
• More focused on Deep Learning / tensor
operations
• Will be baked into PyTorch 1.0.0 / Caffe2 as
the serialization & interchange format
• Shortcomings
• No or poor support for more “traditional” ML or
language constructs (currently)
• Tree-based models & ensembles
• String / categorical processing
• Control flow
• Intermediate variables
Related Open Standards
DBG / June 5, 2018 / © 2018 IBM Corporation
Performance
DBG / June 5, 2018 / © 2018 IBM Corporation
Scoring Performance Comparison
• Comparing scoring performance of PFA with
Spark and MLeap
• PFA uses Hadrian reference implementation
for JVM
• Test dataset of ~80,000 records
• String indexing of 47 categorical columns
• Vector assembling the 47 categorical indices together
with 27 numerical columns
• Linear regression predictor
• Note: Spark time is 1.9s / record (1901ms) -
not shown on the chart
Performance
DBG / June 5, 2018 / © 2018 IBM Corporation
Conclusion
DBG / June 5, 2018 / © 2018 IBM Corporation
Summary
Conclusion
• PFA provides an open standard for
serialization and deployment of analytic
workflows
• True portability across languages, frameworks,
runtimes and versions
• Execution environment is independent of the producer
(R, scikit-learn, Spark ML, weka, etc)
• Solves a significant pain point for the
deployment of Spark ML pipelines
• Also benefits the wider ecosystem
• e.g. many currently use PMML for exporting models
from R, scikit-learn, XGBoost, LightGBM, etc.
• However there are risks
• PFA is still young and needs to gain adoption
• Performance in production, at scale, is relatively
untested
• Tests indicate PFA reference engines need some work
on robustness and performance
• What about Deep Learning / comparison to ONNX?
• Limitations of PFA
• A standard can move slowly in terms of new features,
fixes and enhancements
DBG / June 5, 2018 / © 2018 IBM Corporation
Open-sourcing Aardpfark
Conclusion
• Coverage
• Almost all predictors (ML models)
• Many feature transformers
• Pipeline support (still needs work)
• Equivalence tests Spark <-> PFA
• Tests for core Scala DSL
https://github.com/CODAIT/aardpfark
• Need your help!
• Finish implementing components
• Improve pipeline support
• Complete Scala DSL for PFA
• Python support
• Tests, docs, testing it out!
DBG / June 5, 2018 / © 2018 IBM Corporation
Future directions
Conclusion
• Initially focused on Spark ML pipelines
• Later add support for scikit-learn pipelines, XGBoost,
LightGBM, etc
• (Support for many R models exist already in the Hadrian
project)
• Further performance testing in progress vs
Spark & MLeap; also test vs PMML
• More automated translation (Scala -> PFA,
ASTs etc)
• Propose improvements to PFA
• Generic vector (tensor) support
• Less cumbersome schema definitions
• Performance improvements to scoring engine
• PFA for Deep Learning?
• Comparing to ONNX and other emerging standards
• Better suited for the more general pre-processing
steps of DL pipelines
• Requires all the various DL-specific operators
• Requires tensor schema and better tensor support
built-in to the PFA spec
• Should have GPU support
DBG / June 5, 2018 / © 2018 IBM Corporation
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://datascience.ibm.com/
MAX
DBG / June 5, 2018 / © 2018 IBM Corporation
Links & References
Portable Format for Analytics
PMML
Spark MLlib – Saving and Loading Pipelines
Hadrian – Reference Implementation of PFA Engines for
JVM, Python, R
jpmml-sparkml
MLeap
Open Neural Network Exchange
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Tue, June 5 | 11 AM
2010-2012, 30 mins
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath (IBM)
Tue, June 5 | 2 PM
2018, 30 mins
Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing!
Holden Karau (Google) Bryan Cutler (IBM)
Tue, June 5 | 2 PM
Nook by 2001, 30 mins
Making Data and AI Accessible for All
Armand Ruiz Gabernet (IBM)
Tue, June 5 | 2:40 PM
2002-2004, 30 mins
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
Rajesh Bordawekar (IBM T.J. Watson Research Center)
Tue, June 5 | 3:20 PM
3016-3022, 30 mins
Dynamic Priorities for Apache Spark Application’s Resource Allocations
Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.)
Tue, June 5 | 3:20 PM
2001-2005, 30 mins
Model Parallelism in Spark ML Cross-Validation
Nick Pentreath (IBM) Bryan Cutler (IBM)
Tue, June 5 | 3:20 PM
2007, 30 mins
Serverless Machine Learning on Modern Hardware Using Apache Spark
Patrick Stuedi (IBM)
Tue, June 5 | 5:40 PM
2002-2004, 30 mins
Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine;
Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida)
Tue, June 5 | 5:40 PM
2007, 30 mins
Transparent GPU Exploitation on Apache Spark
Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM)
Tue, June 5 | 5:40 PM
2009-2011, 30 mins
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks
Yonggang Hu (IBM) Chao Xue (IBM)
IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More
Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP)
Wed, June 6 | 2 PM
2002-2004, 30 mins
Deep Learning for Recommender Systems
Nick Pentreath (IBM) )
Wed, June 6 | 3:20 PM
2018, 30 mins
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer
Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
Meet us at IBM booth in the Expo area.
DBG / June 5, 2018 / © 2018 IBM Corporation

More Related Content

What's hot

IBM Rhapsody and MATLAB/Simulink
IBM Rhapsody and MATLAB/SimulinkIBM Rhapsody and MATLAB/Simulink
IBM Rhapsody and MATLAB/Simulink
gjuljo
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
Nick Pentreath
 
Analytics on z Systems Focus on Real Time - Hélène Lyon
Analytics on z Systems Focus on Real Time - Hélène LyonAnalytics on z Systems Focus on Real Time - Hélène Lyon
Analytics on z Systems Focus on Real Time - Hélène Lyon
NRB
 
QVT & MTL In Eclipse
QVT & MTL In EclipseQVT & MTL In Eclipse
QVT & MTL In Eclipse
Jonathan Musset
 
JBPM Past Present Future
JBPM Past Present FutureJBPM Past Present Future
JBPM Past Present Future
Eric D. Schabell
 
Blue Ruby SDN Webinar
Blue Ruby SDN WebinarBlue Ruby SDN Webinar
Blue Ruby SDN Webinar
Juergen Schmerder
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Databricks
 
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
NRB
 
The road ahead for architectural languages [ACVI 2016]
The road ahead for architectural languages [ACVI 2016]The road ahead for architectural languages [ACVI 2016]
The road ahead for architectural languages [ACVI 2016]
Ivano Malavolta
 
API, WhizzML and Apps
API, WhizzML and AppsAPI, WhizzML and Apps
API, WhizzML and Apps
BigML, Inc
 
An Architecture for an XML-Template Engine enabling Safe Authoring
An Architecture for an XML-Template Engine enabling Safe AuthoringAn Architecture for an XML-Template Engine enabling Safe Authoring
An Architecture for an XML-Template Engine enabling Safe Authoring
Falk Hartmann
 
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Simon Urli
 
Anastasios_Fakas
Anastasios_FakasAnastasios_Fakas
Anastasios_Fakas
Anastasios Fakas
 
Industrial and Academic Experiences with a User Interaction Modeling Language...
Industrial and Academic Experiences with a User Interaction Modeling Language...Industrial and Academic Experiences with a User Interaction Modeling Language...
Industrial and Academic Experiences with a User Interaction Modeling Language...
Marco Brambilla
 
Ibm i-modernization
Ibm i-modernizationIbm i-modernization
Ibm i-modernization
Tom Presotto
 
Evolving Industrial Software Architectures into a Software Product Line: A Ca...
Evolving Industrial Software Architectures into a Software Product Line: A Ca...Evolving Industrial Software Architectures into a Software Product Line: A Ca...
Evolving Industrial Software Architectures into a Software Product Line: A Ca...
Heiko Koziolek
 
Practical Experiences Migrating Unified Modeling Language Models to IBM® Rati...
PracticalExperiences Migrating Unified Modeling Language Models to IBM® Rati...PracticalExperiences Migrating Unified Modeling Language Models to IBM® Rati...
Practical Experiences Migrating Unified Modeling Language Models to IBM® Rati...
Einar Karlsen
 
cv_filustek_en_08
cv_filustek_en_08cv_filustek_en_08
cv_filustek_en_08
Patrik Filustek
 
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
IBMSSA
 
AD for i in modern world
AD for i in modern worldAD for i in modern world
AD for i in modern world
COMMON Europe
 

What's hot (20)

IBM Rhapsody and MATLAB/Simulink
IBM Rhapsody and MATLAB/SimulinkIBM Rhapsody and MATLAB/Simulink
IBM Rhapsody and MATLAB/Simulink
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
 
Analytics on z Systems Focus on Real Time - Hélène Lyon
Analytics on z Systems Focus on Real Time - Hélène LyonAnalytics on z Systems Focus on Real Time - Hélène Lyon
Analytics on z Systems Focus on Real Time - Hélène Lyon
 
QVT & MTL In Eclipse
QVT & MTL In EclipseQVT & MTL In Eclipse
QVT & MTL In Eclipse
 
JBPM Past Present Future
JBPM Past Present FutureJBPM Past Present Future
JBPM Past Present Future
 
Blue Ruby SDN Webinar
Blue Ruby SDN WebinarBlue Ruby SDN Webinar
Blue Ruby SDN Webinar
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
The NRB Group mainframe day 2021 - Containerisation on Z - Paul Pilotto - Seb...
 
The road ahead for architectural languages [ACVI 2016]
The road ahead for architectural languages [ACVI 2016]The road ahead for architectural languages [ACVI 2016]
The road ahead for architectural languages [ACVI 2016]
 
API, WhizzML and Apps
API, WhizzML and AppsAPI, WhizzML and Apps
API, WhizzML and Apps
 
An Architecture for an XML-Template Engine enabling Safe Authoring
An Architecture for an XML-Template Engine enabling Safe AuthoringAn Architecture for an XML-Template Engine enabling Safe Authoring
An Architecture for an XML-Template Engine enabling Safe Authoring
 
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
 
Anastasios_Fakas
Anastasios_FakasAnastasios_Fakas
Anastasios_Fakas
 
Industrial and Academic Experiences with a User Interaction Modeling Language...
Industrial and Academic Experiences with a User Interaction Modeling Language...Industrial and Academic Experiences with a User Interaction Modeling Language...
Industrial and Academic Experiences with a User Interaction Modeling Language...
 
Ibm i-modernization
Ibm i-modernizationIbm i-modernization
Ibm i-modernization
 
Evolving Industrial Software Architectures into a Software Product Line: A Ca...
Evolving Industrial Software Architectures into a Software Product Line: A Ca...Evolving Industrial Software Architectures into a Software Product Line: A Ca...
Evolving Industrial Software Architectures into a Software Product Line: A Ca...
 
Practical Experiences Migrating Unified Modeling Language Models to IBM® Rati...
PracticalExperiences Migrating Unified Modeling Language Models to IBM® Rati...PracticalExperiences Migrating Unified Modeling Language Models to IBM® Rati...
Practical Experiences Migrating Unified Modeling Language Models to IBM® Rati...
 
cv_filustek_en_08
cv_filustek_en_08cv_filustek_en_08
cv_filustek_en_08
 
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
[Uruguay] DB2 for i: 7.1 Overview - Hernando Bedoya
 
AD for i in modern world
AD for i in modern worldAD for i in modern world
AD for i in modern world
 

Similar to Productionizing Spark ML Pipelines with the Portable Format for Analytics with Nick Pentreath

Productionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for AnalyticsProductionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analytics
DataWorks Summit
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
Chester Chen
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Databricks
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Formulatedby
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Databricks
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
Andrey Vykhodtsev
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
Adam Roberts
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
DataWorks Summit
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
Tim Ellison
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
AdamRobertsIBM
 
iSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java MigrationiSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java Migration
ecubemarketing
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Amazon Web Services
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
AWS Summits
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
Julien SIMON
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Amazon Web Services
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
Julien SIMON
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
Amazon Web Services
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
supportlogic
 

Similar to Productionizing Spark ML Pipelines with the Portable Format for Analytics with Nick Pentreath (20)

Productionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for AnalyticsProductionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for Analytics
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analytics
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
iSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java MigrationiSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java Migration
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 

Recently uploaded (20)

原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 

Productionizing Spark ML Pipelines with the Portable Format for Analytics with Nick Pentreath

  • 1. DBG / June 5, 2018 / © 2018 IBM Corporation Productionizing
 Spark ML Pipelines with PFA Nick Pentreath Principal Engineer @MLnick
  • 2. DBG / June 5, 2018 / © 2018 IBM Corporation About @MLnick on Twitter & Github Principal Engineer, IBM CODAIT - Center for Open-Source Data & AI Technologies Machine Learning & AI Apache Spark committer & PMC Author of Machine Learning with Spark Various conferences & meetups
  • 3. DBG / June 5, 2018 / © 2018 IBM Corporation Center for Open Source Data and AI Technologies CODAIT codait.org CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission Improving Enterprise AI Lifecycle in Open Source
  • 4. DBG / June 5, 2018 / © 2018 IBM Corporation Agenda The Machine Learning Workflow Challenges of ML Deployment Portable Format for Analytics PFA for Spark ML Performance Comparisons Summary and Future Directions
  • 5. DBG / June 5, 2018 / © 2018 IBM Corporation The Machine Learning Workflow
  • 6. DBG / June 5, 2018 / © 2018 IBM Corporation Perception The Machine Learning Workflow
  • 7. DBG / June 5, 2018 / © 2018 IBM Corporation In reality the workflow spans teams … The Machine Learning Workflow
  • 8. DBG / June 5, 2018 / © 2018 IBM Corporation … and tools … The Machine Learning Workflow
  • 9. DBG / June 5, 2018 / © 2018 IBM Corporation … and is a small (but critical!) piece of the puzzle The Machine Learning Workflow *Source: Hidden Technical Debt in Machine Learning Systems
  • 10. DBG / June 5, 2018 / © 2018 IBM Corporation Machine Learning Deployment
  • 11. DBG / June 5, 2018 / © 2018 IBM Corporation Challenges • Proliferation of formats • Lack of standardization => custom solutions • Existing standards have limitations Machine Learning Deployment • Need to manage and bridge many different: • Languages - Python, R, Notebooks, Scala / Java / C • Frameworks – too many to count! • Dependencies • Versions • Performance characteristics highly variable across these dimensions • Friction between teams • Data scientists & researchers – latest & greatest • Production – stability, control, performance • Business – metrics, business impact, working product!
  • 12. DBG / June 5, 2018 / © 2018 IBM Corporation Spark solves many problems … Machine Learning Deployment
  • 13. DBG / June 5, 2018 / © 2018 IBM Corporation … but introduces additional challenges • Currently, in order to use trained models outside of Spark, users must: • Write custom readers for Spark’s native format; or • Create their own custom format; or • Export to a standard format (limited support => custom solution) • Scoring outside of Spark also requires custom translation layer between Spark and another ML library Everything is custom! Machine Learning Deployment • Tight coupling to Spark runtime • Introduces complex dependencies • Managing version & compatibility issues • Scoring models in Spark is slow • Overhead of DataFrames, especially query planning • Overhead of task scheduling, even locally • Optimized for batch scoring (includes streaming “micro-batch” settings) • Spark is not suitable for real-time scoring (< few 100ms latency)
  • 14. DBG / June 5, 2018 / © 2018 IBM Corporation The Portable Format for Analytics
  • 15. DBG / June 5, 2018 / © 2018 IBM Corporation Overview • PFA is being championed by the Data Mining Group (IBM is a founding member) • DMG previously created PMML (Predictive Model Markup Language), arguably the only viable open standard currently • PMML has many limitations • PFA was created specifically to address these shortcomings • PFA consists of: • JSON serialization format • AVRO schemas for data types • Encodes functions (actions) that are applied to inputs to create outputs with a set of built-in functions and language constructs (e.g. control-flow, conditionals) • Essentially a mini functional math language + schema specification • Type and function system means PFA can be fully & statically verified on load and run by any compliant execution engine • => portability across languages, frameworks, run times and versions Portable Format for Analytics
  • 16. DBG / June 5, 2018 / © 2018 IBM Corporation A Simple Example Portable Format for Analytics • Example – multi-class logistic regression • Specify input and output types using Avro schemas • Specify the action to perform (typically on input)
  • 17. DBG / June 5, 2018 / © 2018 IBM Corporation Managing State Portable Format for Analytics • Data storage specified by cells • A cell is a named value acting as a global variable • Typically used to store state (such as model coefficients, vocabulary mappings, etc) • Types specified with Avro schemas • Cell values are mutable within an action, but immutable between action executions of a given PFA document • Persistent storage specified by pools • Closer in concept to a database • Pools values are mutable across action executions
  • 18. DBG / June 5, 2018 / © 2018 IBM Corporation Other Features Portable Format for Analytics • Special forms • Control structure – conditionals & loops • Creating and manipulating local variables • User-defined functions including lambdas • Casts • Null checks • (Very) basic try-catch, user-defined errors and logs • Comprehensive built-in function library • Math, strings, arrays, maps, stats, linear algebra • Built-in support for some common models - decision tree, clustering, linear models
  • 19. DBG / June 5, 2018 / © 2018 IBM Corporation Aardpfark PFA and Spark ML • PFA export for Spark ML pipelines • aardpfark-core: Scala DSL for creating PFA documents • avro4s to generate schemas from case classes; json4s to serialize PFA document to JSON • aardpfark-sparkml: uses DSL to export Spark ML components and pipelines to PFA
  • 20. DBG / June 5, 2018 / © 2018 IBM Corporation Aardpfark - Challenges PFA and Spark ML • Spark ML Model has no schema knowledge • E.g. Binarizer can operate on numeric or vector columns • Need to use Avro union types for standalone PFA components and handle all cases in the action logic • Combining components into a pipeline • Trying to match Spark’s DataFrame-based input/output behavior (typically appending columns) • Each component is wrapped as a user-defined function in the PFA document • Current approach mimics passing a Row (i.e. Avro record) from function to function, adding fields • Missing features in PFA • Generic vector support (mixed dense/sparse)
  • 21. DBG / June 5, 2018 / © 2018 IBM Corporation Related Open Standards
  • 22. DBG / June 5, 2018 / © 2018 IBM Corporation PMML Related Open Standards • Data Mining Group (DMG) • Model interchange format in XML with operators • Widely used and supported; open standard • Spark support lacking natively but 3rd party projects available: jpmml-sparkml • Comprehensive support for Spark ML components (perhaps surprisingly!) • Watch SPARK-11237 • Other exporters include scikit-learn, R, XGBoost and LightGBM • Shortcomings • Cannot represent arbitrary programs / analytic applications • Flexibility comes from custom plugins => lose benefits of standardization
  • 23. DBG / June 5, 2018 / © 2018 IBM Corporation MLeap Related Open Standards • Created by Combust.ML, a startup focused on ML model serving • Model interchange format in JSON / Protobuf • Components implemented in Scala code • Good performance • Initially focused on Spark ML. Offers almost complete support for Spark ML components • Recently added some sklearn; working on TensorFlow • Shortcomings • “Open” format, but not a “standard” • No concept of well-defined operators / functions • Effectively forces a tight coupling between versions of model producer / consumer • Must implement custom components in Scala • Impossible to statically verify serialized model
  • 24. DBG / June 5, 2018 / © 2018 IBM Corporation Open Neural Network Exchange (ONNX) • Championed by Facebook & Microsoft • Protobuf serialization format • Describes computation graph (including operators) • In this way the serialized graph is “self-describing” similarly to PFA • More focused on Deep Learning / tensor operations • Will be baked into PyTorch 1.0.0 / Caffe2 as the serialization & interchange format • Shortcomings • No or poor support for more “traditional” ML or language constructs (currently) • Tree-based models & ensembles • String / categorical processing • Control flow • Intermediate variables Related Open Standards
  • 25. DBG / June 5, 2018 / © 2018 IBM Corporation Performance
  • 26. DBG / June 5, 2018 / © 2018 IBM Corporation Scoring Performance Comparison • Comparing scoring performance of PFA with Spark and MLeap • PFA uses Hadrian reference implementation for JVM • Test dataset of ~80,000 records • String indexing of 47 categorical columns • Vector assembling the 47 categorical indices together with 27 numerical columns • Linear regression predictor • Note: Spark time is 1.9s / record (1901ms) - not shown on the chart Performance
  • 27. DBG / June 5, 2018 / © 2018 IBM Corporation Conclusion
  • 28. DBG / June 5, 2018 / © 2018 IBM Corporation Summary Conclusion • PFA provides an open standard for serialization and deployment of analytic workflows • True portability across languages, frameworks, runtimes and versions • Execution environment is independent of the producer (R, scikit-learn, Spark ML, weka, etc) • Solves a significant pain point for the deployment of Spark ML pipelines • Also benefits the wider ecosystem • e.g. many currently use PMML for exporting models from R, scikit-learn, XGBoost, LightGBM, etc. • However there are risks • PFA is still young and needs to gain adoption • Performance in production, at scale, is relatively untested • Tests indicate PFA reference engines need some work on robustness and performance • What about Deep Learning / comparison to ONNX? • Limitations of PFA • A standard can move slowly in terms of new features, fixes and enhancements
  • 29. DBG / June 5, 2018 / © 2018 IBM Corporation Open-sourcing Aardpfark Conclusion • Coverage • Almost all predictors (ML models) • Many feature transformers • Pipeline support (still needs work) • Equivalence tests Spark <-> PFA • Tests for core Scala DSL https://github.com/CODAIT/aardpfark • Need your help! • Finish implementing components • Improve pipeline support • Complete Scala DSL for PFA • Python support • Tests, docs, testing it out!
  • 30. DBG / June 5, 2018 / © 2018 IBM Corporation Future directions Conclusion • Initially focused on Spark ML pipelines • Later add support for scikit-learn pipelines, XGBoost, LightGBM, etc • (Support for many R models exist already in the Hadrian project) • Further performance testing in progress vs Spark & MLeap; also test vs PMML • More automated translation (Scala -> PFA, ASTs etc) • Propose improvements to PFA • Generic vector (tensor) support • Less cumbersome schema definitions • Performance improvements to scoring engine • PFA for Deep Learning? • Comparing to ONNX and other emerging standards • Better suited for the more general pre-processing steps of DL pipelines • Requires all the various DL-specific operators • Requires tensor schema and better tensor support built-in to the PFA spec • Should have GPU support
  • 31. DBG / June 5, 2018 / © 2018 IBM Corporation Thank you! codait.org twitter.com/MLnick github.com/MLnick developer.ibm.com/code FfDL Sign up for IBM Cloud and try Watson Studio! https://datascience.ibm.com/ MAX
  • 32. DBG / June 5, 2018 / © 2018 IBM Corporation Links & References Portable Format for Analytics PMML Spark MLlib – Saving and Loading Pipelines Hadrian – Reference Implementation of PFA Engines for JVM, Python, R jpmml-sparkml MLeap Open Neural Network Exchange
  • 33. DBG / June 5, 2018 / © 2018 IBM Corporation Date, Time, Location & Duration Session title and Speaker Tue, June 5 | 11 AM 2010-2012, 30 mins Productionizing Spark ML Pipelines with the Portable Format for Analytics Nick Pentreath (IBM) Tue, June 5 | 2 PM 2018, 30 mins Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing! Holden Karau (Google) Bryan Cutler (IBM) Tue, June 5 | 2 PM Nook by 2001, 30 mins Making Data and AI Accessible for All Armand Ruiz Gabernet (IBM) Tue, June 5 | 2:40 PM 2002-2004, 30 mins Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System Rajesh Bordawekar (IBM T.J. Watson Research Center) Tue, June 5 | 3:20 PM 3016-3022, 30 mins Dynamic Priorities for Apache Spark Application’s Resource Allocations Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.) Tue, June 5 | 3:20 PM 2001-2005, 30 mins Model Parallelism in Spark ML Cross-Validation Nick Pentreath (IBM) Bryan Cutler (IBM) Tue, June 5 | 3:20 PM 2007, 30 mins Serverless Machine Learning on Modern Hardware Using Apache Spark Patrick Stuedi (IBM) Tue, June 5 | 5:40 PM 2002-2004, 30 mins Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine; Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida) Tue, June 5 | 5:40 PM 2007, 30 mins Transparent GPU Exploitation on Apache Spark Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM) Tue, June 5 | 5:40 PM 2009-2011, 30 mins Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks Yonggang Hu (IBM) Chao Xue (IBM) IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
  • 34. DBG / June 5, 2018 / © 2018 IBM Corporation Date, Time, Location & Duration Session title and Speaker Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP) Wed, June 6 | 2 PM 2002-2004, 30 mins Deep Learning for Recommender Systems Nick Pentreath (IBM) ) Wed, June 6 | 3:20 PM 2018, 30 mins Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies) IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6) Meet us at IBM booth in the Expo area.
  • 35. DBG / June 5, 2018 / © 2018 IBM Corporation