Productionizing Spark ML Pipelines with PFA
Nick Pentreath
Principal Engineer
@MLnick
DBG / June 5, 2018 / © 2018 IBM Corporation
About
@MLnick on Twitter & Github
Principal Engineer, IBM
CODAIT - Center for Open-Source Data &
AI Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
Center for Open Source Data
and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
Lifecycle stages: Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model
Open source projects: Python Data Science Stack, Scikit-Learn, Pandas, Apache Spark, Jupyter, Keras + TensorFlow, Fabric for Deep Learning (FfDL), MLeap + PFA, Model Asset eXchange
Agenda
The Machine Learning Workflow
Challenges of ML Deployment
Portable Format for Analytics
PFA for Spark ML
Performance Comparisons
Summary and Future Directions
The Machine Learning Workflow
Perception
The Machine Learning Workflow
In reality the workflow spans teams …
The Machine Learning Workflow
… and tools …
The Machine Learning Workflow
… and is a small (but critical!)
piece of the puzzle
The Machine Learning Workflow
*Source: Hidden Technical Debt in Machine Learning Systems
Machine Learning Deployment
Challenges
• Proliferation of formats
• Lack of standardization => custom solutions
• Existing standards have limitations
Machine Learning Deployment
• Need to manage and bridge many different:
• Languages - Python, R, Notebooks, Scala / Java / C
• Frameworks – too many to count!
• Dependencies
• Versions
• Performance characteristics highly variable
across these dimensions
• Friction between teams
• Data scientists & researchers – latest & greatest
• Production – stability, control, performance
• Business – metrics, business impact, working product!
Spark solves many problems …
Machine Learning Deployment
… but introduces additional challenges
• Currently, in order to use trained models
outside of Spark, users must:
• Write custom readers for Spark’s native format; or
• Create their own custom format; or
• Export to a standard format (limited support => custom
solution)
• Scoring outside of Spark also requires custom
translation layer between Spark and another
ML library
Everything is custom!
Machine Learning Deployment
• Tight coupling to Spark runtime
• Introduces complex dependencies
• Managing version & compatibility issues
• Scoring models in Spark is slow
• Overhead of DataFrames, especially query planning
• Overhead of task scheduling, even locally
• Optimized for batch scoring (includes streaming “micro-batch” settings)
• Spark is not suitable for real-time scoring (latency under a few hundred ms)
The Portable Format for Analytics
Overview
• PFA is being championed by the Data Mining
Group (IBM is a founding member)
• DMG previously created PMML (Predictive
Model Markup Language), arguably the only
viable open standard currently
• PMML has many limitations
• PFA was created specifically to address these
shortcomings
• PFA consists of:
• JSON serialization format
• Avro schemas for data types
• Encodes functions (actions) that are applied to inputs to
create outputs with a set of built-in functions and
language constructs (e.g. control-flow, conditionals)
• Essentially a mini functional math language + schema
specification
• Type and function system means PFA can be
fully & statically verified on load and run by
any compliant execution engine
• => portability across languages, frameworks,
run times and versions
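To make the "mini functional math language" idea concrete, here is a minimal sketch in Python. The document adds 100 to a double input, in the style of introductory PFA examples. The `evaluate` and `run_action` helpers and the `FUNCTIONS` table are hypothetical toys that handle only numeric literals, symbol references and two arithmetic functions; they are not Hadrian or any real PFA engine, and they perform none of the static type verification a compliant engine would.

```python
import json

# A tiny PFA document: Avro types for input and output, plus one action.
pfa_json = """
{
  "input": "double",
  "output": "double",
  "action": [{"+": ["input", 100]}]
}
"""

# Toy subset of a built-in function library (assumption: illustration only).
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}


def evaluate(expr, symbols):
    """Evaluate one expression of the toy subset against a symbol table."""
    if isinstance(expr, (int, float)):
        return expr                      # numeric literal
    if isinstance(expr, str):
        return symbols[expr]             # symbol reference such as "input"
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()     # function call: {"name": [args...]}
        return FUNCTIONS[name](*(evaluate(a, symbols) for a in args))
    raise ValueError("unsupported expression: %r" % (expr,))


def run_action(document, datum):
    """Apply the action to one input; the last expression is the output."""
    result = None
    for expr in document["action"]:
        result = evaluate(expr, {"input": datum})
    return result


document = json.loads(pfa_json)
print(run_action(document, 3.0))
```

Because the whole computation lives in the JSON document, the producer only has to emit JSON, and any engine that implements the function library can run it.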
Portable Format for Analytics
A Simple Example
Portable Format for Analytics
• Example – multi-class logistic regression
• Specify input and output types using Avro
schemas
• Specify the action to perform (typically on
input)
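The three bullets above can be sketched as a PFA document built as a Python dictionary. This is an illustration, not the slide's original example: the 3-class, 2-feature coefficients are made up, the record name `LogisticModel` is arbitrary, and the library function names (`model.reg.linear`, `m.link.softmax`, `a.argmax`) are used on the assumption that they behave as their names suggest in the PFA function library. The `score` function is a hypothetical plain-Python reference for what the action is meant to compute.

```python
import json
import math

vector = {"type": "array", "items": "double"}

pfa_doc = {
    # Input and output types are Avro schemas.
    "input": vector,
    "output": "int",
    # Model state lives in a cell (made-up coefficients).
    "cells": {
        "model": {
            "type": {
                "type": "record",
                "name": "LogisticModel",
                "fields": [
                    {"name": "coeff", "type": {"type": "array", "items": vector}},
                    {"name": "const", "type": vector},
                ],
            },
            "init": {
                "coeff": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
                "const": [0.0, 0.0, 0.5],
            },
        }
    },
    # Action: linear margins -> softmax probabilities -> index of the best class.
    "action": [
        {"a.argmax": [
            {"m.link.softmax": [
                {"model.reg.linear": ["input", {"cell": "model"}]}
            ]}
        ]}
    ],
}


def score(features, model):
    """Plain-Python reference for the action above."""
    margins = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(model["coeff"], model["const"])
    ]
    peak = max(margins)                           # subtract max for stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs))


print(json.dumps(pfa_doc["action"]))
print(score([2.0, 0.5], pfa_doc["cells"]["model"]["init"]))
```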
Managing State
Portable Format for Analytics
• Data storage specified by cells
• A cell is a named value acting as a global variable
• Typically used to store state (such as model
coefficients, vocabulary mappings, etc)
• Types specified with Avro schemas
• Cell values are mutable within an action, but immutable
between action executions of a given PFA document
• Persistent storage specified by pools
• Closer in concept to a database
• Pool values are mutable across action executions
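As a rough sketch of a cell holding state, the document below stores a vocabulary mapping and looks up the input term in it. The vocabulary contents are invented, and the `cell` / `path` form is written as best understood from the PFA specification, so treat the exact syntax as an assumption. Pools are not shown. The `lookup` helper is a hypothetical plain-Python equivalent of the action.

```python
import json

pfa_doc = {
    "input": "string",
    "output": "int",
    "cells": {
        # A named global value: a made-up term -> index mapping.
        "vocabulary": {
            "type": {"type": "map", "values": "int"},
            "init": {"spark": 0, "pfa": 1, "avro": 2},
        }
    },
    # Read the cell and index into the map with the input string.
    "action": [{"cell": "vocabulary", "path": ["input"]}],
}


def lookup(document, term):
    """Plain-Python equivalent: index the cell's initial value by the term."""
    return document["cells"]["vocabulary"]["init"][term]


print(json.dumps(pfa_doc, indent=2))
print(lookup(pfa_doc, "pfa"))
```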
Other Features
Portable Format for Analytics
• Special forms
• Control structure – conditionals & loops
• Creating and manipulating local variables
• User-defined functions including lambdas
• Casts
• Null checks
• (Very) basic try-catch, user-defined errors and logs
• Comprehensive built-in function library
• Math, strings, arrays, maps, stats, linear algebra
• Built-in support for some common models - decision
tree, clustering, linear models
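A few of the special forms listed above can be sketched together in one small document: a user-defined function, a local variable and a conditional. The function name `square` and the cap value are invented for illustration, and the form names (`fcns`, `let`, `if` / `then` / `else`, the `u.` prefix for user functions) are written as understood from the PFA specification. `capped_square` is a hypothetical plain-Python equivalent.

```python
import json

pfa_doc = {
    "input": "double",
    "output": "double",
    # User-defined function, called below as "u.square".
    "fcns": {
        "square": {
            "params": [{"x": "double"}],
            "ret": "double",
            "do": [{"*": ["x", "x"]}],
        }
    },
    "action": [
        # Local variable holding the squared input.
        {"let": {"sq": {"u.square": ["input"]}}},
        # Conditional: cap the result at 1.0.
        {"if": {">": ["sq", 1.0]}, "then": 1.0, "else": "sq"},
    ],
}


def capped_square(x):
    """Plain-Python equivalent of the action above."""
    sq = x * x
    return 1.0 if sq > 1.0 else sq


print(json.dumps(pfa_doc["action"]))
print(capped_square(0.5), capped_square(3.0))
```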
Aardpfark
PFA and Spark ML
• PFA export for Spark ML pipelines
• aardpfark-core: Scala DSL for creating PFA
documents
• avro4s to generate schemas from case classes;
json4s to serialize PFA document to JSON
• aardpfark-sparkml: uses DSL to export Spark ML
components and pipelines to PFA
Aardpfark - Challenges
PFA and Spark ML
• Spark ML Model has no schema knowledge
• E.g. Binarizer can operate on numeric or vector
columns
• Need to use Avro union types for standalone PFA
components and handle all cases in the action logic
• Combining components into a pipeline
• Trying to match Spark’s DataFrame-based input/output
behavior (typically appending columns)
• Each component is wrapped as a user-defined function
in the PFA document
• Current approach mimics passing a Row (i.e. Avro
record) from function to function, adding fields
• Missing features in PFA
• Generic vector support (mixed dense/sparse)
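The two ideas above, union-typed inputs and passing a record from function to function, can be sketched in plain Python. This is not aardpfark code: the schema constant, column names, weights and helper functions are all hypothetical, and a dictionary stands in for the Avro record. Each stage returns a copy of the row with one field added, which mimics Spark's column-appending behavior.

```python
# Avro union schema a standalone Binarizer export might declare for its input
# (assumption: illustrative only, not copied from aardpfark).
BINARIZER_INPUT_SCHEMA = ["double", {"type": "array", "items": "double"}]


def binarizer(input_col, output_col, threshold):
    """Build a stage that handles both branches of the union type."""
    def stage(row):
        value = row[input_col]
        if isinstance(value, list):       # vector case
            result = [1.0 if v > threshold else 0.0 for v in value]
        else:                             # scalar case
            result = 1.0 if value > threshold else 0.0
        return {**row, output_col: result}
    return stage


def linear_predictor(input_col, output_col, weights, intercept):
    """Build a stage that appends a linear prediction to the row."""
    def stage(row):
        features = row[input_col]
        margin = sum(w * x for w, x in zip(weights, features)) + intercept
        return {**row, output_col: margin}
    return stage


def run_pipeline(stages, row):
    """Pass the row through each stage in order, accumulating fields."""
    for stage in stages:
        row = stage(row)
    return row


stages = [
    binarizer("raw", "features", threshold=0.5),
    linear_predictor("features", "prediction", weights=[2.0, -1.0, 0.5], intercept=0.1),
]
print(run_pipeline(stages, {"raw": [0.9, 0.2, 0.7]}))
```

The input field survives alongside the appended ones, so later stages can read any earlier column, at the cost of carrying the whole record through every function.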
Related Open Standards
PMML
Related Open Standards
• Data Mining Group (DMG)
• Model interchange format in XML with
operators
• Widely used and supported; open standard
• Spark support lacking natively but 3rd party
projects available: jpmml-sparkml
• Comprehensive support for Spark ML components
(perhaps surprisingly!)
• Watch SPARK-11237
• Other exporters include scikit-learn, R,
XGBoost and LightGBM
• Shortcomings
• Cannot represent arbitrary programs / analytic
applications
• Flexibility comes from custom plugins => lose
benefits of standardization
MLeap
Related Open Standards
• Created by Combust.ML, a startup focused on
ML model serving
• Model interchange format in JSON / Protobuf
• Components implemented in Scala code
• Good performance
• Initially focused on Spark ML. Offers almost
complete support for Spark ML components
• Recently added some sklearn; working on
TensorFlow
• Shortcomings
• “Open” format, but not a “standard”
• No concept of well-defined operators / functions
• Effectively forces a tight coupling between
versions of model producer / consumer
• Must implement custom components in Scala
• Impossible to statically verify serialized model
Open Neural Network Exchange (ONNX)
• Championed by Facebook & Microsoft
• Protobuf serialization format
• Describes computation graph (including
operators)
• In this way the serialized graph is “self-describing”
similarly to PFA
• More focused on Deep Learning / tensor
operations
• Will be baked into PyTorch 1.0.0 / Caffe2 as
the serialization & interchange format
• Shortcomings
• No or poor support for more “traditional” ML or
language constructs (currently)
• Tree-based models & ensembles
• String / categorical processing
• Control flow
• Intermediate variables
Related Open Standards
Performance
Scoring Performance Comparison
• Comparing scoring performance of PFA with
Spark and MLeap
• PFA uses Hadrian reference implementation
for JVM
• Test dataset of ~80,000 records
• String indexing of 47 categorical columns
• Vector assembling the 47 categorical indices together
with 27 numerical columns
• Linear regression predictor
• Note: Spark time is 1.9s / record (1901ms) -
not shown on the chart
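For readers who want to reproduce the shape of this test, the sketch below mirrors the pipeline (string indexing of 47 categorical columns, vector assembly with 27 numeric columns, linear regression) in plain Python and times it per record. It is not the benchmark from the talk and says nothing about Spark, MLeap or PFA timings: the label maps, weights, synthetic records and helper names are all made up.

```python
import time

N_CATEGORICAL, N_NUMERIC = 47, 27

# Hypothetical fitted state: one label -> index map per categorical column,
# plus made-up regression weights over the 74 assembled features.
indexers = [{"a": 0.0, "b": 1.0, "c": 2.0} for _ in range(N_CATEGORICAL)]
weights = [0.01] * (N_CATEGORICAL + N_NUMERIC)
intercept = 1.0


def assemble(record):
    """String-index the categorical columns and append the numeric ones."""
    indices = [indexers[i][label] for i, label in enumerate(record["categorical"])]
    return indices + record["numeric"]


def score(record):
    """Linear regression prediction over the assembled feature vector."""
    features = assemble(record)
    return sum(w * x for w, x in zip(weights, features)) + intercept


def mean_latency_ms(records):
    """Average wall-clock scoring time per record, in milliseconds."""
    start = time.perf_counter()
    for record in records:
        score(record)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(records)


records = [{"categorical": ["b"] * N_CATEGORICAL, "numeric": [0.5] * N_NUMERIC}] * 1000
print("mean latency per record: %.4f ms" % mean_latency_ms(records))
```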
Performance
Conclusion
Summary
Conclusion
• PFA provides an open standard for
serialization and deployment of analytic
workflows
• True portability across languages, frameworks,
runtimes and versions
• Execution environment is independent of the producer
(R, scikit-learn, Spark ML, Weka, etc.)
• Solves a significant pain point for the
deployment of Spark ML pipelines
• Also benefits the wider ecosystem
• e.g. many currently use PMML for exporting models
from R, scikit-learn, XGBoost, LightGBM, etc.
• However there are risks
• PFA is still young and needs to gain adoption
• Performance in production, at scale, is relatively
untested
• Tests indicate PFA reference engines need some work
on robustness and performance
• What about Deep Learning / comparison to ONNX?
• Limitations of PFA
• A standard can move slowly in terms of new features,
fixes and enhancements
Open-sourcing Aardpfark
Conclusion
• Coverage
• Almost all predictors (ML models)
• Many feature transformers
• Pipeline support (still needs work)
• Equivalence tests Spark <-> PFA
• Tests for core Scala DSL
https://github.com/CODAIT/aardpfark
• Need your help!
• Finish implementing components
• Improve pipeline support
• Complete Scala DSL for PFA
• Python support
• Tests, docs, testing it out!
Future directions
Conclusion
• Initially focused on Spark ML pipelines
• Later add support for scikit-learn pipelines, XGBoost,
LightGBM, etc
• (Support for many R models exists already in the
Hadrian project)
• Further performance testing in progress vs
Spark & MLeap; also test vs PMML
• More automated translation (Scala -> PFA,
ASTs etc)
• Propose improvements to PFA
• Generic vector (tensor) support
• Less cumbersome schema definitions
• Performance improvements to scoring engine
• PFA for Deep Learning?
• Comparing to ONNX and other emerging standards
• Better suited for the more general pre-processing steps
of DL pipelines
• Requires all the various DL-specific operators
• Requires tensor schema and better tensor support
built-in to the PFA spec
• Should have GPU support
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://datascience.ibm.com/
MAX
Links & References
Portable Format for Analytics
PMML
Spark MLlib – Saving and Loading Pipelines
Hadrian – Reference Implementation of PFA Engines for
JVM, Python, R
jpmml-sparkml
MLeap
Open Neural Network Exchange
IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
Date | Time | Location, Duration | Session title | Speaker(s)
Tue, June 5 | 11 AM | 2010-2012, 30 mins | Productionizing Spark ML Pipelines with the Portable Format for Analytics | Nick Pentreath (IBM)
Tue, June 5 | 2 PM | 2018, 30 mins | Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing! | Holden Karau (Google), Bryan Cutler (IBM)
Tue, June 5 | 2 PM | Nook by 2001, 30 mins | Making Data and AI Accessible for All | Armand Ruiz Gabernet (IBM)
Tue, June 5 | 2:40 PM | 2002-2004, 30 mins | Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System | Rajesh Bordawekar (IBM T.J. Watson Research Center)
Tue, June 5 | 3:20 PM | 3016-3022, 30 mins | Dynamic Priorities for Apache Spark Application’s Resource Allocations | Michael Feiman (IBM Spectrum Computing), Shinnosuke Okada (IBM Canada Ltd.)
Tue, June 5 | 3:20 PM | 2001-2005, 30 mins | Model Parallelism in Spark ML Cross-Validation | Nick Pentreath (IBM), Bryan Cutler (IBM)
Tue, June 5 | 3:20 PM | 2007, 30 mins | Serverless Machine Learning on Modern Hardware Using Apache Spark | Patrick Stuedi (IBM)
Tue, June 5 | 5:40 PM | 2002-2004, 30 mins | Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine | Sourav Mazumder (IBM Analytics), Aradhna Tiwari (University of South Florida)
Tue, June 5 | 5:40 PM | 2007, 30 mins | Transparent GPU Exploitation on Apache Spark | Dr. Kazuaki Ishizaki (IBM), Madhusudanan Kandasamy (IBM)
Tue, June 5 | 5:40 PM | 2009-2011, 30 mins | Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks | Yonggang Hu (IBM), Chao Xue (IBM)
IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
Date | Time | Location, Duration | Session title | Speaker(s)
Wed, June 6 | 12:50 PM | | Birds of a Feather: Apache Arrow in Spark and More | Bryan Cutler (IBM), Li Jin (Two Sigma Investments, LP)
Wed, June 6 | 2 PM | 2002-2004, 30 mins | Deep Learning for Recommender Systems | Nick Pentreath (IBM)
Wed, June 6 | 3:20 PM | 2018, 30 mins | Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer | Frederick Reiss (IBM), Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
Meet us at IBM booth in the Expo area.
DBG / June 5, 2018 / © 2018 IBM Corporation

More Related Content

What's hot

Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking VN
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningLviv Startup Club
 
How to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlHow to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlBryan Downing
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Databricks
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Group, Inc.
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward
 
Pentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare RuleenginePentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare Ruleengineuwe geercken
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink Forward
 
Next18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youNext18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youMárton Kodok
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformDatabricks
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...Databricks
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Flink Forward
 

What's hot (20)

Image Management at Ebates
Image Management at EbatesImage Management at Ebates
Image Management at Ebates
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine Learning
 
How to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlHow to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysql
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
Pentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare RuleenginePentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare Ruleengine
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
 
The HANA Cloud Platform
The HANA Cloud PlatformThe HANA Cloud Platform
The HANA Cloud Platform
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
 
Next18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youNext18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to you
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
 

Similar to Productionizing Spark ML Pipelines with PFA

Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Databricks
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsDataWorks Summit
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathChester Chen
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesNick Pentreath
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3DataWorks Summit
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeNick Pentreath
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...Amazon Web Services
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooksAndrey Vykhodtsev
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathChester Chen
 
iSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java MigrationiSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java Migrationecubemarketing
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance ObservationsAdam Roberts
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkAdamRobertsIBM
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...Jean Ihm
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016Tim Ellison
 
FDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real StoryFDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real StoryJoseph Alaimo Jr
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0Charles Rivet
 

Similar to Productionizing Spark ML Pipelines with PFA (20)

Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analytics
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Productionizing Spark ML Pipelines with PFA

  • 1. Productionizing Spark ML Pipelines with PFA Nick Pentreath Principal Engineer @MLnick DBG / June 5, 2018 / © 2018 IBM Corporation
  • 2. About
      • @MLnick on Twitter & GitHub
      • Principal Engineer, IBM
      • CODAIT: Center for Open-Source Data & AI Technologies
      • Machine Learning & AI
      • Apache Spark committer & PMC
      • Author of Machine Learning with Spark
      • Various conferences & meetups
  • 3. Center for Open Source Data and AI Technologies (CODAIT), codait.org
      • CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise
      • Relaunch of the Spark Technology Center (STC) to reflect expanded mission
      • [Diagram: Improving Enterprise AI Lifecycle in Open Source. Stages: Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model. Tools shown: Python Data Science Stack, scikit-learn, pandas, Apache Spark, Jupyter, Keras + TensorFlow, Fabric for Deep Learning (FfDL), MLeap + PFA, Model Asset eXchange]
  • 4. Agenda
      • The Machine Learning Workflow
      • Challenges of ML Deployment
      • Portable Format for Analytics
      • PFA for Spark ML
      • Performance Comparisons
      • Summary and Future Directions
  • 5. The Machine Learning Workflow
  • 6. The Machine Learning Workflow: Perception
  • 7. The Machine Learning Workflow: In reality the workflow spans teams …
  • 8. The Machine Learning Workflow: … and tools …
  • 9. The Machine Learning Workflow: … and is a small (but critical!) piece of the puzzle (*Source: Hidden Technical Debt in Machine Learning Systems)
  • 10. Machine Learning Deployment
  • 11. Machine Learning Deployment: Challenges
      • Proliferation of formats
      • Lack of standardization => custom solutions
      • Existing standards have limitations
      • Need to manage and bridge many different:
          • Languages: Python, R, Notebooks, Scala / Java / C
          • Frameworks: too many to count!
          • Dependencies
          • Versions
      • Performance characteristics highly variable across these dimensions
      • Friction between teams
          • Data scientists & researchers: latest & greatest
          • Production: stability, control, performance
          • Business: metrics, business impact, working product!
  • 12. Machine Learning Deployment: Spark solves many problems …
  • 13. Machine Learning Deployment: … but introduces additional challenges
      • Currently, in order to use trained models outside of Spark, users must:
          • Write custom readers for Spark’s native format; or
          • Create their own custom format; or
          • Export to a standard format (limited support => custom solution)
      • Scoring outside of Spark also requires a custom translation layer between Spark and another ML library
      • Everything is custom!
      • Tight coupling to Spark runtime
          • Introduces complex dependencies
          • Managing version & compatibility issues
      • Scoring models in Spark is slow
          • Overhead of DataFrames, especially query planning
          • Overhead of task scheduling, even locally
          • Optimized for batch scoring (includes streaming “micro-batch” settings)
      • Spark is not suitable for real-time scoring (< few 100 ms latency)
  • 14. The Portable Format for Analytics
  • 15. Portable Format for Analytics: Overview
      • PFA is being championed by the Data Mining Group (IBM is a founding member)
      • DMG previously created PMML (Predictive Model Markup Language), arguably the only viable open standard currently
          • PMML has many limitations
          • PFA was created specifically to address these shortcomings
      • PFA consists of:
          • JSON serialization format
          • Avro schemas for data types
          • Encodes functions (actions) that are applied to inputs to create outputs with a set of built-in functions and language constructs (e.g. control-flow, conditionals)
          • Essentially a mini functional math language + schema specification
      • Type and function system means PFA can be fully & statically verified on load and run by any compliant execution engine
      • => portability across languages, frameworks, run times and versions
  • 16. Portable Format for Analytics: A Simple Example
      • Example: multi-class logistic regression
      • Specify input and output types using Avro schemas
      • Specify the action to perform (typically on input)
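The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the document shape the bullets describe: Avro schemas for `input` and `output`, plus an `action`. It uses a deliberately simpler action (add 100 to the input) rather than the slide's multi-class logistic regression. The `toy_evaluate` and `toy_run` helpers are hypothetical and understand only this one expression; they are not a compliant PFA engine such as Hadrian.

```python
import json

# A minimal PFA document: Avro schemas for input and output, plus an action.
# The action is a list of expressions; the last one yields the output.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}


def toy_evaluate(expr, datum):
    """Hypothetical toy evaluator: handles only 'input', numbers and '+'."""
    if expr == "input":
        return datum
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, dict) and list(expr) == ["+"]:
        left, right = expr["+"]
        return toy_evaluate(left, datum) + toy_evaluate(right, datum)
    raise ValueError(f"unsupported expression: {expr!r}")


def toy_run(document, datum):
    """Evaluate each action expression in order and return the last result."""
    result = None
    for expr in document["action"]:
        result = toy_evaluate(expr, datum)
    return result


# PFA documents are plain JSON, so they serialize with the standard library.
serialized = json.dumps(pfa_document)
print(serialized)
print(toy_run(json.loads(serialized), 1.5))  # prints 101.5
```

Because the whole model is data (JSON plus schemas), the producer and the scoring engine only need to agree on the PFA specification, not on a shared runtime.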
  • 17. Portable Format for Analytics: Managing State
      • Data storage specified by cells
          • A cell is a named value acting as a global variable
          • Typically used to store state (such as model coefficients, vocabulary mappings, etc.)
          • Types specified with Avro schemas
          • Cell values are mutable within an action, but immutable between action executions of a given PFA document
      • Persistent storage specified by pools
          • Closer in concept to a database
          • Pool values are mutable across action executions
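To make the cell idea concrete, the sketch below stores linear-model coefficients in a cell named `model` and references it from the action. The cell name, the record name `LinearModel` and the coefficient values are made up for illustration. The `model.reg.linear` call and its `coeff` / `const` record fields follow my reading of the PFA function library and should be checked against the specification. `toy_score` is a hypothetical stand-in that computes the same dot product in plain Python; it is not a PFA engine.

```python
import json

# Avro record schema for the model state held in the cell.
model_schema = {
    "type": "record",
    "name": "LinearModel",  # illustrative name
    "fields": [
        {"name": "coeff", "type": {"type": "array", "items": "double"}},
        {"name": "const", "type": "double"},
    ],
}

pfa_document = {
    "input": {"type": "array", "items": "double"},
    "output": "double",
    # A cell: a named, typed value with an initial state.
    "cells": {
        "model": {
            "type": model_schema,
            "init": {"coeff": [0.5, -1.0, 2.0], "const": 0.25},
        }
    },
    # Assumed built-in: linear predictor applied to the input and the cell.
    "action": [{"model.reg.linear": ["input", {"cell": "model"}]}],
}


def toy_score(document, features):
    """Hypothetical stand-in for an engine: dot(coeff, features) + const."""
    model = document["cells"]["model"]["init"]
    return sum(c * x for c, x in zip(model["coeff"], features)) + model["const"]


print(json.dumps(pfa_document, indent=2))
print(toy_score(pfa_document, [1.0, 2.0, 3.0]))  # prints 4.75
```

Keeping the coefficients in a cell rather than inline in the action separates model state from scoring logic, which is what lets an exporter reuse one action template for many trained models.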
  • 18. Portable Format for Analytics: Other Features
      • Special forms
          • Control structure: conditionals & loops
          • Creating and manipulating local variables
          • User-defined functions including lambdas
          • Casts
          • Null checks
          • (Very) basic try-catch, user-defined errors and logs
      • Comprehensive built-in function library
          • Math, strings, arrays, maps, stats, linear algebra
          • Built-in support for some common models: decision tree, clustering, linear models
  • 19. PFA and Spark ML: Aardpfark
      • PFA export for Spark ML pipelines
      • aardpfark-core: Scala DSL for creating PFA documents
          • avro4s to generate schemas from case classes; json4s to serialize PFA document to JSON
      • aardpfark-sparkml: uses DSL to export Spark ML components and pipelines to PFA
  • 20. PFA and Spark ML: Aardpfark Challenges
      • Spark ML Model has no schema knowledge
          • E.g. Binarizer can operate on numeric or vector columns
          • Need to use Avro union types for standalone PFA components and handle all cases in the action logic
      • Combining components into a pipeline
          • Trying to match Spark’s DataFrame-based input/output behavior (typically appending columns)
          • Each component is wrapped as a user-defined function in the PFA document
          • Current approach mimics passing a Row (i.e. Avro record) from function to function, adding fields
      • Missing features in PFA
          • Generic vector support (mixed dense/sparse)
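The two challenges above (union-typed inputs and Row passing) can be illustrated in plain Python. This is only a conceptual sketch, not Aardpfark's actual implementation, which is a Scala DSL that emits PFA. The names `binarizer`, `pipeline`, `BINARIZER_INPUT_SCHEMA` and the column names are hypothetical. Each stage takes a row (a dict standing in for an Avro record) and returns a copy with one field appended, and the Binarizer stage branches on whether its input is a scalar or a vector.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Avro union schema a standalone Binarizer export would need, since the
# Spark model itself does not know whether its column is numeric or a vector.
BINARIZER_INPUT_SCHEMA = ["double", {"type": "array", "items": "double"}]


def binarizer(input_col: str, output_col: str, threshold: float) -> Callable[[Row], Row]:
    """Build a stage mapping values > threshold to 1.0, everything else to 0.0."""
    def transform(row: Row) -> Row:
        value = row[input_col]
        if isinstance(value, list):
            # Vector branch of the union.
            result = [1.0 if v > threshold else 0.0 for v in value]
        else:
            # Scalar branch of the union.
            result = 1.0 if value > threshold else 0.0
        # Mimic Spark's behavior of appending a column: copy and add a field.
        return {**row, output_col: result}
    return transform


def pipeline(stages: List[Callable[[Row], Row]]) -> Callable[[Row], Row]:
    """Chain stages so each receives the row produced by the previous one."""
    def run(row: Row) -> Row:
        for stage in stages:
            row = stage(row)
        return row
    return run


scorer = pipeline([
    binarizer("age", "age_bin", 30.0),
    binarizer("vec", "vec_bin", 0.5),
])
print(scorer({"age": 42.0, "vec": [0.1, 0.9]}))
```

The cost of this design is visible even in the sketch: every stage must carry the full record forward and handle every branch of the union, which is part of why the slide lists pipeline support as unfinished.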
  • 21. Related Open Standards
  • 22. Related Open Standards: PMML
      • Data Mining Group (DMG)
      • Model interchange format in XML with operators
      • Widely used and supported; open standard
      • Spark support lacking natively but 3rd party projects available: jpmml-sparkml
          • Comprehensive support for Spark ML components (perhaps surprisingly!)
          • Watch SPARK-11237
      • Other exporters include scikit-learn, R, XGBoost and LightGBM
      • Shortcomings
          • Cannot represent arbitrary programs / analytic applications
          • Flexibility comes from custom plugins => lose benefits of standardization
  • 23. Related Open Standards: MLeap
      • Created by Combust.ML, a startup focused on ML model serving
      • Model interchange format in JSON / Protobuf
      • Components implemented in Scala code
      • Good performance
      • Initially focused on Spark ML. Offers almost complete support for Spark ML components
      • Recently added some sklearn; working on TensorFlow
      • Shortcomings
          • “Open” format, but not a “standard”
          • No concept of well-defined operators / functions
          • Effectively forces a tight coupling between versions of model producer / consumer
          • Must implement custom components in Scala
          • Impossible to statically verify serialized model
  • 24. Related Open Standards: Open Neural Network Exchange (ONNX)
      • Championed by Facebook & Microsoft
      • Protobuf serialization format
      • Describes computation graph (including operators)
          • In this way the serialized graph is “self-describing” similarly to PFA
      • More focused on Deep Learning / tensor operations
      • Will be baked into PyTorch 1.0.0 / Caffe2 as the serialization & interchange format
      • Shortcomings
          • No or poor support for more “traditional” ML or language constructs (currently)
              • Tree-based models & ensembles
              • String / categorical processing
              • Control flow
              • Intermediate variables
  • 25. Performance
  • 26. Performance: Scoring Performance Comparison
      • Comparing scoring performance of PFA with Spark and MLeap
      • PFA uses Hadrian reference implementation for JVM
      • Test dataset of ~80,000 records
      • String indexing of 47 categorical columns
      • Vector assembling the 47 categorical indices together with 27 numerical columns
      • Linear regression predictor
      • Note: Spark time is 1.9s / record (1901ms), not shown on the chart
  • 27. Conclusion
  • 28. Conclusion: Summary
      • PFA provides an open standard for serialization and deployment of analytic workflows
          • True portability across languages, frameworks, runtimes and versions
          • Execution environment is independent of the producer (R, scikit-learn, Spark ML, weka, etc.)
      • Solves a significant pain point for the deployment of Spark ML pipelines
      • Also benefits the wider ecosystem
          • e.g. many currently use PMML for exporting models from R, scikit-learn, XGBoost, LightGBM, etc.
      • However there are risks
          • PFA is still young and needs to gain adoption
          • Performance in production, at scale, is relatively untested
          • Tests indicate PFA reference engines need some work on robustness and performance
          • What about Deep Learning / comparison to ONNX?
          • Limitations of PFA
          • A standard can move slowly in terms of new features, fixes and enhancements
  • 29. Conclusion: Open-sourcing Aardpfark
      • https://github.com/CODAIT/aardpfark
      • Coverage
          • Almost all predictors (ML models)
          • Many feature transformers
          • Pipeline support (still needs work)
          • Equivalence tests Spark <-> PFA
          • Tests for core Scala DSL
      • Need your help!
          • Finish implementing components
          • Improve pipeline support
          • Complete Scala DSL for PFA
          • Python support
          • Tests, docs, testing it out!
  • 30. Conclusion: Future directions
      • Initially focused on Spark ML pipelines
      • Later add support for scikit-learn pipelines, XGBoost, LightGBM, etc.
          • (Support for many R models exists already in the Hadrian project)
      • Further performance testing in progress vs Spark & MLeap; also test vs PMML
      • More automated translation (Scala -> PFA, ASTs etc.)
      • Propose improvements to PFA
          • Generic vector (tensor) support
          • Less cumbersome schema definitions
          • Performance improvements to scoring engine
      • PFA for Deep Learning?
          • Comparing to ONNX and other emerging standards
          • Better suited for the more general pre-processing steps of DL pipelines
          • Requires all the various DL-specific operators
          • Requires tensor schema and better tensor support built-in to the PFA spec
          • Should have GPU support
  • 31. Thank you!
      • codait.org
      • twitter.com/MLnick
      • github.com/MLnick
      • developer.ibm.com/code
      • FfDL, MAX
      • Sign up for IBM Cloud and try Watson Studio! https://datascience.ibm.com/
  • 32. Links & References
      • Portable Format for Analytics
      • PMML
      • Spark MLlib: Saving and Loading Pipelines
      • Hadrian: Reference Implementation of PFA Engines for JVM, Python, R
      • jpmml-sparkml
      • MLeap
      • Open Neural Network Exchange
  • 33. IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
      • Each entry: date | time | location, duration | session title | speakers
      • Tue, June 5 | 11 AM | 2010-2012, 30 mins | Productionizing Spark ML Pipelines with the Portable Format for Analytics | Nick Pentreath (IBM)
      • Tue, June 5 | 2 PM | 2018, 30 mins | Making PySpark Amazing: From Faster UDFs to Dependency Management and Graphing! | Holden Karau (Google), Bryan Cutler (IBM)
      • Tue, June 5 | 2 PM | Nook by 2001, 30 mins | Making Data and AI Accessible for All | Armand Ruiz Gabernet (IBM)
      • Tue, June 5 | 2:40 PM | 2002-2004, 30 mins | Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System | Rajesh Bordawekar (IBM T.J. Watson Research Center)
      • Tue, June 5 | 3:20 PM | 3016-3022, 30 mins | Dynamic Priorities for Apache Spark Application’s Resource Allocations | Michael Feiman (IBM Spectrum Computing), Shinnosuke Okada (IBM Canada Ltd.)
      • Tue, June 5 | 3:20 PM | 2001-2005, 30 mins | Model Parallelism in Spark ML Cross-Validation | Nick Pentreath (IBM), Bryan Cutler (IBM)
      • Tue, June 5 | 3:20 PM | 2007, 30 mins | Serverless Machine Learning on Modern Hardware Using Apache Spark | Patrick Stuedi (IBM)
      • Tue, June 5 | 5:40 PM | 2002-2004, 30 mins | Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine | Sourav Mazumder (IBM Analytics), Aradhna Tiwari (University of South Florida)
      • Tue, June 5 | 5:40 PM | 2007, 30 mins | Transparent GPU Exploitation on Apache Spark | Dr. Kazuaki Ishizaki (IBM), Madhusudanan Kandasamy (IBM)
      • Tue, June 5 | 5:40 PM | 2009-2011, 30 mins | Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks | Yonggang Hu (IBM), Chao Xue (IBM)
  • 34. IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
      • Each entry: date | time | location, duration | session title | speakers
      • Wed, June 6 | 12:50 PM | (location not listed) | Birds of a Feather: Apache Arrow in Spark and More | Bryan Cutler (IBM), Li Jin (Two Sigma Investments, LP)
      • Wed, June 6 | 2 PM | 2002-2004, 30 mins | Deep Learning for Recommender Systems | Nick Pentreath (IBM)
      • Wed, June 6 | 3:20 PM | 2018, 30 mins | Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer | Frederick Reiss (IBM), Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
      • Meet us at IBM booth in the Expo area.