@s_kontopoulos
Machine Learning at Scale: Challenges
and Solutions
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
skonto
s_kontopoulos
S. Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
SlideShare stavroskontopoulos
stavroskontopoulos
All trademarks and registered trademarks are property of their respective holders.
Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study
ML in the Enterprise
ML is a key tool that fuels the effort of coupling business monitoring (BI) with
predictive and prescriptive analytics.
business insights -> business optimization -> data monetization
ML in the Enterprise - The Data-Science LifeCycle
Identify Business Question
Identify and collect related Data
Data cleansing, feature extraction (Data pre-processing)
Experiment planning
Model Building
Model Evaluation
Model Deployment/Management in Production
Model Optimization - Performance
Machine Learning Model
A model is a function that maps inputs to outputs and essentially expresses a
mathematical abstraction.
Linear Regression:
Neural Network:
Random Forest:
Function composition
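The formulas on this slide did not survive extraction. As a rough sketch of the idea that a model is a function, and that larger models are built by composing functions, the Python below uses hypothetical weights and hand-written "stumps" in place of trained trees; none of these names come from the talk.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]

def linear_regression(weights: Vector, bias: float) -> Model:
    """Return the function f(x) = w . x + b."""
    def predict(x: Vector) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict

def compose(outer: Callable[[float], float], inner: Model) -> Model:
    """Function composition: (outer . inner)(x) = outer(inner(x))."""
    return lambda x: outer(inner(x))

def relu(z: float) -> float:
    return max(0.0, z)

def average(models: Sequence[Model]) -> Model:
    """Average the predictions, the way a regression forest averages its trees."""
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical parameters, chosen only for illustration.
linear = linear_regression(weights=[2.0, -1.0], bias=0.5)
neuron = compose(relu, linear)  # one neural unit: activation after a linear map
stumps = [lambda x: 1.0 if x[0] > 0 else 0.0,
          lambda x: 1.0 if x[1] > 0 else 0.0]
forest = average(stumps)

print(linear([1.0, 3.0]))   # 2*1 - 1*3 + 0.5 = -0.5
print(neuron([1.0, 3.0]))   # relu(-0.5) = 0.0
print(forest([1.0, -2.0]))  # (1.0 + 0.0) / 2 = 0.5
```

Each model has the same shape (inputs in, prediction out), which is what lets a pipeline or a model server treat them uniformly.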
Model Evolution
- Models can be either pre-computed (e.g. trained off-line) or updated on-line.
- Online ML with streaming:
- Pure online learning uses only the most recently arrived data point to update the model. In practice, models are usually updated per batch/window, as in online k-means.
- An interesting case is when we sample the stream and train a model only when the distribution changes.
- Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling
- Re-train the model from scratch, ignoring the previous one.
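To make the "pure online" update concrete, here is a minimal sketch of per-point SGD for a one-feature linear model, with optional random sampling of the stream. The stream, learning rate and sample rate are assumptions made for the example, not values from the talk.

```python
import random

def sgd_step(w, b, x, y, lr):
    """One stochastic gradient step for squared error on a single point."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error

def train_online(stream, lr=0.05, sample_rate=1.0, seed=0):
    """Update the model point by point; optionally use only a random sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for x, y in stream:
        if rng.random() < sample_rate:  # random sampling of the stream
            w, b = sgd_step(w, b, x, y, lr)
    return w, b

# Hypothetical noiseless stream generated from y = 3x + 1, with x in [0, 1].
stream = [((i % 11) / 10.0, 3.0 * ((i % 11) / 10.0) + 1.0) for i in range(5000)]
w, b = train_online(stream)
print(round(w, 2), round(b, 2))
```

A per-batch or per-window variant would collect several points and apply the averaged gradient once, trading update latency for a less noisy step.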
Machine Learning Pipeline
Machine learning pipeline in Production: describes all steps from data
preprocessing before feeding the model to model output processing
(post-processing).
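A minimal sketch of such a pipeline, assuming a hypothetical record layout and a fixed linear score standing in for a trained model:

```python
from typing import Callable, List

class Pipeline:
    """Chain of steps applied in order. The same chain is used for
    evaluation on test data and for scoring in production."""

    def __init__(self, steps: List[Callable]):
        self.steps = steps

    def run(self, record):
        for step in self.steps:
            record = step(record)
        return record

def preprocess(raw: dict) -> list:
    # Hypothetical feature extraction: scale the age, encode a flag.
    return [raw["age"] / 100.0, 1.0 if raw["premium"] else 0.0]

def model(features: list) -> float:
    # Stand-in for a trained model: a fixed linear score.
    return 0.8 * features[0] + 0.4 * features[1]

def postprocess(score: float) -> str:
    # Turn the raw score into a business decision.
    return "accept" if score >= 0.5 else "reject"

pipeline = Pipeline([preprocess, model, postprocess])
print(pipeline.run({"age": 40, "premium": True}))   # score 0.72 -> accept
print(pipeline.run({"age": 40, "premium": False}))  # score 0.32 -> reject
```

Because pre- and post-processing travel with the model, serving the pipeline as one unit avoids re-implementing the feature logic on the production side.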
Machine Learning Pipeline in Libraries
Pros:
- Data and test data go through the same steps
- Like a CI (continuous integration) pipeline people can reason about data
transformation
- Caching of computations
- Model serving becomes easier
Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models:
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
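Three of these patterns (segmentation, ensemble, chaining) can be sketched as plain function combinators. This is an illustrative reading of the terms, not the PMML semantics in full; the transaction record and thresholds are hypothetical.

```python
from collections import Counter

def segmentation(segments, default):
    """Pick one model per record, using the first matching predicate."""
    def predict(x):
        for predicate, model in segments:
            if predicate(x):
                return model(x)
        return default(x)
    return predict

def ensemble(models):
    """Majority vote over the predictions of all models."""
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

def chain(first, second):
    """Feed the output of the first model into the second."""
    return lambda x: second(first(x))

# Hypothetical rule-like models over a transaction record.
small = lambda x: "ok" if x["amount"] < 100 else "review"
strict = lambda x: "ok" if x["amount"] < 50 else "review"
lenient = lambda x: "ok" if x["amount"] < 500 else "review"

by_region = segmentation([(lambda x: x["region"] == "EU", strict)], default=lenient)
vote = ensemble([small, strict, lenient])
chained = chain(lambda x: x["amount"] / 1000.0,              # risk score
                lambda s: "review" if s > 0.2 else "ok")     # decision

tx = {"region": "EU", "amount": 80}
print(by_region(tx))  # EU segment uses the strict model -> review
print(vote(tx))       # two of three models say ok -> ok
print(chained(tx))    # risk score 0.08 -> ok
```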
Model Development & Production
[Figure: the Data Scientist and the Data Engineer, each shown with the logos of their respective tools (including Go)]
Model Standardization
[Figure: an ML framework exports a model definition, for example as PFA (Portable Format for Analytics); an evaluation engine imports it and turns data into predictions.]
Model Standardization
- Standard formats such as PFA or PMML won’t break the pipeline. PFA is more flexible than PMML:
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some Implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
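The export/import idea behind these formats can be sketched with a toy, made-up JSON layout. The layout below is an assumption for illustration only: it is neither PMML nor PFA, and the function names are hypothetical.

```python
import json

def export_model(weights, bias):
    """Serialize a linear model to a framework-neutral JSON document."""
    return json.dumps({"type": "linear", "weights": weights, "bias": bias})

def import_model(document):
    """Rebuild a scoring function from the exported document."""
    spec = json.loads(document)
    if spec["type"] != "linear":
        raise ValueError("unsupported model type: " + spec["type"])
    weights, bias = spec["weights"], spec["bias"]
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias

document = export_model([0.5, 2.0], bias=1.0)  # training side
score = import_model(document)                 # serving side
print(score([2.0, 1.0]))  # 0.5*2 + 2*1 + 1 = 4.0
```

The serving side depends only on the document format, not on the framework that trained the model. That is the property a standard buys, and the one a custom format has to maintain by hand.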
Model Lifecycle
Some concerns about model lifecycle:
- Model evolution
- Model release practices
- Model versioning
- Model update process
Model Governance
Models should:
● be governed by the company’s policies and procedures, by laws and regulations, and by the organization’s goals
● be searchable across the company
● be transparent, explainable, traceable and interpretable for auditors and regulators. Example GDPR requirements:
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● have an approval and release process
Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Images: Clipper (https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/), TensorFlow Serving (https://www.tensorflow.org/serving/)
Model Serving - Requirements
Other requirements:
- Response time: the time to calculate a prediction. It could be a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models. It is very common to run hundreds of models, e.g. a telecom operator with one model per customer, or an IoT deployment with one model per site/sensor.
Model Serving - Requirements
- Multiple versions of the same machine learning pipeline within the system. One reason can be A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
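Several of these requirements (many models, multiple versions, A/B testing, update without downtime) can be sketched as an in-memory registry. The class, the model names and the A/B rule are hypothetical, and a real model server would add persistence, metrics and a network API.

```python
import threading

class ModelServer:
    """Toy registry: many named models, several versions each."""

    def __init__(self):
        self._models = {}   # name -> {version: callable}
        self._active = {}   # name -> currently active version
        self._lock = threading.Lock()

    def deploy(self, name, version, model, activate=True):
        """Add a version; switching the active one is a single atomic step."""
        with self._lock:
            self._models.setdefault(name, {})[version] = model
            if activate or name not in self._active:
                self._active[name] = version

    def predict(self, name, x, version=None):
        with self._lock:
            chosen = version if version is not None else self._active[name]
            model = self._models[name][chosen]
        return chosen, model(x)

def ab_version(user_id, a="v1", b="v2", b_share=0.5):
    """Deterministic A/B split on an integer user id."""
    return b if (user_id % 100) < b_share * 100 else a

server = ModelServer()
server.deploy("churn/site-1", "v1", lambda x: x * 2)
server.deploy("churn/site-1", "v2", lambda x: x * 3, activate=False)

print(server.predict("churn/site-1", 10))                         # ('v1', 20)
print(server.predict("churn/site-1", 10, version=ab_version(7)))  # ('v2', 30)
server.deploy("churn/site-1", "v2", lambda x: x * 3)              # promote v2
print(server.predict("churn/site-1", 10))                         # ('v2', 30)
```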
TensorFlow Serving Issues
Not all systems cover the requirements. For example:
● Metadata not available: https://github.com/tensorflow/serving/issues/612
● No new models at runtime: https://github.com/tensorflow/serving/issues/422
● Can be hard to build from scratch: https://github.com/tensorflow/serving/issues/327
Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam model, with low latency compared to Spark.
Model Serving with Apache Flink
Idea: Exploit Flink’s low-latency capabilities for serving models. Focus on models trained offline: load them from persistent storage and update them without interrupting serving.
FLIP Proposal:
https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
Model Serving with Apache Flink
Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams.
Flink provides two ways of implementing low-level joins: a key-based join built on CoProcessFunction and a partition-based join built on RichCoFlatMapFunction.
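The control-stream/data-stream pattern can be simulated in plain Python. This is a single-process sketch of the idea, not Flink code: the operator class, its method names and the event tuples are assumptions for illustration.

```python
class ModelServingOperator:
    """Keeps the current model per key, as a keyed two-input operator would."""

    def __init__(self):
        self.state = {}  # key -> model (a callable)

    def process_control(self, key, model):
        # A new model arrives on the control stream: swap it in, no restart.
        self.state[key] = model

    def process_data(self, key, value):
        model = self.state.get(key)
        if model is None:
            return None  # no model has been published for this key yet
        return model(value)

def run(events):
    """Consume a merged sequence of (stream, key, payload) events."""
    operator = ModelServingOperator()
    outputs = []
    for stream, key, payload in events:
        if stream == "control":
            operator.process_control(key, payload)
        else:
            result = operator.process_data(key, payload)
            if result is not None:
                outputs.append((key, result))
    return outputs

events = [
    ("data", "wine", 1.0),                    # dropped: no model yet
    ("control", "wine", lambda v: v + 1.0),   # first model
    ("data", "wine", 1.0),
    ("control", "wine", lambda v: v * 10.0),  # model update
    ("data", "wine", 1.0),
]
print(run(events))  # [('wine', 2.0), ('wine', 10.0)]
```

In Flink itself the model would live in managed operator state so that it is checkpointed and restored on failure; the dictionary above only mimics that role.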
Model Serving with Apache Flink
More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
Data Lakes
How can we work with data so as to cover future needs and use cases? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work.
Data lakes to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia
Data Lakes
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.
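"Schema on read" means the raw data are stored as they arrive and each consumer applies its own schema when reading. A small sketch, assuming hypothetical JSON-lines sensor records:

```python
import json

# Raw records as they landed in the lake: inconsistent types, extra
# fields, and even an unparseable line.
RAW_LINES = [
    '{"sensor": "s1", "temp": "21.5", "site": "athens"}',
    '{"sensor": "s2", "temp": 19, "extra": {"fw": "1.2"}}',
    'not json at all',
]

def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored raw data stay untouched."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # skip records that do not fit this schema

# Two consumers read the same raw data with different schemas.
temperatures = list(read_with_schema(RAW_LINES, {"sensor": str, "temp": float}))
locations = list(read_with_schema(RAW_LINES, {"sensor": str, "site": str}))
print(temperatures)
print(locations)
```

A data warehouse would instead enforce one schema at write time, which is why new use cases there often require reloading or remodelling the data.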
Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.
Notebooks
Very convenient for the data scientist or the analyst.
Production is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter
ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning
- Rich Model support
- Multiple language support (Scala, Java, Python, R)
Apache Spark - Intro
A framework for distributed in-memory data processing.
Apache Spark - Intro
- The user defines computations/operations (map, flatMap, etc.) on the data sets (bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data reside, the computation is executed there, and the results are sent back to the user.
- The data sets are treated as immutable distributed data (RDDs).
- A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects.
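The combination of immutable data sets and a lazily executed plan can be imitated in a few lines of single-process Python. This is only a sketch of the idea: the class is hypothetical, the plan is a linear chain rather than a general DAG, and nothing is distributed.

```python
class LazyDataset:
    """Immutable description of a computation; nothing runs until collect()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, f):
        # Each transformation returns a NEW dataset with a longer plan.
        return LazyDataset(self._source, self._ops + (("map", f),))

    def flat_map(self, f):
        return LazyDataset(self._source, self._ops + (("flat_map", f),))

    def filter(self, f):
        return LazyDataset(self._source, self._ops + (("filter", f),))

    def collect(self):
        """Execute the recorded plan and return the results to the caller."""
        data = list(self._source)
        for kind, f in self._ops:
            if kind == "map":
                data = [f(x) for x in data]
            elif kind == "flat_map":
                data = [y for x in data for y in f(x)]
            else:
                data = [x for x in data if f(x)]
        return data

lines = LazyDataset(["a b", "c"])
words = lines.flat_map(str.split)  # plan only, no work done yet
upper = words.map(str.upper)
print(upper.collect())  # ['A', 'B', 'C']
print(lines.collect())  # ['a b', 'c'] (the original is unchanged)
```

Because every data set is just its source plus the plan that produced it, a lost partition can be recomputed from that lineage, which is where the "resilient" in RDD comes from.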
Apache Spark - Basic Example in Scala
[Code screenshot: basic statistics, a "hello world" for ML]
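The Scala code on this slide was an image and is not recoverable. As a stand-in, the plain-Python sketch below computes the same kind of summary (per-column mean and variance, plus a correlation) that spark.mllib exposes through Statistics.colStats and Statistics.corr. The rows are made-up data and this is not the talk's code.

```python
import math

def col_stats(rows):
    """Per-column mean and sample variance of a list of equal-length rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / (n - 1)
                 for c, m in zip(cols, means)]
    return means, variances

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sx * sy)

# Hypothetical observations: two perfectly correlated columns.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
means, variances = col_stats(rows)
print(means)      # [2.0, 20.0]
print(variances)  # [1.0, 100.0]
print(pearson([r[0] for r in rows], [r[1] for r in rows]))  # 1.0
```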
Apache Spark - Intro
There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Error type        RDD       DataFrames (SQL)   Datasets
Syntax Errors     Runtime   Compile Time       Compile Time
Analysis Errors   Runtime   Runtime            Compile Time
Apache Spark - Intro
“Datasets support encoders which allow to map semi-structured formats (eg
JSON) to constructs of type safe languages (Scala, Java). Also they have better
performance compared to java serialization or kryo.”
MLlib
A library for machine learning on top of Spark. Has two APIs:
- RDD based (spark.mllib).
- Datasets / DataFrames based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.
MLlib
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode.”
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.
MLlib
Supports different categories of ML algorithms:
● Basic statistics (correlations etc)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
It can be used to implement fraud detection, recommendation engines, ...
MLlib Local
A new package is available for using the algorithms in production without the need
for Spark itself. How does this method compare with PMML?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365
MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
It describes wine quality along different dimensions, such as chlorides, sugar, etc.
We will apply k-means to identify different clusters of wine quality.
Both the mllib and the ml implementations are available as Spark notebooks.
Normalize Data -> K-means -> PCA -> Visualize
MLlib - Unsupervised Learning Example
[Code screenshot: parse the data, then train k-means with different values of k]
MLlib - Unsupervised Learning Example
[Code screenshot: computing the errors for the elbow method]
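The notebook code is not recoverable from the slide. As a rough, single-machine illustration of the elbow method, the sketch below runs a simple k-means (Lloyd's algorithm with a deterministic initialisation) for several values of k and reports the within-cluster sum of squared errors. The six points are made-up, already normalized data, not the wine data set.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iterations=20):
    centroids = [points[i] for i in range(k)]  # deterministic init: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

def wssse(points, centroids):
    """Within-set sum of squared errors: distance to the nearest centroid."""
    return sum(min(dist2(p, c) for c in centroids) for p in points)

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
errors = {k: wssse(points, kmeans(points, k)) for k in (1, 2, 3)}
for k, err in errors.items():
    print(k, round(err, 3))
```

On this toy data the error falls sharply from k=1 to k=2 and only slightly from k=2 to k=3, so the "elbow" of the curve is at k=2.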
MLlib - Unsupervised Learning Example
[Code screenshot: PCA analysis to verify k-means with k=2]
MLlib - Unsupervised Learning Example
[Plot: PCA projection of the clusters, k=2]
MLlib - Unsupervised Learning Example
[Screenshot: available with the mllib implementation]
Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) as predictions from my_spark_image_table
https://github.com/databricks/spark-deep-learning
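The "model as a SQL UDF" idea can be tried without a cluster. The sketch below uses SQLite from the Python standard library as a stand-in for Spark SQL; the table, the UDF name and the threshold "model" are hypothetical, and this is not the spark-deep-learning API.

```python
import sqlite3

def my_model(feature: float) -> float:
    # Stand-in for a trained model; a real UDF would load and call one.
    return 1.0 if feature > 0.5 else 0.0

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("my_model_udf", 1, my_model)

conn.execute("CREATE TABLE my_table (id INTEGER, feature REAL)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, 0.2), (2, 0.9)])

rows = conn.execute(
    "SELECT id, my_model_udf(feature) AS prediction FROM my_table ORDER BY id"
).fetchall()
print(rows)  # [(1, 0.0), (2, 1.0)]
```

Once the model is registered, anyone who can write a SELECT statement can score data with it, which is the point of the first bullet.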
BigDL
● Developed by Intel.
● It does not use GPUs; it is optimized for Intel processors.
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High performance, powered by Intel MKL and multi-threaded programming.
● Scales out easily.
● Appropriate for users who are not DL experts.
BigDL
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● A lot of useful features: loss functions, layers support, etc.
● Implements a parameter server for distributed training of DL models
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
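The parameter-server idea (workers compute gradients on their own data shard, a server averages them and updates the shared parameters) can be simulated in one process. This is a conceptual sketch with a linear model and made-up data; the class and function names are hypothetical and unrelated to BigDL's actual API.

```python
class ParameterServer:
    """Holds the shared parameters and applies averaged gradients."""

    def __init__(self, w, b, lr):
        self.w, self.b, self.lr = w, b, lr

    def pull(self):
        return self.w, self.b

    def push(self, gradients):
        """Average the workers' gradients and apply one update."""
        n = len(gradients)
        self.w -= self.lr * sum(g[0] for g in gradients) / n
        self.b -= self.lr * sum(g[1] for g in gradients) / n

def worker_gradient(shard, w, b):
    """Mean squared-error gradient of a linear model on one data shard."""
    gw = sum(((w * x + b) - y) * x for x, y in shard) / len(shard)
    gb = sum(((w * x + b) - y) for x, y in shard) / len(shard)
    return gw, gb

# Hypothetical data from y = 2x - 1, split evenly across two workers.
data = [(x / 10.0, 2.0 * (x / 10.0) - 1.0) for x in range(20)]
shards = [data[0::2], data[1::2]]

server = ParameterServer(w=0.0, b=0.0, lr=0.5)
for _ in range(500):                       # synchronous training rounds
    w, b = server.pull()                   # workers fetch current parameters
    server.push([worker_gradient(s, w, b) for s in shards])

print(round(server.w, 3), round(server.b, 3))
```

With equally sized shards, averaging the per-worker gradients gives the full-batch gradient, so the distributed run follows the same trajectory as single-machine gradient descent.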
BigDL in practice
For a cool example of using BigDL on Mesos, check our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/ml/references.md
48

More Related Content

What's hot

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0Databricks
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksDatabricks
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Spark Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Databricks
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesDatabricks
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataDatabricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkGeospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkDatabricks
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowDatabricks
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeDatabricks
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkDatabricks
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 

What's hot (19)

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkGeospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache Spark
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflow
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 

Similar to Machine learning at scale challenges and solutions

ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfM Waleed Kadous
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
Notes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at ScaleNotes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at ScaleDeep Kayal
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Enabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standardsEnabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standardsAxel Reichwein
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsGianmario Spacagna
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
databricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringdatabricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringMohamed MEJDOUBI
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine LearningVasu S
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 

Similar to Machine learning at scale challenges and solutions (20)

ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Notes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at ScaleNotes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at Scale
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Enabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standardsEnabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standards
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
 
Tutorial4
Tutorial4Tutorial4
Tutorial4
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
databricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringdatabricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineering
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 

Machine learning at scale challenges and solutions

  • 1. @s_kontopoulos Machine Learning at Scale: Challenges and Solutions Stavros Kontopoulos Senior Software Engineer @ Lightbend, M.Sc.
  • 2. Who am I?
      Senior Software Engineer @ Lightbend, Fast Data Team. Contributor at Apache Flink.
      Handles: skonto, s_kontopoulos, stavroskontopoulos (SlideShare).
      All trademarks and registered trademarks are property of their respective holders.
  • 3. Agenda
      - ML in the Enterprise
      - ML from development to production
      - Key technologies: Apache Spark as a case study
  • 4. ML in the Enterprise
      ML is a key tool that fuels the effort of coupling business monitoring (BI) with predictive and prescriptive analytics.
      Business insights -> business optimization -> data monetization
  • 5. ML in the Enterprise - The Data-Science Lifecycle
      - Identify the business question
      - Identify and collect related data
      - Data cleansing, feature extraction (data pre-processing)
      - Experiment planning
      - Model building
      - Model evaluation
      - Model deployment/management in production
      - Model optimization - performance
  • 6. Machine Learning Model
      A model is a function that maps inputs to outputs and essentially expresses a mathematical abstraction.
      [Formulas shown on the slide: Linear Regression, Neural Network, Random Forest, function composition]
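The idea that a model is a function, and that larger models arise by composing functions, can be sketched in a few lines of plain Python. This is only an illustration: the weights, the `relu` activation and the helper names `linear_model` and `compose` are assumptions, not taken from the slides.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]


def linear_model(weights: Vector, bias: float) -> Model:
    """Return the function f(x) = w . x + b."""
    def predict(x: Vector) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def compose(outer: Callable[[float], float], inner: Model) -> Model:
    """Function composition: the result computes outer(inner(x))."""
    return lambda x: outer(inner(x))


def relu(z: float) -> float:
    """A common neural-network activation function."""
    return max(0.0, z)


# Hypothetical parameters, chosen only for illustration.
regression = linear_model(weights=[2.0, -1.0], bias=0.5)
# A single neuron is an activation composed with a linear map.
neuron = compose(relu, regression)
```

Calling `regression([2.0, 1.0])` evaluates the linear map, while `neuron` clips negative outputs of the same map to zero.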
  • 7. Model Evolution
      - Models can be either pre-computed (e.g. trained off-line) or updated on-line.
      - Online ML with streaming:
          - Pure online learning uses only the latest arrived data point to update the model. In practice, models are usually updated per batch/window, e.g. online k-means.
          - An interesting case is sampling the stream and training a model only when the distribution changes.
          - Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling.
          - Re-training the model from scratch, ignoring the previous one.
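A pure online update with SGD and random sampling might look like the following minimal sketch. It assumes a linear model with squared-error loss and a synthetic, noise-free data set; the function names, learning rate and data are illustrative assumptions, not the author's code.

```python
import random


def sgd_step(weights, x, y, learning_rate):
    """Update the weights using only the latest (x, y) example."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - y
    # Gradient of 0.5 * error^2 with respect to each weight is error * xi.
    return [w - learning_rate * error * xi for w, xi in zip(weights, x)]


def train_online(stream, weights, learning_rate=0.1):
    """Consume a stream of examples, updating the model one point at a time."""
    for x, y in stream:
        weights = sgd_step(weights, x, y, learning_rate)
    return weights


def sampled_stream(data, steps, seed=0):
    """Simulate a stream by randomly sampling examples from a data set."""
    rng = random.Random(seed)
    for _ in range(steps):
        yield rng.choice(data)


# Synthetic data for y = 2 * x1 + 1; the constant second feature acts as the bias.
data = [([i / 10, 1.0], 2 * (i / 10) + 1) for i in range(10)]
learned = train_online(sampled_stream(data, steps=5000), weights=[0.0, 0.0])
```

With this noise-free data the learned weights approach the slope 2 and the intercept 1. A windowed variant would apply the same step to the averaged gradient of each batch.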
  • 8. Machine Learning Pipeline
      A machine learning pipeline in production describes all steps, from data preprocessing before feeding the model to processing of the model output (post-processing).
  • 9. Machine Learning Pipeline in Libraries
      Pros:
      - Training data and test data go through the same steps
      - Like a CI (continuous integration) pipeline, people can reason about the data transformations
      - Caching of computations
      - Easier model serving
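The first advantage, that training and test data pass through exactly the same steps, can be shown with a toy pipeline in plain Python. The classes `MeanCenter`, `ThresholdModel` and `Pipeline` are hypothetical and only loosely mirror the fit/transform convention of ML libraries; this is not the Spark API.

```python
from statistics import mean


class MeanCenter:
    """Pre-processing stage: learns the mean from the training data only."""

    def fit(self, xs):
        self.mu = mean(xs)
        return self

    def transform(self, xs):
        return [x - self.mu for x in xs]


class ThresholdModel:
    """Toy model stage: predicts 1 when the centered value is positive."""

    def fit(self, xs):
        return self

    def transform(self, xs):
        return [1 if x > 0 else 0 for x in xs]


class Pipeline:
    """Chains stages so every data set goes through the same steps."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for stage in self.stages:
            xs = stage.transform(xs)
        return xs


pipeline = Pipeline([MeanCenter(), ThresholdModel()]).fit([1.0, 2.0, 3.0, 6.0])
test_predictions = pipeline.transform([2.0, 4.0])
```

Because the mean is learned once in `fit` and reused in `transform`, the test data cannot accidentally be centered with its own statistics.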
  • 10. Multiple Models in a Pipeline
      Within the same pipeline it is also possible to run multiple models:
      a) Model segmentation
      b) Model ensemble
      c) Model chaining
      d) Model composition
      http://dmg.org/pmml/v4-1/MultipleModels.html
      http://dl.acm.org/citation.cfm?id=1859403
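Three of these combinations can be expressed as small higher-order functions. The sketch below treats a model as any Python callable; the helper names and the two toy models are assumptions for illustration, and model composition as defined by PMML is not covered.

```python
from statistics import mean


def ensemble(models):
    """Model ensemble: average the predictions of several models."""
    return lambda x: mean(m(x) for m in models)


def chain(first, second):
    """Model chaining: the output of the first model feeds the second."""
    return lambda x: second(first(x))


def segment(models_by_key, key_of):
    """Model segmentation: pick the model according to a key of the input."""
    return lambda x: models_by_key[key_of(x)](x)


# Hypothetical toy models.
def double(x):
    return 2 * x


def plus_two(x):
    return x + 2


averaged = ensemble([double, plus_two])
chained = chain(double, plus_two)
segmented = segment(
    {"small": double, "large": plus_two},
    key_of=lambda x: "small" if x < 10 else "large",
)
```

For the input 4, `averaged` returns the mean of 8 and 6, `chained` returns 10, and `segmented` routes the input to `double`.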
  • 11. Model Development & Production
      [Diagram: Data Scientist and Data Engineer]
  • 12. Model Standardization
      [Diagram labels: ML Framework, Export, Model Definition, Import, Evaluation, Data, Predictions]
      PFA - Portable Format for Analytics
  • 13. Model Standardization
      - PFA or PMML won’t break the pipeline. PFA is more flexible than PMML. “Unlike PMML, PFA has control structures to direct program flow, a true type system for both model parameters and data, and its statistical functions are much more finely grained and can accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
      - Custom model definitions and implementations are more flexible or more optimized but could break the pipeline.
      - Some implementations:
          - https://github.com/jpmml/jpmml-evaluator-spark
          - https://github.com/jpmml
          - https://github.com/opendatagroup/hadrian
  • 14. Model Lifecycle
      Some concerns about the model lifecycle:
      - Model evolution
      - Model release practices
      - Model versioning
      - Model update process
  • 15. Model Governance
      Models should:
      ● be governed by the company’s policies and procedures, laws and regulations, and the organization’s goals
      ● be searchable across the company
      ● be transparent, explainable, traceable and interpretable for auditors and regulators. Example GDPR requirements: https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
      ● have an approval and release process
  • 16. Model Server
      “A model server is a system which handles the lifecycle of a model and provides the required APIs for deploying a model/pipeline.”
      Examples: Clipper (image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/) and TensorFlow Serving (image: https://www.tensorflow.org/serving/)
  • 17. Model Serving - Requirements
      Other requirements:
      - Response time: the time to calculate a prediction. Could be a few milliseconds.
      - Throughput: predictions per second.
      - Support for running multiple models. It is very common to run hundreds of models, e.g. a telecom operator with one model per customer, or IoT with one model per site/sensor.
  • 18. Model Serving - Requirements
      - Multiple versions of the same machine learning pipeline within the system. One reason can be A/B testing.
      - Model update: how quickly and easily can a model be updated?
      - Uptime/reliability
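Serving two versions of a model side by side for A/B testing usually needs a stable way to split traffic. One common approach, sketched here under assumptions, hashes a request key into a bucket so the same key always reaches the same version. The names `assign_version`, `serve` and the two toy model versions are hypothetical.

```python
import hashlib


def assign_version(request_key: str, share_b: float) -> str:
    """Deterministically route a key to version 'A' or 'B'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# Hypothetical model versions registered in the server.
MODELS = {
    "A": lambda x: 2.0 * x,
    "B": lambda x: 2.1 * x,
}


def serve(request_key: str, x: float, share_b: float = 0.1):
    """Score x with the version assigned to this key."""
    version = assign_version(request_key, share_b)
    return version, MODELS[version](x)
```

Raising `share_b` gradually shifts more keys to version B without reshuffling the keys that were already assigned to it.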
  • 19. TensorFlow Serving Issues
      Not all systems cover the requirements. For example:
      ● Metadata not available (https://github.com/tensorflow/serving/issues/612)
      ● No new models at runtime (https://github.com/tensorflow/serving/issues/422)
      ● Can be hard to build from scratch (https://github.com/tensorflow/serving/issues/327)
  • 20. Model Serving with Apache Flink
      Apache Flink: a streaming engine based on the Beam model, with low latency compared to Spark.
  • 21. Model Serving with Apache Flink
      Idea: exploit Flink’s low-latency capabilities for serving models. Focus on offline models loaded from permanent storage, and update them without interruption.
      FLIP proposal: https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8
      Combines different efforts (https://github.com/FlinkML):
      ● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
      ● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
      ● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
  • 22. Model Serving with Apache Flink
      Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams.
      Flink provides two ways of implementing low-level joins: a key-based join based on CoProcessFunction and a partition-based join based on RichCoFlatMapFunction.
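The control-stream/data-stream pattern can be simulated without Flink to make the state handling explicit. The sketch below is plain Python, not the Flink API: a dictionary stands in for keyed operator state, and the event tuples and the function name `serve_models` are assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

Model = Callable[[float], float]
Event = Tuple[str, str, object]  # (kind, key, payload)


def serve_models(events: Iterable[Event]) -> Iterator[Tuple[str, Optional[float]]]:
    """Process a merged stream of control and data events.

    ("control", key, model) replaces the model kept in state for that key.
    ("data", key, value) is scored with the current model for that key.
    """
    state: Dict[str, Model] = {}  # stands in for keyed operator state
    for kind, key, payload in events:
        if kind == "control":
            state[key] = payload
        elif kind == "data":
            model = state.get(key)
            # No model has arrived yet for this key: emit None.
            yield key, (model(payload) if model is not None else None)


events = [
    ("data", "sensor-1", 1.0),
    ("control", "sensor-1", lambda v: v * 10),
    ("data", "sensor-1", 1.0),
    ("control", "sensor-1", lambda v: v + 1),
    ("data", "sensor-1", 1.0),
]
results = list(serve_models(events))
```

The same data value produces a different score after each control event, which is the update-without-interruption behaviour the slide describes.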
  • 23. Model Serving with Apache Flink
      More here: https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
  • 24. Data Lakes
      How can we work with data to cover future needs and use cases? We need a robust ML framework plus flexible infrastructure. Data warehouses will not work. Data lake to the rescue.
      “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” - Wikipedia
  • 25. Data Lakes
      ● Agility. It can be seen as a tool that makes data accessible to different users and facilitates ML.
      ● Designed for low-cost storage
      ● Schema on read
      ● Security and governance still maturing
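Schema on read means the raw records are stored as they arrive and a schema is applied only when a consumer reads them. A minimal sketch, assuming JSON lines as the raw format and a schema expressed as field name to cast function; the records and the helper `read_with_schema` are hypothetical.

```python
import json

# Raw records are stored untouched, with whatever fields and types they came with.
RAW_LINES = [
    '{"sensor": "a1", "temp": "21.5", "site": "athens"}',
    '{"sensor": "b7", "temp": 19}',
    '{"sensor": "c3", "humidity": 0.4}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> cast function) at read time.

    Fields missing from a record become None; extra fields are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


temperature_view = list(read_with_schema(RAW_LINES, {"sensor": str, "temp": float}))
```

A different consumer could read the same lines with another schema, for example one that selects `humidity`, without rewriting the stored data.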
  • 26. Data Lake Issues
      “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” - Gartner
      Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM Watson Platform, etc.
  • 27. Notebooks
      Very convenient for the data scientist or the analyst. Production is usually based on traditional deployment methods.
      - Spark Notebook
      - Apache Zeppelin
      - Jupyter
  • 28. ML with Apache Spark
      “A popular big data framework for ML and data-science.”
      - You can work locally and move to production fast
      - ETL/feature engineering
      - Hyper-parameter tuning
      - Rich model support
      - Multiple language support (Scala, Java, Python, R)
  • 29. Apache Spark - Intro
      A framework for distributed in-memory data processing.
  • 30. Apache Spark - Intro
      - The user defines computations/operations (map, flatMap, etc.) on the data sets (bounded or not) as a DAG.
      - The DAG is shipped to the nodes where the data lie, the computation is executed, and the results are sent back to the user.
      - The data sets are treated as immutable distributed data (RDDs).
      - A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects.
  • 31. Apache Spark - Basic Example in Scala
      [Code screenshot: basic statistics, a hello world for ML]
  • 32. Apache Spark - Intro
      There are three APIs: RDD, DataFrames, Datasets
      https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

                          RDD        DataFrames (SQL)   Datasets
      Syntax errors       Runtime    Compile time       Compile time
      Analysis errors     Runtime    Runtime            Compile time
  • 33. Apache Spark - Intro
      “Datasets support encoders which allow to map semi-structured formats (eg JSON) to constructs of type safe languages (Scala, Java). Also they have better performance compared to java serialization or kryo.”
  • 34. MLlib
      A library for machine learning on top of Spark. It has two APIs:
      - RDD-based (spark.mllib)
      - Dataset/DataFrame-based (spark.ml)
      The latter is relatively new and makes it easier to construct an ML pipeline or run an algorithm. The former is older and has more features.
  • 35. MLlib
      “As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode.”
      What are the implications?
      ● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
      ● MLlib will not add new features to the RDD-based API.
      ● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
      ● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
      ● The RDD-based API is expected to be removed in Spark 3.0.
  • 36. MLlib
      Supports different categories of ML algorithms:
      ● Basic statistics (correlations, etc.)
      ● Pipelines (LSH, TF-IDF)
      ● Extracting, transforming and selecting features
      ● Classification and regression (random forests, gradient-boosted trees)
      ● Clustering (k-means, LDA, etc.)
      ● Collaborative filtering
      ● Frequent pattern mining
      ● Model selection and tuning
      Makes it possible to implement fraud detection, recommendation engines, ...
  • 37. MLlib Local
      A new package is available for production use of the algorithms without the need for Spark itself. How does PMML compare with this method?
      https://issues.apache.org/jira/browse/SPARK-13944
      https://issues.apache.org/jira/browse/SPARK-16365
  • 38. MLlib - Unsupervised Learning Example
      Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
      It describes wine quality along different dimensions such as chlorides, sugar, etc. We will apply k-means to identify different clusters of wine quality. Both the mllib and the ml implementations are available as Spark notebooks.
      Steps: Normalize data -> K-means -> PCA -> Visualize
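The normalization step matters because k-means relies on distances, so a feature with a large range would otherwise dominate. A minimal sketch of per-column standardization in plain Python follows; it is not the Spark scaler used in the notebooks, and the sample rows are made-up values standing in for (chlorides, sugar).

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Scale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # A constant column has zero deviation; divide by 1.0 instead to avoid an error.
    deviations = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, deviations)]
        for row in rows
    ]


# Hypothetical rows: (chlorides, residual sugar).
rows = [[0.05, 2.0], [0.07, 6.0], [0.09, 10.0]]
scaled = standardize_columns(rows)
```

After scaling, both columns contribute comparably to the Euclidean distance even though their raw ranges differ by two orders of magnitude.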
  • 39. MLlib - Unsupervised Learning Example
      [Code screenshots: parse data; train k-means with different k]
  • 40. MLlib - Unsupervised Learning Example
      [Code screenshot: counting errors for the elbow method]
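The elbow method trains k-means for several values of k and looks for the point where the within-cluster sum of squared errors stops dropping sharply. The sketch below is a small 1-D version in plain Python with a deterministic initialization; it is an illustration under those assumptions, not the Spark code shown on the slide, and the data points are invented.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centers, sum of squared errors)."""
    ordered = sorted(points)
    # Deterministic initialization: spread the centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center when a cluster received no points.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost


# Invented data with three obvious groups around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
costs = {k: kmeans_1d(points, k)[1] for k in range(1, 5)}
```

On this data the cost falls steeply up to k=3 and barely changes at k=4, so the elbow suggests three clusters.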
  • 41. MLlib - Unsupervised Learning Example
      [Code screenshot: PCA analysis to verify k-means with k=2]
  • 42. MLlib - Unsupervised Learning Example
      [Plot: PCA, k=2]
  • 43. MLlib - Unsupervised Learning Example
      Available with the mllib implementation
  • 44. Spark Deep Learning Pipelines
      - People know SQL
      - Models are productized as SQL UDFs
      Predictions as a SQL statement:
      SELECT my_custom_keras_model_udf(image) AS predictions FROM my_spark_image_table
      https://github.com/databricks/spark-deep-learning
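The idea of exposing a model as a SQL UDF can be tried out with Python's built-in `sqlite3` module, used here as a stand-in for Spark SQL. The model `my_model`, the UDF name `my_model_udf` and the table `inputs` are assumptions for illustration; the registration API of Deep Learning Pipelines is different.

```python
import sqlite3


def my_model(x: float) -> float:
    """Hypothetical stand-in for a trained model."""
    return 2.0 * x + 1.0


connection = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
connection.create_function("my_model_udf", 1, my_model)
connection.execute("CREATE TABLE inputs (x REAL)")
connection.executemany("INSERT INTO inputs (x) VALUES (?)", [(0.0,), (1.5,)])

predictions = [
    row[0]
    for row in connection.execute(
        "SELECT my_model_udf(x) AS prediction FROM inputs ORDER BY x"
    )
]
```

Anyone who can write the `SELECT` statement can now score rows, without knowing how the model was trained.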
  • 45. BigDL
      ● Developed by Intel.
      ● It does not use GPUs; it is optimized for Intel processors. “It is orders of magnitude faster than out-of-box open source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
      ● It is implemented as a standalone package on Spark.
      ● Can be used with existing Spark or Hadoop clusters.
      ● High performance, powered by Intel MKL and multi-threaded programming.
      ● Easily scaled out.
      ● Appropriate for users who are not DL experts.
  • 46. BigDL
      ● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and testing machine learning models.
      ● A lot of useful features: loss functions, layers support, etc.
      ● Implements a parameter server for distributed training of DL models.
      ● Supports visualization via TensorBoard: https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
  • 47. BigDL in practice
      For a cool example of using BigDL on Mesos, check our blog: http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/