2. @s_kontopoulos
Who am I?
Senior Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
GitHub: skonto | Twitter: @s_kontopoulos | SlideShare: stavroskontopoulos
All trademarks and registered trademarks are property of their respective holders.
3. @s_kontopoulos
Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study
4. @s_kontopoulos
ML in the Enterprise
ML is a key tool for coupling business monitoring (BI) with predictive and
prescriptive analytics:
business insights -> business optimization -> data monetization
5. @s_kontopoulos
ML in the Enterprise - The Data-Science LifeCycle
Identify Business Question
Identify and collect related Data
Data cleansing, feature extraction (Data pre-processing)
Experiment planning
Model Building
Model Evaluation
Model Deployment/Management in Production
Model Optimization - Performance
6. @s_kontopoulos
Machine Learning Model
A model is a function that maps inputs to outputs; it is essentially a
mathematical abstraction.
Linear Regression: y = w^T x + b
Neural Network: y = f_n(... f_2(f_1(x)) ...) (function composition)
Random Forest: y = vote/average over decision trees t_1(x), ..., t_T(x)
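To make the abstraction concrete, a minimal Scala sketch of models as plain functions (the stand-in implementations are illustrative, not real trained models):

```scala
// A model as a plain function from features to a prediction.
type Model = Array[Double] => Double

val linear: Model = x => 0.5 * x(0) + 2.0 * x(1) + 1.0    // y = w^T x + b

val layer1: Array[Double] => Array[Double] =
  x => x.map(v => math.max(0.0, v))                       // a ReLU "layer"
val layer2: Model = x => x.sum
val network: Model = x => layer2(layer1(x))               // function composition
```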
7. @s_kontopoulos
Model Evolution
- Models can either be pre-computed (trained offline) or updated online.
- Online ML with streaming:
- Pure online learning updates the model with every newly arrived data point. In practice,
though, models are usually updated per batch/window (e.g. online k-means).
- An interesting case is sampling the stream and re-training a model only when the data
distribution changes.
- Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling, as in the sketch below.
- Alternatively, re-train the model from scratch, discarding the previous one.
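A minimal sketch of an adaptive online SGD update for a linear model (the names and the learning rate are illustrative assumptions):

```scala
// One SGD step per incoming (x, y) sample updates the model in place of
// a full offline re-training run.
final case class LinearModel(weights: Array[Double], bias: Double) {
  def predict(x: Array[Double]): Double =
    weights.zip(x).map { case (w, v) => w * v }.sum + bias
}

object OnlineSgd {
  /** One SGD step on a single sample; returns the updated model. */
  def update(m: LinearModel, x: Array[Double], y: Double,
             lr: Double = 0.01): LinearModel = {
    val err = m.predict(x) - y
    val newW = m.weights.zip(x).map { case (w, v) => w - lr * err * v }
    LinearModel(newW, m.bias - lr * err)
  }
}
```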
8. @s_kontopoulos
Machine Learning Pipeline
A machine learning pipeline in production describes all steps from data
pre-processing (before data is fed to the model) to processing of the model's
output (post-processing).
9. @s_kontopoulos
Machine Learning Pipeline in Libraries
Pros:
- Training data and test data go through the same steps
- Like a CI (continuous integration) pipeline, people can reason about the data
transformations
- Caching of computations
- Model serving becomes easier
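A minimal spark.ml pipeline sketch, following the classic text-classification pattern from the Spark docs (trainingDf and testDf are assumed DataFrames with "text" and "label" columns):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Training and test data go through exactly the same stages.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDf)      // assumed input DataFrame
val predictions = model.transform(testDf) // same steps applied to test data
```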
10. @s_kontopoulos
Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models (see the sketch after the references below):
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
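A minimal sketch of (b) ensembling and (c) chaining as plain function composition (the stand-in models and the averaging strategy are illustrative assumptions):

```scala
type Model = Array[Double] => Double

val model1: Model = x => x.sum   // stand-ins for real models
val model2: Model = x => x.max

// (c) Model chaining: one model's output feeds the next stage.
val chained: Model = x => model2(Array(model1(x)))

// (b) Model ensemble: combine several models' outputs (here: average).
def ensemble(models: Seq[Model]): Model =
  x => models.map(_(x)).sum / models.size
```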
11. @s_kontopoulos
Model Development & Production
[Image: typical tools of the Data Scientist vs. those of the Data Engineer]
13. @s_kontopoulos
Model Standardization
- Standard formats (PMML or PFA) won't break the pipeline. PFA is more flexible than PMML:
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some Implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
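A hedged Scala sketch of scoring a PMML file with jpmml-evaluator (a Java API used from Scala); the file name and the raw input map are illustrative assumptions, and the method names follow the library's documented API:

```scala
import java.io.FileInputStream
import scala.collection.JavaConverters._
import org.jpmml.model.PMMLUtil
import org.jpmml.evaluator.ModelEvaluatorFactory

// Load the standardized model and build an evaluator for it.
val pmml = PMMLUtil.unmarshal(new FileInputStream("model.pmml"))
val evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
evaluator.verify()

// Map raw input values onto the model's declared input fields, then score.
val raw: Map[String, Any] = Map("sepal_length" -> 5.1, "sepal_width" -> 3.5)
val arguments = evaluator.getInputFields.asScala.map { field =>
  field.getName -> field.prepare(raw(field.getName.getValue))
}.toMap.asJava
val results = evaluator.evaluate(arguments)
```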
15. @s_kontopoulos
Model Governance
● Models must be governed by the company's policies and procedures, laws and
regulations, and the organization's goals
● They must be searchable across the company
● They must be transparent, explainable, traceable and interpretable for auditors and
regulators. Example GDPR requirements:
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● They must have an approval and release process
16. @s_kontopoulos
Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Examples: Clipper (https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/)
and TensorFlow Serving (https://www.tensorflow.org/serving/).
17. @s_kontopoulos
Model Serving - Requirements
Typical requirements:
- Response time: the time to compute a prediction; can need to be as low as a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models (it is very common to run hundreds of models,
e.g. a telecom operator with one model per customer, or IoT with one model per
site/sensor).
18. @s_kontopoulos
Model Serving - Requirements
- Support for multiple versions of the same machine learning pipeline within the system,
e.g. for A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
19. @s_kontopoulos
TensorFlow Serving Issues
Not all systems cover the requirements. For example:
● Metadata not available (https://github.com/tensorflow/serving/issues/612)
● No support for adding new models at runtime (https://github.com/tensorflow/serving/issues/422)
● Can be hard to build from scratch (https://github.com/tensorflow/serving/issues/327)
20. @s_kontopoulos
Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam (Dataflow) model, offering low
latency compared to Spark Streaming.
21. @s_kontopoulos
Model Serving with Apache Flink
Idea: exploit Flink's low-latency capabilities for serving models. Focus on offline
models loaded from permanent storage, and update them without interruption.
FLIP proposal:
https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
22. @s_kontopoulos
Model Serving with Apache Flink
Use a control stream and a data stream: keep the model in the operator's state and join
the two streams, as in the sketch below.
Flink provides two ways of implementing such low-level joins: a key-based join using
CoProcessFunction, and a partition-based join using RichCoFlatMapFunction.
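A minimal sketch of the control-stream pattern (a keyed variant using RichCoFlatMapFunction; Record, Model and the scoring call are illustrative assumptions, not the FlinkML implementation itself):

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

case class Record(features: Array[Double])            // assumed data type
trait Model extends Serializable { def score(r: Record): Double }

class ModelServingFlatMap extends RichCoFlatMapFunction[Record, Model, Double] {
  @transient private var modelState: ValueState[Model] = _

  override def open(parameters: Configuration): Unit =
    modelState = getRuntimeContext.getState(
      new ValueStateDescriptor[Model]("model", classOf[Model]))

  // Data stream: score with the current model, if one has arrived.
  override def flatMap1(r: Record, out: Collector[Double]): Unit = {
    val m = modelState.value()
    if (m != null) out.collect(m.score(r))
  }

  // Control stream: swap in the new model without interrupting serving.
  override def flatMap2(m: Model, out: Collector[Double]): Unit =
    modelState.update(m)
}

// Usage: dataStream.keyBy(...).connect(controlStream.keyBy(...)).flatMap(new ModelServingFlatMap)
```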
23. @s_kontopoulos
Model Serving with Apache Flink
More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
24. @s_kontopoulos
Data Lakes
How can we work with data so that future needs and use cases are covered? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work;
the data lake comes to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia
25. @s_kontopoulos
Data Lakes
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.
26. @s_kontopoulos
Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.
27. @s_kontopoulos
Notebooks
Very convenient for the data scientist or the analyst.
Production, however, is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter
28. @s_kontopoulos
ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning (see the sketch after this list)
- Rich Model support
- Multiple language support (Scala, Java, Python, R)
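A hedged sketch of hyper-parameter tuning with spark.ml's CrossValidator, reusing pipeline, lr and trainingDf from the earlier pipeline sketch:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate hyper-parameter values to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDf)   // picks the best parameter combination
```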
30. @s_kontopoulos
Apache Spark - Intro
- The user defines computations/operations (map, flatMap, etc.) on data-sets
(bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data lives, the computation is executed
there, and results are sent back to the user.
- The data-sets are immutable distributed data: Resilient Distributed Datasets
(RDDs), immutable distributed collections of objects.
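A minimal RDD sketch, the classic word count (the input path and local master are illustrative assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("wordcount").setMaster("local[*]")) // local run for illustration

val counts = sc.textFile("hdfs:///data/corpus.txt")   // an immutable RDD[String]
  .flatMap(_.split("\\s+"))                           // transformations build the DAG...
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)                      // ...an action triggers execution
```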
32. @s_kontopoulos
Apache Spark - Intro
There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
                 RDD      DataFrames (SQL)   Datasets
Syntax errors    Runtime  Compile time       Compile time
Analysis errors  Runtime  Runtime            Compile time
33. @s_kontopoulos
Apache Spark - Intro
“Datasets support encoders, which map semi-structured formats (e.g. JSON) to
constructs of type-safe languages (Scala, Java). Encoders also offer better
performance than Java serialization or Kryo.”
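A minimal sketch of a typed Dataset with an encoder (the Person schema and file path are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder.appName("encoders").getOrCreate()
import spark.implicits._   // brings the implicit Encoders into scope

val people = spark.read.json("people.json").as[Person] // typed Dataset[Person]
people.filter(_.age > 21).show()                       // compile-time checked field access
```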
34. @s_kontopoulos
MLlib
A library for machine learning on top of Spark. It has two APIs:
- RDD-based (spark.mllib).
- Datasets/DataFrames-based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.
35. @s_kontopoulos
MLlib
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode. “
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.
36. @s_kontopoulos
MLlib
Supports different categories of ML algorithms:
● Basic statistics (correlations etc)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
Enables use cases such as fraud detection, recommendation engines, ...
37. @s_kontopoulos
MLlib Local
A new package makes the algorithms available for production use without requiring
Spark itself. How does this method compare to PMML?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365
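A minimal sketch of the idea behind mllib-local: use spark.ml linear algebra in a plain JVM process, with no SparkContext; the weights and the manual dot product stand in for a real exported model's scoring code:

```scala
import org.apache.spark.ml.linalg.Vectors

val weights = Vectors.dense(0.5, -1.2, 3.0)   // assumed model parameters
val features = Vectors.dense(1.0, 0.0, 2.0)

// Score locally, e.g. inside a plain JVM service, with no cluster involved.
val prediction = weights.toArray.zip(features.toArray)
  .map { case (w, x) => w * x }.sum
println(s"prediction = $prediction")
```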
38. @s_kontopoulos
MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
Describes wine quality. Different dimensions like: chlorides, sugar etc.
We will apply k-means to identify different clusters of wine quality.
Both the spark.mllib and spark.ml implementations are provided as Spark notebooks.
Pipeline: Normalize Data -> K-means -> PCA -> Visualize
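A hedged sketch of this flow with spark.ml (wineDf, the column names and the choice of k are illustrative assumptions):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}

val assembler = new VectorAssembler()
  .setInputCols(Array("chlorides", "residual_sugar", "pH")) // assumed columns
  .setOutputCol("raw_features")
val scaler = new StandardScaler()                 // normalize the data
  .setInputCol("raw_features").setOutputCol("features")
val kmeans = new KMeans().setK(5).setFeaturesCol("features")
val pca = new PCA().setK(2)                       // project to 2-D for plotting
  .setInputCol("features").setOutputCol("pca_features")

val withFeatures = assembler.transform(wineDf)    // wineDf: assumed DataFrame
val scaled = scaler.fit(withFeatures).transform(withFeatures)
val clustered = kmeans.fit(scaled).transform(scaled)
val projected = pca.fit(clustered).transform(clustered)  // visualize pca_features
```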
44. @s_kontopoulos
Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) as predictions from my_spark_image_table
https://github.com/databricks/spark-deep-learning
45. @s_kontopoulos
BigDL
● Developed by Intel.
● It does not use GPUs; it is optimized for Intel processors.
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High-performance powered by Intel MKL and multi-threaded programming.
● Scales out easily
● Appropriate for users who are not DL experts.
46. @s_kontopoulos
BigDL
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● Many useful features: loss functions, layer implementations, etc.
● Implements a parameter server for distributed training of DL models
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
47. @s_kontopoulos
BigDL in practice
For a cool example of using BigDL on Mesos, check our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/