Serve and Scale ML Models (Low-Latency Prediction Systems)
1. Serve ML Models (low-latency prediction
systems) at scale in the Cloud and at the Edge
Srinivasa Rao Aravilli
Senior Engineering Manager
Cisco Systems
2. About me
Name: Srinivasa Rao Aravilli
Experience: 23 years (wish that were my age now!)
Interests: Distributed Computing, AI/ML, Security, and Cloud
Patent: Reinforcement learning-based software recommendations for network devices
Papers: arXiv: VEDAR (anomaly detection), Advaita (bug duplicity detection system);
SOA Journal: various papers related to SOAP, UDDI, JAX-RPC …
Speaker at various conferences: AI/ML talks
Coach/Mentor: Advanced Certification in Machine Learning and Cloud (course from IIT Madras and upGrad)
3. Advaita – Flow Diagram / ML Pipeline (Offline/Online Mode)
Offline - detecting duplicates for a list of new bugs already filed for a given product
Online - detecting duplicates while filing a new bug in the bug system
[Pipeline diagram: (1) new bug/bugs (online) or (2) existing bugs (batch) -> (3) preprocessing -> (4) feature extraction -> (5) ML model -> (6) probable duplicate bugs]
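The stages in the diagram above can be sketched as plain functions so the same chain serves one new bug (online) or a batch of existing bugs (offline). This is an illustrative sketch only: the function names and the toy token-overlap scoring are assumptions, not the actual Advaita code, which uses edit distances and embeddings.

```python
def preprocess(bug_text):
    # Lowercase and tokenize the bug report text.
    return bug_text.lower().split()

def extract_features(tokens):
    # Toy feature: bag-of-words counts (the real pipeline uses edit
    # distances, word embeddings, fastText, etc.).
    features = {}
    for tok in tokens:
        features[tok] = features.get(tok, 0) + 1
    return features

def predict_duplicates(features, known_bugs):
    # Toy scoring: shared-token overlap with previously filed bugs,
    # standing in for the trained ML model.
    scored = []
    for bug_id, bug_feats in known_bugs.items():
        overlap = sum(min(features.get(t, 0), c) for t, c in bug_feats.items())
        scored.append((overlap, bug_id))
    return [bug_id for overlap, bug_id in sorted(scored, reverse=True) if overlap > 0]

def pipeline(bug_text, known_bugs):
    # Steps 3-6 of the diagram chained together.
    return predict_duplicates(extract_features(preprocess(bug_text)), known_bugs)
```

Because each stage is a pure function, the online path can call `pipeline` per request while the batch path maps it over all existing bugs.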
4. Use case: Bug Duplicity Detection System

Data Set:
§ Open-source systems' bugs
§ Number of bugs (Firefox) = ~37,000
§ New bugs (for online)
§ Existing bugs (batch)

ML Model:
§ Framework: XGBoost
§ Classification: binary
§ Features: syntax, semantic, edit distances, word embeddings, fastText

Predictions:
§ How to serve the predictions at scale with low latency?
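One common answer to the slide's question is to load the trained model once at process startup and expose it behind a lightweight HTTP endpoint, so each online request pays only for inference. The sketch below uses only the Python standard library; the model class, route, and port are illustrative assumptions, with a stub in place of the real XGBoost classifier.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class StubDuplicateModel:
    """Stand-in for the trained XGBoost binary classifier."""
    def predict(self, features):
        # Toy rule: treat highly similar bug pairs as duplicates.
        return 1 if features.get("similarity", 0.0) > 0.5 else 0

MODEL = StubDuplicateModel()  # loaded once, reused for every request

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature vector, score it, return JSON.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)
        payload = json.dumps({"duplicate": MODEL.predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve (threaded, so concurrent requests don't queue behind one another):
# ThreadingHTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

Keeping the model resident in memory avoids per-request load/deserialize cost, which is usually the dominant latency term for small feature vectors.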
5. One of the possible solutions …

[Diagram: a dedicated serving system per model: Bug Duplicate Detection System (XGBoost) -> Bug Duplicate Serving System (XGBoost); Network Failure (Spark) -> N/F Serving System (Spark); PII (MXNet) -> PII Serving System (MXNet); Phishing (Scikit) -> Phishing Serving System; the Business App must integrate with all of them (????)]
6. Use case: Phishing Websites Detection

Data Set:
§ Phishing websites data set: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites
§ Data set size: ~2,500
§ Number of attributes = 30
§ Classification = binary

ML Model:
§ Framework: Scikit-Learn
§ Classifier = Random Forest
§ Features: 30
§ Model persistence: joblib or pickle
§ Notebook: https://github.com/aravilli/Medha-AI/blob/master/Phishing-RF.ipynb

Predictions:
§ How to serve the model predictions at scale with low latency?
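The "model persistence: joblib or pickle" step above looks like the following in outline. A stub classifier stands in for the trained Random Forest so the snippet stays self-contained; with a real scikit-learn estimator you would typically prefer `joblib.dump`/`joblib.load`, which handle large NumPy arrays more efficiently.

```python
import os
import pickle
import tempfile

class StubPhishingClassifier:
    """Stand-in for the trained scikit-learn RandomForestClassifier."""
    def __init__(self, threshold=15):
        self.threshold = threshold
    def predict(self, rows):
        # Toy rule: flag a site when enough of its 30 binary
        # attributes look suspicious.
        return [1 if sum(row) >= self.threshold else 0 for row in rows]

# Train once, persist to disk, then load in the serving process.
model = StubPhishingClassifier()
path = os.path.join(tempfile.mkdtemp(), "phishing-rf.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)          # joblib.dump(model, path) with scikit-learn
with open(path, "rb") as f:
    served_model = pickle.load(f)  # joblib.load(path)
```

Persisting the fitted model decouples training (done offline, once) from serving (many processes can load the same artifact), but note that pickle files should only ever be loaded from trusted sources.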
7. Use case: Device Failures Detection

Data Set:
§ Syslogs and config files
§ Billions of historic syslogs

ML Model:
§ Framework: Spark ML
§ Unsupervised learning (clustering), association mining

Predictions:
§ How to serve the predictions at scale with low latency at the edge?
9. APTA: Personally Identifiable Information (PII) Detection

Data Set:
§ SQL files
§ Documents
§ Streaming and batch

ML Model:
§ Framework: MXNet
§ Classification: multi-class

Predictions:
§ How to serve the predictions at scale with low latency at the edge?
10. Challenges to serve these models …
• Building and maintaining a separate serving system for each framework is expensive to develop and maintain

[Diagram: multiple models (Bug Duplicate Detection System (XGBoost), Network Failure (Spark), PII (MXNet), Phishing (Scikit)), each with its own serving system, all consumed by the Business App]
11. Challenges to serve these models …
• Building and serving pre-materialized predictions carries significant computation and space costs, makes updates expensive, and may not be possible in all use cases
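The cost the bullet above describes can be made concrete with a toy sketch (all names and the stand-in model are illustrative): pre-materialization scores every possible input up front so serving becomes a lookup, but the table grows exponentially with the number of features and must be fully recomputed on every model update.

```python
import itertools

def predict(features):
    # Stand-in for a real model's scoring function.
    return 1 if sum(features) >= 2 else 0

# Pre-materialization: score every possible input ahead of time.
# Even for just 30 binary features the table would need 2**30
# (~1 billion) entries, and each model update means recomputing
# all of them -- the computation, space, and update costs the
# slide is pointing at.
N_FEATURES = 3  # kept tiny here so the sketch actually runs
table = {bits: predict(bits)
         for bits in itertools.product((0, 1), repeat=N_FEATURES)}

def serve(features):
    return table[tuple(features)]  # O(1) lookup, no model in the loop
```

It also only works when the input space is enumerable at all, which is why the slide notes it "may not be possible in all use cases" (e.g. free-text bug reports cannot be enumerated).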
12. Clipper (a low-latency prediction-serving system)
Developed by RISELab @ UC Berkeley
GitHub: https://github.com/ucbrise/clipper
https://www.usenix.org/sites/default/files/conference/protected-files/nsdi17_slides_crankshaw.pdf
http://learningsys.org/nips17/assets/slides/clipper-nips17.pdf
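Clipper's central idea is a model abstraction layer: framework-specific models (XGBoost, scikit-learn, Spark, MXNet, …) sit behind containers that all expose one prediction interface, so the business app talks to a single frontend instead of one serving system per framework. The toy sketch below shows that layering only; the class names are illustrative and are not Clipper's actual API, and real Clipper adds adaptive batching and caching on top.

```python
class ModelContainer:
    """Common interface every framework-specific model is wrapped in."""
    def predict_batch(self, inputs):
        raise NotImplementedError

class XGBoostContainer(ModelContainer):
    # Would wrap a real xgboost.Booster; stubbed here.
    def predict_batch(self, inputs):
        return [1 if x.get("similarity", 0) > 0.5 else 0 for x in inputs]

class SklearnContainer(ModelContainer):
    # Would wrap a scikit-learn estimator; stubbed here.
    def predict_batch(self, inputs):
        return [1 if sum(x["attrs"]) >= 15 else 0 for x in inputs]

class PredictionFrontend:
    """Single entry point: apps call one API regardless of framework."""
    def __init__(self):
        self.apps = {}
    def register(self, app_name, container):
        self.apps[app_name] = container
    def predict(self, app_name, inputs):
        # Real Clipper also batches queued requests adaptively and
        # caches recent results; here we only dispatch to the right
        # container.
        return self.apps[app_name].predict_batch(inputs)

frontend = PredictionFrontend()
frontend.register("bug-duplicates", XGBoostContainer())
frontend.register("phishing", SklearnContainer())
```

With this shape, adding a new model (e.g. the Spark or MXNet use cases above) means writing one container, not standing up another bespoke serving system.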