PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
UC Berkeley, 05/23/2018
ML-as-a-Service
• Trained ML models are often deployed on cloud platforms as black boxes
• Users deploy multiple models per machine (10-100s)
• Deployed models are often similar
  - Similar structure (= sequence of operations)
  - Similar state (= read-only memory objects)
Requirements and limitations
Two key requirements in MaaS:
1. Performance: latency or throughput
   – Limited by data copies, JIT compilation, virtual calls, GC, …
2. Model density: number of models per machine
   – Heavily affects operational costs
   – State objects can be sizeable (100s of MB)
   – When RAM is exhausted, state spills to disk, killing latency
Breaking the black-box model
We need to know structure and state: a white-box model
1. To generate an optimised version of a model at deployment time: higher performance
2. To allocate shared state only once and share it among models: higher density
Outline
‣ Opening the black box
‣ Pretzel, white-box serving system
‣ Evaluation
‣ Conclusions and future work
OPENING THE BLACK BOX
Case study
• Models are Directed Acyclic Graphs (DAGs) of transformations: input parsers, featurizers, concatenation, predictors, …
• We started from 300 production-like models written in ML.NET [1], in C#
• We looked at their structure and found common patterns:
  - Models are constructed incrementally,
  - often from pre-defined templates (see the sketch below)
[1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet
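To make the template pattern concrete, here is a minimal C# sketch of incremental, template-based pipeline construction. It is an illustration only, not the ML.NET API; all type and member names (Pipeline, Then, Templates.Sentiment) are hypothetical:

using System;
using System.Collections.Generic;

// Hypothetical sketch: a pipeline is a DAG of transformations, linearized
// here as a simple chain for brevity.
public sealed class Pipeline
{
    private readonly List<Func<object, object>> _ops = new();

    // Models are built incrementally: each call appends one transformation.
    public Pipeline Then(Func<object, object> op) { _ops.Add(op); return this; }

    public object Predict(object input)
    {
        var current = input;
        foreach (var op in _ops)    // black-box execution: operators run one
            current = op(current);  // at a time, each with its own buffers
        return current;
    }
}

public static class Templates
{
    // A pre-defined template: many deployed models share this structure and
    // differ only in their read-only state (n-gram maps, model weights).
    public static Pipeline Sentiment(
        Func<object, object> tokenizer,
        Func<object, object> charNgram,
        Func<object, object> wordNgram,
        Func<object, object> concat,
        Func<object, object> scorer) =>
        new Pipeline().Then(tokenizer).Then(charNgram)
                      .Then(wordNgram).Then(concat).Then(scorer);
}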
Runtime characteristics
• Example is a sentiment analysis pipeline trained on the Amazon Reviews dataset [2] that we analysed previously [3]
• Distinct transformations are distinct calls
• Most of the time is spent in featurization
• No single bottleneck

[Figure 1: A sentiment analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string. Char and Word Ngrams featurize input tokens by extracting n-grams. Concat generates a unique feature vector which is then scored by a Linear Regression predictor. This is a simplification: the actual DAG contains about 12 operators.]
[Figure 2: Execution breakdown of the example model.]
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16
[3] A. Scolari, Y. Lee, M. Weimer and M. Interlandi, "Towards Accelerating Generic Machine Learning Prediction Pipelines," 2017 IEEE International Conference on Computer Design (ICCD)
Limitations of black-box
Limited performance:
• Overheads are a remarkable % of runtime: JIT, GC, virtual calls, …
• Initial (cold) predictions are ~100x slower than hot ones: 57.4% of a single cold prediction's execution time goes to pipeline analysis and initialization of the function chain, 36.5% to JIT compilation, and only the remainder to actual computation
• Operators have different buffers and break locality
• Operators are executed one at a time: no code fusion, multiple data accesses
Limited density:
• No state sharing: state is duplicated, so memory is wasted

[Figure 3: Probability for an operator to appear within the 250 different pipelines. Operators are identified by their parameters; the first two groups represent N-gram operators, which have multiple versions with different lengths (e.g., unigram, trigram).]
[Figure 4: CDF of latency of prediction requests of the 250 DAGs. The first prediction is denoted as cold; the hot line is the average over 100 predictions after a warm-up period of 10. The plot is normalized over the 99th percentile latency of the hot case.]
Related work
• Optimisations for single operators, like DNNs [4-5]
• TensorFlow Serving [6] deploys models as Servable Python objects
• ML.NET deploys models as zip files with state files and DLLs
• Clipper [7] and Rafiki [8] deploy pipelines as Docker containers
  – They schedule requests based on a latency target
  – They can apply caching and batching
• MauveDB [9] accepts regression and interpolation models and optimises them as DB views
[4] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[5] In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv, Apr. 2017
[6] C. Olston et al. TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS, 2017
[7] D. Crankshaw et al. Clipper: A low-latency online prediction serving system. In NSDI, 2017
[8] W. Wang, S. Wang, J. Gao et al. Rafiki: Machine Learning as an Analytics Service System. arXiv e-prints, Apr. 2018
[9] A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In SIGMOD, 2006
PRETZEL, WHITE-BOX SERVING SYSTEM
Design principles
White-box prediction serving: make pipelines co-exist better, and schedule them better
1. End-to-end optimizations: merge operators into computational units (logical stages) to decrease overheads
2. Multi-model optimizations: create once, use everywhere, for both data and stages (see the sketch below)
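A hedged sketch of "create once, use everywhere": physical stages can be cached by a signature, so two models whose plans contain an equivalent stage share one compiled instance. The names below (StageRegistry, GetOrCompile) are hypothetical, not Pretzel's actual API:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: one compiled physical stage per distinct signature,
// shared across all model plans that contain an equivalent logical stage.
public static class StageRegistry
{
    private static readonly ConcurrentDictionary<string, Func<float[], float[]>> Stages = new();

    // Compile a stage only the first time its signature is seen; afterwards,
    // every pipeline with the same signature reuses the cached delegate.
    public static Func<float[], float[]> GetOrCompile(
        string signature, Func<Func<float[], float[]>> compile)
        => Stages.GetOrAdd(signature, _ => compile());
}

Under this scheme the 250 sentiment-analysis models, which differ only in parameters, would resolve the same stage signatures and execute shared code units, with per-model parameters supplied at run time.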
Off-line phase [1] - Flour
Flour is a language-integrated API to express ML pipelines in dataflow style, like Spark or LINQ (currently targeting only ML.NET)

From the paper: model pipeline deployment and serving in PRETZEL follow a two-phase process. During the off-line phase, ML.NET's pre-trained pipelines are translated into Flour's transformation-based API. The Oven optimizer rearranges and fuses transformations into model plans composed of parameterized logical units called stages. Each logical stage is then AOT-compiled into physical computation units, where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase, when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store, and the Scheduler binds physical stages to shared execution units. Only the on-line phase is executed at inference time; model plans are generated completely off-line.

Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML.NET pipeline.

var fContext = new FlourContext(objectStore, ...);
var tTokenizer = fContext.CSV.
                 FromText(fields, fieldsType, sep).
                 Tokenize();

var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
var fPrgrm = tCNgram.
             Concat(tWNgram).
             ClassifierBinaryLinear(cParams);

return fPrgrm.Plan();

The arrays passed to FromText indicate the number and type of input fields; Tokenize splits the input fields into tokens. The char-level and word-level n-gram branches are merged by the Concat transform before the linear binary classifier. Both n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown). Each Flour transformation also accepts an optional set of statistics gathered from training (e.g., max vector size, dense or sparse representation), which the compiler uses to generate physical plans tailored to the model characteristics. The ML.NET library is instrumented to collect these statistics during training and, via bindings to the Object Store and Flour, to automatically extract Flour programs (see the sketch below).
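One use of such statistics, sketched here under stated assumptions: max vector size can pre-size pooled buffers so that no allocation happens on the prediction path. StageStats and VectorPool are hypothetical names, not Pretzel types:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: training-time statistics drive physical-plan choices,
// e.g. pre-sizing pooled vectors to the maximum size observed in training.
public sealed record StageStats(int MaxVectorSize, bool IsSparse);

public sealed class VectorPool
{
    private readonly ConcurrentBag<float[]> _pool = new();
    private readonly int _size;

    public VectorPool(StageStats stats) => _size = stats.MaxVectorSize;

    // Rent a buffer from the pool, allocating only if the pool is empty.
    public float[] Rent() => _pool.TryTake(out var v) ? v : new float[_size];
    public void Return(float[] v) => _pool.Add(v);
}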
[Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program; additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.]
Off-line phase [2] - Oven
• Oven optimises pipelines much like database queries, following Tupleware's hybrid approach [9]:
  – Memory-intensive operators (such as most featurizers) are merged into a logical stage and pipelined in a single pass over the data, to increase locality: records are likely to reside in CPU registers
  – Compute-intensive operators (e.g., vector or matrix multiplications) stay alone, so SIMD vectorization can be exploited, optimizing the number of instructions per record
  – One-to-one operators are chained together, while one-to-many operators can break locality (see the sketch below)
• Example: the Linear Regression can be pushed into CharNgram and WordNgram, bypassing the execution of Concat; Tokenizer is pipelined with CharNgram in one stage, with a dependency to WordNgram in another
• Oven's Model Plan Compiler (MPC) compiles logical stages into physical stages: each logical stage maps 1-to-n to physical implementations, and one is selected based on the stage's parameters and the available statistics
[9] A. Crotty, A. Galakatos, K. Dursun et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015
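A minimal sketch of this fusion policy, assuming a hypothetical Operator record with a Kind and a OneToMany flag (none of these names come from Pretzel's codebase):

using System;
using System.Collections.Generic;

public enum OpKind { MemoryIntensive, ComputeIntensive }

public sealed record Operator(string Name, OpKind Kind, bool OneToMany);

public static class Oven
{
    // Greedy fusion sketch: consecutive memory-intensive, one-to-one operators
    // are pipelined into one logical stage; compute-intensive or one-to-many
    // operators each get a stage of their own.
    public static List<List<Operator>> FuseIntoStages(IEnumerable<Operator> plan)
    {
        var stages = new List<List<Operator>>();
        List<Operator> current = null;
        foreach (var op in plan)
        {
            bool fusible = op.Kind == OpKind.MemoryIntensive && !op.OneToMany;
            if (fusible && current != null)
            {
                current.Add(op);              // pipeline into the open stage
            }
            else
            {
                current = new List<Operator> { op };
                stages.Add(current);
                if (!fusible) current = null; // this stage cannot be extended
            }
        }
        return stages;
    }
}

On a linearized sentiment pipeline, such a pass would fuse Tokenizer with CharNgram while isolating the compute-heavy scorer, mirroring the plan described above.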
On-line phase [1] - Runtime
• Two main components:
  – Runtime, with an Object Store
  – Scheduler
• Runtime handles physical resources:
  – Executor threads
  – Per-executor vector pools for buffers
  – Accepts requests in batch or in request-response mode
• Object Store caches objects of all pipelines:
  – During initialisation, a pipeline registers its objects
  – During inference, it retrieves them from the Object Store
  – Allows controlling object locality, replication, … (see the sketch below)
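A hedged sketch of the state-sharing idea behind the Object Store: read-only objects (n-gram maps, weight vectors, …) are registered once under a key and shared by every pipeline using the same parameters. The class and method names here (ObjectStore, GetOrRegister) are hypothetical:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch of a shared Object Store: identical read-only state is
// allocated once and reused by all similar pipelines, raising model density.
public sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects = new();

    // Called at pipeline initialisation: returns the cached instance if an
    // identical object was already registered by a similar pipeline.
    public T GetOrRegister<T>(string key, Func<string, T> load) where T : class
        => (T)_objects.GetOrAdd(key, k => load(k));
}

Under this scheme, two sentiment models built from the same template would resolve the same keys for their featurizer dictionaries and hold a single copy in memory.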
On-line phase [2] - Scheduler
• Event-based scheduling: each physical stage to be run corresponds to an event
  – The first stage of a new request has low priority
  – Subsequent stages have high priority
• Buffer vectors are allocated lazily at the beginning of request serving
  – The different priorities help both latency and buffer locality
  – When a prediction ends, its buffers are recycled (see the sketch below)
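A minimal sketch of such a two-priority, event-based scheduler, assuming stages are submitted as plain delegates (an illustration, not Pretzel's scheduler):

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: in-flight requests (later stages) are drained before
// new requests are admitted, bounding latency and keeping their lazily
// allocated buffers hot in cache.
public sealed class StageScheduler
{
    private readonly ConcurrentQueue<Action> _high = new(); // later stages
    private readonly ConcurrentQueue<Action> _low = new();  // first stages

    public void Submit(Action stageEvent, bool firstStageOfRequest)
    {
        if (firstStageOfRequest) _low.Enqueue(stageEvent);
        else _high.Enqueue(stageEvent);
    }

    // One executor-thread iteration: always prefer high-priority events.
    public bool RunOne()
    {
        if (_high.TryDequeue(out var e) || _low.TryDequeue(out e))
        {
            e();
            return true;
        }
        return false;
    }
}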
EVALUATION
Workload and testbed
• Two model classes written in ML.NET, run on both ML.NET and Pretzel:
  – 250 Sentiment Analysis (SA) models
  – 50 Regression Task (RT) models
• Testbed representing a small production server:
  – 2x 8-core Xeon E5-2620 v4 at 2.10 GHz, HT enabled
  – 32 GB RAM
  – Windows 10
• Four scenarios:
  – Memory: evaluate model memory density thanks to state sharing
  – Latency: evaluate request-response latency thanks to the optimisations
  – Batch: evaluate scheduler effectiveness
  – Heavy load: evaluate mixed scenarios with latency degradation
Memory
• 7x less memory for SA models, 40% less for RT models
• Higher model density means higher efficiency and profitability

[Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by the maximum usage with Object Store for SA models.]
Latency
• Hot vs cold scenarios, without and with caching of partial results
• ML.NET cold: 10-300x slower w.r.t. hot
• Pretzel cold: 3-20x slower than hot
• Pretzel vs ML.NET: cold is 15x faster, hot is 2.6x
• For single-model runs, the average improvement is 2.5x; the gain grows with multiple requests
• Keeping track of plan parameters also reduces the cost of loading models: Pretzel loads models 7.4x faster than ML.NET

[Figure 9: Latency comparison between ML.NET and PRETZEL, with values normalized over ML.NET's P99 hot latency. Cliffs indicate pipelines with big differences in latency.]
[Figure 10: Latency of PRETZEL running SA models without and with caching.]
Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• Without delay batching:
  – Almost linear scaling with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled
  – ML.NET suffers from missing data sharing: even when data values are the same, model objects are mapped to different memory areas, increasing pressure on the memory subsystem
• With delay batching (as in Clipper; not supported in ML.NET):
  – Users specify a maximum delay they can tolerate; requests are batched while latency stays below that delay (see the sketch below)
  – Even a small delay helps reach the optimal batch size, while latency degrades gracefully

[Figure 11: Average throughput, scaling the number of cores on the x-axis.]
[Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.]
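A hedged sketch of delay batching: requests accumulate until either the target batch size is reached or the tolerated delay expires, then the whole batch is flushed to the executor. All names are hypothetical, and a real implementation would also need a timer to flush when no new requests arrive:

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical sketch: trade a bounded per-request delay for larger batches
// and therefore higher throughput.
public sealed class DelayBatcher<T>
{
    private readonly List<T> _pending = new();
    private readonly Stopwatch _sinceFirst = new();
    private readonly int _targetBatch;
    private readonly TimeSpan _maxDelay;
    private readonly Action<IReadOnlyList<T>> _flush;

    public DelayBatcher(int targetBatch, TimeSpan maxDelay, Action<IReadOnlyList<T>> flush)
        => (_targetBatch, _maxDelay, _flush) = (targetBatch, maxDelay, flush);

    public void Add(T request)
    {
        if (_pending.Count == 0) _sinceFirst.Restart();
        _pending.Add(request);
        // Flush when the batch is full or the oldest request waited long enough.
        if (_pending.Count >= _targetBatch || _sinceFirst.Elapsed >= _maxDelay)
        {
            _flush(_pending.ToArray());
            _pending.Clear();
        }
    }
}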
Heavy load
• Load all 300 models upfront
• 3 threads generate requests, 29 score them
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully, increasing linearly with the load

[Figure 13: Throughput and latency of PRETZEL under heavy load.]
CONCLUSIONS AND FUTURE WORK
Conclusions
• We addressed performance/density bottlenecks in ML inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem:
  - buffer/locality issues
  - code generation
  - runtime, hardware-aware scheduling
Off-line phase
• Oven has pre-written implementations for logical and physical stages: not maintainable in the long run
• We are adding code generation of stages, possibly with hardware-specific templates [10]
• Tune code generation for more aggressive optimizations
• Support more model formats than ML.NET, like ONNX [11]
[10] K. Krikellas, S. Viglas et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
On-line phase
• Distributed version
• Main work is on NUMA awareness, to improve locality and scheduling:
  – Schedule on the NUMA node where most of the shared state resides
  – Duplicate hot, shared state objects across nodes for lower latency
  – Scale better, avoiding bottlenecks on the coherence bus

Pretzel is currently under submission at OSDI '18

QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
  • 17. 14 Off-line phase [2] - Oven
• Oven optimises pipelines much like a database optimises queries
• It follows Tupleware's hybrid approach [9] (see the sketch below):
– Memory-intensive operators are merged into a single logical stage, to increase data locality
– Compute-intensive operators stay alone, to leverage SIMD
– One-to-one operators are chained together, while one-to-many operators can break locality
• Oven's Model Plan Compiler compiles logical stages into physical stages, based on parameters and statistics
[9] A. Crotty, A. Galakatos, K. Dursun, et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015
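As a toy illustration of the fusion rules above, the sketch below groups a linear chain of operators into stages: memory-intensive one-to-one operators are chained together, while compute-intensive or one-to-many operators get a stage of their own. All names (Operator, Stage, the Kind/IsOneToOne flags) are hypothetical simplifications; Pretzel's actual optimiser works on a full DAG, not a chain.

using System.Collections.Generic;

enum OperatorKind { MemoryIntensive, ComputeIntensive }

class Operator
{
    public string Name;
    public OperatorKind Kind;
    public bool IsOneToOne; // one output record per input record
    public Operator Next;   // simplification: a linear chain, not a DAG
}

class Stage { public List<Operator> Operators = new List<Operator>(); }

static class OvenSketch
{
    public static List<Stage> Fuse(Operator head)
    {
        var stages = new List<Stage>();
        var current = new Stage();
        for (var op = head; op != null; op = op.Next)
        {
            // Compute-intensive operators run alone so their tight loops can
            // be SIMD-vectorized; one-to-many operators break locality, so
            // they also close the current stage.
            bool isolate = op.Kind == OperatorKind.ComputeIntensive || !op.IsOneToOne;
            if (isolate)
            {
                if (current.Operators.Count > 0) stages.Add(current);
                stages.Add(new Stage { Operators = { op } });
                current = new Stage();
            }
            else
            {
                current.Operators.Add(op); // chain memory-intensive 1-to-1 ops
            }
        }
        if (current.Operators.Count > 0) stages.Add(current);
        return stages;
    }
}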
  • 18. 15 On-line phase [1] - Runtime
• Two main components:
– Runtime, with an Object Store
– Scheduler
• The Runtime handles physical resources:
– Executor threads
– Per-executor vector pools for buffers
– Accepts requests in batch or request-response mode
• The Object Store caches the state objects of all pipelines (a sketch follows):
– During initialisation, a pipeline registers its objects
– During inference, it retrieves them from the Object Store
– This allows controlling object locality, replication, …
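A minimal sketch of the register/retrieve mechanics described above, assuming state objects are keyed by a content hash so pipelines with identical parameters share one allocation. The class and method names (ObjectStore, Register, Retrieve) are illustrative, not Pretzel's actual API.

using System.Collections.Concurrent;

class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects =
        new ConcurrentDictionary<string, object>();

    // Called at pipeline registration: returns the cached instance if an
    // identical object was already registered by another pipeline.
    public T Register<T>(string contentHash, T obj) where T : class
        => (T)_objects.GetOrAdd(contentHash, obj);

    // Called at inference time to parameterize a physical stage.
    public T Retrieve<T>(string contentHash) where T : class
        => (T)_objects[contentHash];
}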
  • 19. 16 On-line phase [2] - Scheduler
• Event-based scheduling: each physical stage to be run corresponds to an event (see the sketch below)
– The first stage of a new request has low priority
– Following stages have high priority
• Buffer vectors are allocated lazily at the beginning of request serving
– The different priorities help both latency and buffer locality
– When a prediction completes, its buffers are recycled
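A minimal sketch of this priority policy, assuming a StageEvent record and a .NET PriorityQueue; in-flight requests finish (and recycle their buffers) before new requests allocate theirs. The types are hypothetical, not Pretzel's actual scheduler.

using System.Collections.Generic;

record StageEvent(int RequestId, int StageIndex);

class EventScheduler
{
    // PriorityQueue (.NET 6+): lower priority value is dequeued first.
    private readonly PriorityQueue<StageEvent, int> _events =
        new PriorityQueue<StageEvent, int>();

    public void Submit(StageEvent e)
    {
        // First stage of a new request: low priority (1);
        // following stages: high priority (0).
        _events.Enqueue(e, e.StageIndex == 0 ? 1 : 0);
    }

    public bool TryGetNext(out StageEvent e) => _events.TryDequeue(out e, out _);
}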
  • 21. 18 Workload and testbed
• Two model classes written in ML.NET, run both in ML.NET and in Pretzel:
– 250 Sentiment Analysis (SA) models
– 50 Regression Task (RT) models
• Testbed representing a small production server:
– 2× 8-core Xeon E5-2620 v4 at 2.10 GHz, Hyper-Threading enabled
– 32 GB RAM
– Windows 10
• Four scenarios:
– Memory: evaluate model density thanks to state sharing
– Latency: evaluate request-response latency thanks to the optimisations
– Batch: evaluate scheduler effectiveness
– Heavy load: evaluate mixed scenarios with latency degradation
  • 22. 19 Memory
• Sharing common objects uses up to 7× less memory for SA models and around 40% less for RT models
• Sharing plan parameters also speeds loading: Pretzel loads models 7.4× faster than ML.NET
• Higher model density means higher efficiency and profitability
Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by the maximum usage with Object Store for SA models.
  • 23. 20 Latency
• Hot vs cold scenarios, without and with caching of partial results
• ML.NET cold predictions are 10-300× slower than hot ones
• Pretzel cold predictions are 3-20× slower than hot ones
• Pretzel vs ML.NET: 15× faster when cold, 2.6× faster when hot
• For single-model runs, the average improvement is 2.5×; the gain grows with multiple requests
Figure 9: Latency comparison between ML.NET and Pretzel, with values normalized over ML.NET's P99 hot latency. Cliffs indicate pipelines with big differences in latency.
Figure 10: Latency of Pretzel running SA models without and with caching.
  • 24. 21 Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• Without delay batching:
– ML.NET suffers from missed data sharing: even when the data values are the same, model objects are mapped to different memory areas, increasing pressure on the memory subsystem
– Pretzel scales almost linearly with the number of CPU cores
• Delay batching: as in Clipper, let requests tolerate a given delay, and keep batching while latency is smaller than the tolerated delay (sketched below)
– Not supported in ML.NET
– Even a small delay helps reach the optimal batch size, while latency degrades gracefully
Figure 11: Average throughput as the number of cores grows; Pretzel scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled.
Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.
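A minimal sketch of the delay-batching policy just described: requests accumulate into the current batch while the oldest one has waited less than the user-tolerated delay, then the batch is flushed for scoring. The DelayBatcher type is invented for illustration.

using System;
using System.Collections.Generic;
using System.Diagnostics;

class DelayBatcher<T>
{
    private readonly TimeSpan _maxDelay;       // user-tolerated delay
    private readonly List<T> _batch = new List<T>();
    private readonly Stopwatch _waiting = new Stopwatch();

    public DelayBatcher(TimeSpan maxDelay) { _maxDelay = maxDelay; }

    // Returns the batch to score once the oldest queued request has
    // waited for the tolerated delay; otherwise returns null and
    // keeps batching.
    public IReadOnlyList<T> Add(T request)
    {
        if (_batch.Count == 0) _waiting.Restart();
        _batch.Add(request);
        if (_waiting.Elapsed < _maxDelay) return null;
        var ready = _batch.ToArray();
        _batch.Clear();
        return ready;
    }
}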
  • 25. 22 Heavy load
• All 300 models are loaded upfront
• 3 threads generate requests, 29 threads score them
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully as the load increases
Figure 13: Throughput and latency of Pretzel under heavy load.
  • 27. 24 Conclusions
• We addressed performance and density bottlenecks in ML inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem:
– buffer/locality issues
– code generation
– runtime, hardware-aware scheduling
  • 28. 25 Off-line phase
• Oven currently has pre-written implementations for logical and physical stages: not maintainable in the long run
• We are adding code generation of stages, possibly with hardware-specific templates [10] (a toy example follows)
• Tune code generation for more aggressive optimizations
• Support more model formats beyond ML.NET, like ONNX [11]
[10] K. Krikellas, S. Viglas, et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
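A toy illustration, in the spirit of [10], of what template-based stage generation could look like: a per-stage code template is instantiated with operator-specific snippets and then compiled ahead of time. The template and placeholder names below are invented, not part of Pretzel.

static class StageCodeGen
{
    const string Template = @"
public static void STAGE_NAME(float[] input, float[] output)
{
    for (int i = 0; i < input.Length; i++)
    {
        LOOP_BODY
    }
}";

    // Fill the placeholders with the stage name and the per-record body.
    public static string Instantiate(string stageName, string loopBody)
        => Template.Replace("STAGE_NAME", stageName)
                   .Replace("LOOP_BODY", loopBody);
}

For example, Instantiate("ScaleStage", "output[i] = input[i] * 2.0f;") yields a compilable C# method that an AOT step could turn into a physical stage.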
  • 29. 26 On-line phase
• Distributed version
• The main work is on NUMA awareness, to improve locality and scheduling (see the sketch below):
– Schedule on the NUMA node where most of the shared state resides
– Duplicate hot, shared state objects across nodes for lower latency
– Scale better, avoiding bottlenecks on the coherence bus
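A hedged sketch of the placement heuristic in the first bullet: given, per NUMA node, how many bytes of a model's shared state already reside there, schedule the request on the node holding the most. This is future work, so the code below is a design sketch, not something Pretzel implements today.

using System.Collections.Generic;
using System.Linq;

static class NumaPlacement
{
    // bytesPerNode[n] = bytes of this model's shared state resident on node n.
    public static int PickNode(IReadOnlyDictionary<int, long> bytesPerNode)
        => bytesPerNode.OrderByDescending(kv => kv.Value).First().Key;
}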
  • 30. Pretzel is currently under submission at OSDI '18
QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it