PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
UC Berkeley, 05/23/2018
ML-as-a-Service
• Trained ML models are often deployed on cloud platforms as black boxes
• Users deploy multiple models per machine (10-100s)
• Deployed models are often similar
  - Similar structure (= sequence of operations)
  - Similar state (= read-only memory objects)
Requirements and limitations
Two key requirements in MaaS:
1. Performance: latency or throughput
   – Limited by data copies, JIT compilation, virtual calls, GC, …
2. Model density: number of models per machine
   – Heavily affects operational costs
   – State objects can be sizeable (100s of MB)
   – When RAM is exhausted, state spills to disk, killing latency
Breaking the black-box model
We need to know structure and state: a white-box model
1. To generate an optimised version of a model at deployment time: higher performance
2. To allocate shared state only once and share it among models: higher density
Outline
‣ Opening the black box
‣ Pretzel, white-box serving system
‣ Evaluation
‣ Conclusions and future work
OPENING THE BLACK BOX
Case study
• Models are Directed Acyclic Graphs (DAGs) of transformations: input parsers, featurizers, concatenation, predictors, …
• We started from 300 production-like models written in ML.NET [1], in C#
• We looked at their structure and found common patterns:
  - Models are constructed incrementally,
  - often from pre-defined templates (see the sketch below)
[1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet
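To make the template pattern concrete, here is a minimal C# sketch of incremental, template-based pipeline construction. It is an illustration only, not the ML.NET API; all type and member names (Pipeline, Then, Templates.Sentiment) are hypothetical:

using System;
using System.Collections.Generic;

// Hypothetical sketch: a pipeline is a DAG of transformations, linearized
// here as a simple chain for brevity.
public sealed class Pipeline
{
    private readonly List<Func<object, object>> _ops = new();

    // Models are built incrementally: each call appends one transformation.
    public Pipeline Then(Func<object, object> op) { _ops.Add(op); return this; }

    public object Predict(object input)
    {
        var current = input;
        foreach (var op in _ops)    // black-box execution: operators run one
            current = op(current);  // at a time, each with its own buffers
        return current;
    }
}

public static class Templates
{
    // A pre-defined template: many deployed models share this structure and
    // differ only in their read-only state (n-gram maps, model weights).
    public static Pipeline Sentiment(
        Func<object, object> tokenizer,
        Func<object, object> charNgram,
        Func<object, object> wordNgram,
        Func<object, object> concat,
        Func<object, object> scorer) =>
        new Pipeline().Then(tokenizer).Then(charNgram)
                      .Then(wordNgram).Then(concat).Then(scorer);
}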
Runtime characteristics
• Example is a sentiment analysis pipeline trained on the Amazon Reviews dataset [2] that we analysed previously [3]
• Distinct transformations are distinct calls
• Most of the time is spent in featurization
• No single bottleneck

[Figure 1: A sentiment analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string. Char and Word Ngrams featurize input tokens by extracting n-grams. Concat generates a unique feature vector which is then scored by a Linear Regression predictor. This is a simplification: the actual DAG contains about 12 operators.]
[Figure 2: Execution breakdown of the example model.]
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16
[3] A. Scolari, Y. Lee, M. Weimer and M. Interlandi, "Towards Accelerating Generic Machine Learning Prediction Pipelines," 2017 IEEE International Conference on Computer Design (ICCD)
Limitations of black-box
Limited performance:
• Overheads are a remarkable % of runtime: JIT, GC, virtual calls, …
• Initial (cold) predictions are ~100x slower than hot ones: 57.4% of a single cold prediction's execution time goes to pipeline analysis and initialization of the function chain, 36.5% to JIT compilation, and only the remainder to actual computation
• Operators have different buffers and break locality
• Operators are executed one at a time: no code fusion, multiple data accesses
Limited density:
• No state sharing: state is duplicated, so memory is wasted

[Figure 3: Probability for an operator to appear within the 250 different pipelines. Operators are identified by their parameters; the first two groups represent N-gram operators, which have multiple versions with different lengths (e.g., unigram, trigram).]
[Figure 4: CDF of latency of prediction requests of the 250 DAGs. The first prediction is denoted as cold; the hot line is the average over 100 predictions after a warm-up period of 10. The plot is normalized over the 99th percentile latency of the hot case.]
Related work
• Optimisations for single operators, like DNNs [4-5]
• TensorFlow Serving [6] deploys models as Servable Python objects
• ML.NET deploys models as zip files with state files and DLLs
• Clipper [7] and Rafiki [8] deploy pipelines as Docker containers
  – They schedule requests based on a latency target
  – They can apply caching and batching
• MauveDB [9] accepts regression and interpolation models and optimises them as DB views
[4] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[5] In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv, Apr. 2017
[6] C. Olston et al. TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS, 2017
[7] D. Crankshaw et al. Clipper: A low-latency online prediction serving system. In NSDI, 2017
[8] W. Wang, S. Wang, J. Gao et al. Rafiki: Machine Learning as an Analytics Service System. arXiv e-prints, Apr. 2018
[9] A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In SIGMOD, 2006
PRETZEL, WHITE-BOX SERVING SYSTEM
Design principles
White-box prediction serving: make pipelines co-exist better, and schedule them better
1. End-to-end optimizations: merge operators into computational units (logical stages) to decrease overheads
2. Multi-model optimizations: create once, use everywhere, for both data and stages (see the sketch below)
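A hedged sketch of "create once, use everywhere": physical stages can be cached by a signature, so two models whose plans contain an equivalent stage share one compiled instance. The names below (StageRegistry, GetOrCompile) are hypothetical, not Pretzel's actual API:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: one compiled physical stage per distinct signature,
// shared across all model plans that contain an equivalent logical stage.
public static class StageRegistry
{
    private static readonly ConcurrentDictionary<string, Func<float[], float[]>> Stages = new();

    // Compile a stage only the first time its signature is seen; afterwards,
    // every pipeline with the same signature reuses the cached delegate.
    public static Func<float[], float[]> GetOrCompile(
        string signature, Func<Func<float[], float[]>> compile)
        => Stages.GetOrAdd(signature, _ => compile());
}

Under this scheme the 250 sentiment-analysis models, which differ only in parameters, would resolve the same stage signatures and execute shared code units, with per-model parameters supplied at run time.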
Off-line phase [1] - Flour
Flour is a language-integrated API to express ML pipelines in dataflow style, like Spark or LINQ (currently targeting only ML.NET)

From the paper: model pipeline deployment and serving in PRETZEL follow a two-phase process. During the off-line phase, ML.NET's pre-trained pipelines are translated into Flour's transformation-based API. The Oven optimizer rearranges and fuses transformations into model plans composed of parameterized logical units called stages. Each logical stage is then AOT-compiled into physical computation units, where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase, when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store, and the Scheduler binds physical stages to shared execution units. Only the on-line phase is executed at inference time; model plans are generated completely off-line.

Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML.NET pipeline.

var fContext = new FlourContext(objectStore, ...);
var tTokenizer = fContext.CSV.
                 FromText(fields, fieldsType, sep).
                 Tokenize();

var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
var fPrgrm = tCNgram.
             Concat(tWNgram).
             ClassifierBinaryLinear(cParams);

return fPrgrm.Plan();

The arrays passed to FromText indicate the number and type of input fields; Tokenize splits the input fields into tokens. The char-level and word-level n-gram branches are merged by the Concat transform before the linear binary classifier. Both n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown). Each Flour transformation also accepts an optional set of statistics gathered from training (e.g., max vector size, dense or sparse representation), which the compiler uses to generate physical plans tailored to the model characteristics. The ML.NET library is instrumented to collect these statistics during training and, via bindings to the Object Store and Flour, to automatically extract Flour programs (see the sketch below).
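One use of such statistics, sketched here under stated assumptions: max vector size can pre-size pooled buffers so that no allocation happens on the prediction path. StageStats and VectorPool are hypothetical names, not Pretzel types:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: training-time statistics drive physical-plan choices,
// e.g. pre-sizing pooled vectors to the maximum size observed in training.
public sealed record StageStats(int MaxVectorSize, bool IsSparse);

public sealed class VectorPool
{
    private readonly ConcurrentBag<float[]> _pool = new();
    private readonly int _size;

    public VectorPool(StageStats stats) => _size = stats.MaxVectorSize;

    // Rent a buffer from the pool, allocating only if the pool is empty.
    public float[] Rent() => _pool.TryTake(out var v) ? v : new float[_size];
    public void Return(float[] v) => _pool.Add(v);
}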
[Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program; additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.]
Off-line phase [2] - Oven
• Oven optimises pipelines much like database queries, following Tupleware's hybrid approach [9]:
  – Memory-intensive operators (such as most featurizers) are merged into a logical stage and pipelined in a single pass over the data, to increase locality: records are likely to reside in CPU registers
  – Compute-intensive operators (e.g., vector or matrix multiplications) stay alone, so SIMD vectorization can be exploited, optimizing the number of instructions per record
  – One-to-one operators are chained together, while one-to-many operators can break locality (see the sketch below)
• Example: the Linear Regression can be pushed into CharNgram and WordNgram, bypassing the execution of Concat; Tokenizer is pipelined with CharNgram in one stage, with a dependency to WordNgram in another
• Oven's Model Plan Compiler (MPC) compiles logical stages into physical stages: each logical stage maps 1-to-n to physical implementations, and one is selected based on the stage's parameters and the available statistics
[9] A. Crotty, A. Galakatos, K. Dursun et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015
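A minimal sketch of this fusion policy, assuming a hypothetical Operator record with a Kind and a OneToMany flag (none of these names come from Pretzel's codebase):

using System;
using System.Collections.Generic;

public enum OpKind { MemoryIntensive, ComputeIntensive }

public sealed record Operator(string Name, OpKind Kind, bool OneToMany);

public static class Oven
{
    // Greedy fusion sketch: consecutive memory-intensive, one-to-one operators
    // are pipelined into one logical stage; compute-intensive or one-to-many
    // operators each get a stage of their own.
    public static List<List<Operator>> FuseIntoStages(IEnumerable<Operator> plan)
    {
        var stages = new List<List<Operator>>();
        List<Operator> current = null;
        foreach (var op in plan)
        {
            bool fusible = op.Kind == OpKind.MemoryIntensive && !op.OneToMany;
            if (fusible && current != null)
            {
                current.Add(op);              // pipeline into the open stage
            }
            else
            {
                current = new List<Operator> { op };
                stages.Add(current);
                if (!fusible) current = null; // this stage cannot be extended
            }
        }
        return stages;
    }
}

On a linearized sentiment pipeline, such a pass would fuse Tokenizer with CharNgram while isolating the compute-heavy scorer, mirroring the plan described above.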
On-line phase [1] - Runtime
• Two main components:
  – Runtime, with an Object Store
  – Scheduler
• Runtime handles physical resources:
  – Executor threads
  – Per-executor vector pools for buffers
  – Accepts requests in batch or in request-response mode
• Object Store caches objects of all pipelines:
  – During initialisation, a pipeline registers its objects
  – During inference, it retrieves them from the Object Store
  – Allows controlling object locality, replication, … (see the sketch below)
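A hedged sketch of the state-sharing idea behind the Object Store: read-only objects (n-gram maps, weight vectors, …) are registered once under a key and shared by every pipeline using the same parameters. The class and method names here (ObjectStore, GetOrRegister) are hypothetical:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch of a shared Object Store: identical read-only state is
// allocated once and reused by all similar pipelines, raising model density.
public sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects = new();

    // Called at pipeline initialisation: returns the cached instance if an
    // identical object was already registered by a similar pipeline.
    public T GetOrRegister<T>(string key, Func<string, T> load) where T : class
        => (T)_objects.GetOrAdd(key, k => load(k));
}

Under this scheme, two sentiment models built from the same template would resolve the same keys for their featurizer dictionaries and hold a single copy in memory.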
On-line phase [2] - Scheduler
• Event-based scheduling: each physical stage to be run corresponds to an event
  – The first stage of a new request has low priority
  – Subsequent stages have high priority
• Buffer vectors are allocated lazily at the beginning of request serving
  – The different priorities help both latency and buffer locality
  – When a prediction ends, its buffers are recycled (see the sketch below)
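A minimal sketch of such a two-priority, event-based scheduler, assuming stages are submitted as plain delegates (an illustration, not Pretzel's scheduler):

using System;
using System.Collections.Concurrent;

// Hypothetical sketch: in-flight requests (later stages) are drained before
// new requests are admitted, bounding latency and keeping their lazily
// allocated buffers hot in cache.
public sealed class StageScheduler
{
    private readonly ConcurrentQueue<Action> _high = new(); // later stages
    private readonly ConcurrentQueue<Action> _low = new();  // first stages

    public void Submit(Action stageEvent, bool firstStageOfRequest)
    {
        if (firstStageOfRequest) _low.Enqueue(stageEvent);
        else _high.Enqueue(stageEvent);
    }

    // One executor-thread iteration: always prefer high-priority events.
    public bool RunOne()
    {
        if (_high.TryDequeue(out var e) || _low.TryDequeue(out e))
        {
            e();
            return true;
        }
        return false;
    }
}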
EVALUATION
Workload and testbed
• Two model classes written in ML.NET, run on both ML.NET and Pretzel:
  – 250 Sentiment Analysis (SA) models
  – 50 Regression Task (RT) models
• Testbed representing a small production server:
  – 2x 8-core Xeon E5-2620 v4 at 2.10 GHz, HT enabled
  – 32 GB RAM
  – Windows 10
• Four scenarios:
  – Memory: evaluate model memory density thanks to state sharing
  – Latency: evaluate request-response latency thanks to the optimisations
  – Batch: evaluate scheduler effectiveness
  – Heavy load: evaluate mixed scenarios with latency degradation
Memory
• 7x less memory for SA models, 40% less for RT models
• Higher model density means higher efficiency and profitability

[Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by the maximum usage with Object Store for SA models.]
Latency
• Hot vs cold scenarios, without and with caching of partial results
• ML.NET cold: 10-300x slower w.r.t. hot
• Pretzel cold: 3-20x slower than hot
• Pretzel vs ML.NET: cold is 15x faster, hot is 2.6x
• For single-model runs, the average improvement is 2.5x; the gain grows with multiple requests
• Keeping track of plan parameters also reduces the cost of loading models: Pretzel loads models 7.4x faster than ML.NET

[Figure 9: Latency comparison between ML.NET and PRETZEL, with values normalized over ML.NET's P99 hot latency. Cliffs indicate pipelines with big differences in latency.]
[Figure 10: Latency of PRETZEL running SA models without and with caching.]
Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• Without delay batching:
  – Almost linear scaling with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled
  – ML.NET suffers from missing data sharing: even when data values are the same, model objects are mapped to different memory areas, increasing pressure on the memory subsystem
• With delay batching (as in Clipper; not supported in ML.NET):
  – Users specify a maximum delay they can tolerate; requests are batched while latency stays below that delay (see the sketch below)
  – Even a small delay helps reach the optimal batch size, while latency degrades gracefully

[Figure 11: Average throughput, scaling the number of cores on the x-axis.]
[Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.]
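A hedged sketch of delay batching: requests accumulate until either the target batch size is reached or the tolerated delay expires, then the whole batch is flushed to the executor. All names are hypothetical, and a real implementation would also need a timer to flush when no new requests arrive:

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical sketch: trade a bounded per-request delay for larger batches
// and therefore higher throughput.
public sealed class DelayBatcher<T>
{
    private readonly List<T> _pending = new();
    private readonly Stopwatch _sinceFirst = new();
    private readonly int _targetBatch;
    private readonly TimeSpan _maxDelay;
    private readonly Action<IReadOnlyList<T>> _flush;

    public DelayBatcher(int targetBatch, TimeSpan maxDelay, Action<IReadOnlyList<T>> flush)
        => (_targetBatch, _maxDelay, _flush) = (targetBatch, maxDelay, flush);

    public void Add(T request)
    {
        if (_pending.Count == 0) _sinceFirst.Restart();
        _pending.Add(request);
        // Flush when the batch is full or the oldest request waited long enough.
        if (_pending.Count >= _targetBatch || _sinceFirst.Elapsed >= _maxDelay)
        {
            _flush(_pending.ToArray());
            _pending.Clear();
        }
    }
}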
Heavy load
• Load all 300 models upfront
• 3 threads generate requests, 29 score them
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully, increasing linearly with the load

[Figure 13: Throughput and latency of PRETZEL under heavy load.]
CONCLUSIONS AND FUTURE WORK
Conclusions
• We addressed performance/density bottlenecks in ML inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem:
  - buffer/locality issues
  - code generation
  - runtime, hardware-aware scheduling
Off-line phase
• Oven has pre-written implementations for logical and physical stages: not maintainable in the long run
• We are adding code generation of stages, possibly with hardware-specific templates [10]
• Tune code generation for more aggressive optimizations
• Support more model formats than ML.NET, like ONNX [11]
[10] K. Krikellas, S. Viglas et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
On-line phase
• Distributed version
• Main work is on NUMA awareness, to improve locality and scheduling:
  – Schedule on the NUMA node where most of the shared state resides
  – Duplicate hot, shared state objects across nodes for lower latency
  – Scale better, avoiding bottlenecks on the coherence bus

Pretzel is currently under submission at OSDI '18

QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
  • 17. 14 Off-line phase [2] - Oven
• Oven optimises pipelines much like a database optimises queries
• It follows Tupleware's hybrid approach [9] (see the sketch below):
– Memory-intensive operators are merged into a single logical stage, to increase data locality
– Compute-intensive operators stay alone, to leverage SIMD
– One-to-one operators are chained together, while one-to-many operators can break locality
• Oven's Model Plan Compiler compiles logical stages into physical stages, based on parameters and statistics
[9] A. Crotty, A. Galakatos, K. Dursun, et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015
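As a toy illustration of the fusion rules above, the sketch below groups a linear chain of operators into stages: memory-intensive one-to-one operators are chained together, while compute-intensive or one-to-many operators get a stage of their own. All names (Operator, Stage, the Kind/IsOneToOne flags) are hypothetical simplifications; Pretzel's actual optimiser works on a full DAG, not a chain.

using System.Collections.Generic;

enum OperatorKind { MemoryIntensive, ComputeIntensive }

class Operator
{
    public string Name;
    public OperatorKind Kind;
    public bool IsOneToOne; // one output record per input record
    public Operator Next;   // simplification: a linear chain, not a DAG
}

class Stage { public List<Operator> Operators = new List<Operator>(); }

static class OvenSketch
{
    public static List<Stage> Fuse(Operator head)
    {
        var stages = new List<Stage>();
        var current = new Stage();
        for (var op = head; op != null; op = op.Next)
        {
            // Compute-intensive operators run alone so their tight loops can
            // be SIMD-vectorized; one-to-many operators break locality, so
            // they also close the current stage.
            bool isolate = op.Kind == OperatorKind.ComputeIntensive || !op.IsOneToOne;
            if (isolate)
            {
                if (current.Operators.Count > 0) stages.Add(current);
                stages.Add(new Stage { Operators = { op } });
                current = new Stage();
            }
            else
            {
                current.Operators.Add(op); // chain memory-intensive 1-to-1 ops
            }
        }
        if (current.Operators.Count > 0) stages.Add(current);
        return stages;
    }
}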
  • 18. 15 On-line phase [1] - Runtime
• Two main components:
– Runtime, with an Object Store
– Scheduler
• The Runtime handles physical resources:
– Executor threads
– Per-executor vector pools for buffers
– Accepts requests in batch or request-response mode
• The Object Store caches the state objects of all pipelines (a sketch follows):
– During initialisation, a pipeline registers its objects
– During inference, it retrieves them from the Object Store
– This allows controlling object locality, replication, …
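A minimal sketch of the register/retrieve mechanics described above, assuming state objects are keyed by a content hash so pipelines with identical parameters share one allocation. The class and method names (ObjectStore, Register, Retrieve) are illustrative, not Pretzel's actual API.

using System.Collections.Concurrent;

class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects =
        new ConcurrentDictionary<string, object>();

    // Called at pipeline registration: returns the cached instance if an
    // identical object was already registered by another pipeline.
    public T Register<T>(string contentHash, T obj) where T : class
        => (T)_objects.GetOrAdd(contentHash, obj);

    // Called at inference time to parameterize a physical stage.
    public T Retrieve<T>(string contentHash) where T : class
        => (T)_objects[contentHash];
}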
  • 19. 16 On-line phase [2] - Scheduler
• Event-based scheduling: each physical stage to be run corresponds to an event (see the sketch below)
– The first stage of a new request has low priority
– Following stages have high priority
• Buffer vectors are allocated lazily at the beginning of request serving
– The different priorities help both latency and buffer locality
– When a prediction completes, its buffers are recycled
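A minimal sketch of this priority policy, assuming a StageEvent record and a .NET PriorityQueue; in-flight requests finish (and recycle their buffers) before new requests allocate theirs. The types are hypothetical, not Pretzel's actual scheduler.

using System.Collections.Generic;

record StageEvent(int RequestId, int StageIndex);

class EventScheduler
{
    // PriorityQueue (.NET 6+): lower priority value is dequeued first.
    private readonly PriorityQueue<StageEvent, int> _events =
        new PriorityQueue<StageEvent, int>();

    public void Submit(StageEvent e)
    {
        // First stage of a new request: low priority (1);
        // following stages: high priority (0).
        _events.Enqueue(e, e.StageIndex == 0 ? 1 : 0);
    }

    public bool TryGetNext(out StageEvent e) => _events.TryDequeue(out e, out _);
}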
  • 21. 18 Workload and testbed
• Two model classes written in ML.NET, run both in ML.NET and in Pretzel:
– 250 Sentiment Analysis (SA) models
– 50 Regression Task (RT) models
• Testbed representing a small production server:
– 2× 8-core Xeon E5-2620 v4 at 2.10 GHz, Hyper-Threading enabled
– 32 GB RAM
– Windows 10
• Four scenarios:
– Memory: evaluate model density thanks to state sharing
– Latency: evaluate request-response latency thanks to the optimisations
– Batch: evaluate scheduler effectiveness
– Heavy load: evaluate mixed scenarios with latency degradation
  • 22. 19 Memory
• Sharing common objects uses up to 7× less memory for SA models and around 40% less for RT models
• Sharing plan parameters also speeds loading: Pretzel loads models 7.4× faster than ML.NET
• Higher model density means higher efficiency and profitability
Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by the maximum usage with Object Store for SA models.
  • 23. 20 Latency
• Hot vs cold scenarios, without and with caching of partial results
• ML.NET cold predictions are 10-300× slower than hot ones
• Pretzel cold predictions are 3-20× slower than hot ones
• Pretzel vs ML.NET: 15× faster when cold, 2.6× faster when hot
• For single-model runs, the average improvement is 2.5×; the gain grows with multiple requests
Figure 9: Latency comparison between ML.NET and Pretzel, with values normalized over ML.NET's P99 hot latency. Cliffs indicate pipelines with big differences in latency.
Figure 10: Latency of Pretzel running SA models without and with caching.
  • 24. 21 Throughput
• Batch size is 1000 inputs, with multiple runs of the same batch
• Without delay batching:
– ML.NET suffers from missed data sharing: even when the data values are the same, model objects are mapped to different memory areas, increasing pressure on the memory subsystem
– Pretzel scales almost linearly with the number of CPU cores
• Delay batching: as in Clipper, let requests tolerate a given delay, and keep batching while latency is smaller than the tolerated delay (sketched below)
– Not supported in ML.NET
– Even a small delay helps reach the optimal batch size, while latency degrades gracefully
Figure 11: Average throughput as the number of cores grows; Pretzel scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled.
Figure 12: Throughput and latency of SA and RT models as the tolerated delay increases.
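A minimal sketch of the delay-batching policy just described: requests accumulate into the current batch while the oldest one has waited less than the user-tolerated delay, then the batch is flushed for scoring. The DelayBatcher type is invented for illustration.

using System;
using System.Collections.Generic;
using System.Diagnostics;

class DelayBatcher<T>
{
    private readonly TimeSpan _maxDelay;       // user-tolerated delay
    private readonly List<T> _batch = new List<T>();
    private readonly Stopwatch _waiting = new Stopwatch();

    public DelayBatcher(TimeSpan maxDelay) { _maxDelay = maxDelay; }

    // Returns the batch to score once the oldest queued request has
    // waited for the tolerated delay; otherwise returns null and
    // keeps batching.
    public IReadOnlyList<T> Add(T request)
    {
        if (_batch.Count == 0) _waiting.Restart();
        _batch.Add(request);
        if (_waiting.Elapsed < _maxDelay) return null;
        var ready = _batch.ToArray();
        _batch.Clear();
        return ready;
    }
}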
  • 25. 22 Heavy load
• All 300 models are loaded upfront
• 3 threads generate requests, 29 threads score them
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully as the load increases
Figure 13: Throughput and latency of Pretzel under heavy load.
  • 27. 24 Conclusions
• We addressed performance and density bottlenecks in ML inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem:
– buffer/locality issues
– code generation
– runtime, hardware-aware scheduling
  • 28. 25 Off-line phase
• Oven currently has pre-written implementations for logical and physical stages: not maintainable in the long run
• We are adding code generation of stages, possibly with hardware-specific templates [10] (a toy example follows)
• Tune code generation for more aggressive optimizations
• Support more model formats beyond ML.NET, like ONNX [11]
[10] K. Krikellas, S. Viglas, et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
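A toy illustration, in the spirit of [10], of what template-based stage generation could look like: a per-stage code template is instantiated with operator-specific snippets and then compiled ahead of time. The template and placeholder names below are invented, not part of Pretzel.

static class StageCodeGen
{
    const string Template = @"
public static void STAGE_NAME(float[] input, float[] output)
{
    for (int i = 0; i < input.Length; i++)
    {
        LOOP_BODY
    }
}";

    // Fill the placeholders with the stage name and the per-record body.
    public static string Instantiate(string stageName, string loopBody)
        => Template.Replace("STAGE_NAME", stageName)
                   .Replace("LOOP_BODY", loopBody);
}

For example, Instantiate("ScaleStage", "output[i] = input[i] * 2.0f;") yields a compilable C# method that an AOT step could turn into a physical stage.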
  • 29. 26 On-line phase
• Distributed version
• The main work is on NUMA awareness, to improve locality and scheduling (see the sketch below):
– Schedule on the NUMA node where most of the shared state resides
– Duplicate hot, shared state objects across nodes for lower latency
– Scale better, avoiding bottlenecks on the coherence bus
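A hedged sketch of the placement heuristic in the first bullet: given, per NUMA node, how many bytes of a model's shared state already reside there, schedule the request on the node holding the most. This is future work, so the code below is a design sketch, not something Pretzel implements today.

using System.Collections.Generic;
using System.Linq;

static class NumaPlacement
{
    // bytesPerNode[n] = bytes of this model's shared state resident on node n.
    public static int PickNode(IReadOnlyDictionary<int, long> bytesPerNode)
        => bytesPerNode.OrderByDescending(kv => kv.Value).First().Key;
}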
  • 30. Pretzel is currently under submission at OSDI '18
QUESTIONS?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it