An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems
Pooyan Jamshidi, Giuliano Casale
Imperial College London
p.jamshidi@imperial.ac.uk
MASCOTS
London
19 Sept 2016
Motivation
1. Many different parameters =>
- large state space
- parameter interactions
2. Defaults are typically used =>
- poor performance
Motivation
Figure: histograms of observations over average read latency (µs) for (a) cass-20 and (b) cass-10, marking the best and worst configurations.
Experiments on
Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
Motivation (Apache Storm)
Figure: WordCount latency (ms) as a function of the number of splitters and the number of counters (cubic interpolation over a finer grid).
In our experiments we
observed improvement
up to 100%
Goal
We assume that each configuration x ∈ X in the configuration space X = Dom(X1) × · · · × Dom(Xd) is valid, i.e., the system accepts the configuration and the corresponding test results in a stable performance behavior. Each Xi may either indicate (i) an integer variable such as the level of parallelism, or (ii) a categorical variable such as the messaging framework, or a Boolean variable such as enabling timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to the possible values that can be assigned to a parameter.

The response with configuration x is denoted by f(x). Throughout, we assume f(·) is latency; however, other metrics for response may be used. We here consider the problem of finding an optimal configuration x* that globally minimizes f(·) over X:

x* = arg min_{x ∈ X} f(x)    (1)

In fact, the response function f(·) is usually unknown or only partially known, i.e., yi = f(xi), xi ⊂ X. In practice, such measurements may contain noise, i.e., yi = f(xi) + εi. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. One solution is based on sampling: it starts with a set of sampled configurations, and the performance associated with this initial sample can deliver a better understanding of f(·) and guide the generation of the next round of samples. If properly guided, the process of sample generation-evaluation-feedback-regeneration will eventually converge on the optimal configuration, but it still requires hundreds of measurements. In this paper, we propose to address the search-efficiency issue: rather than starting the search from scratch, we exploit the knowledge learned on a previous version of the software to accelerate tuning of the current version. This idea is inspired by real software engineering practice: (i) in DevOps, different versions of a system are delivered continuously, (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters), and (iii) different versions of a system often share a similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.

2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input stream.
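Equation (1) under noisy measurements can be sketched on a toy 1-D configuration space. The response surface below is invented purely for illustration; in reality f is unknown and only measurable:

```python
import random

# Hypothetical 1-D response surface standing in for the unknown f(x):
# two basins (near x=3 and x=10), so a purely local search can get trapped.
def f(x):
    return 100 + min((x - 3) ** 2, (x - 10) ** 2 + 2)

def measure(x, rng, noise=1.0):
    # y_i = f(x_i) + eps_i: every measurement is subject to noise.
    return f(x) + rng.gauss(0.0, noise)

def best_config(space, rng, repeats=10):
    # Naive baseline: exhaustively measure every configuration several
    # times and average -- exactly the measurement cost a guided
    # sampling process is meant to avoid.
    def avg(x):
        return sum(measure(x, rng) for _ in range(repeats)) / repeats
    return min(space, key=avg)

x_star = best_config(range(15), random.Random(0))
```

Even this toy space needs 15 × 10 measurements; the premise of guided sampling is that far fewer, carefully chosen samples can locate x*.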
Partially known
Measurements subject to noise
Configuration space
Non-linear interactions
Figure: WordCount latency (ms) vs. number of counters for splitters=2 and splitters=3, and the corresponding response surface over splitters and counters (cubic interpolation over a finer grid).
Response surface is:
- Non-linear
- Non-convex
- Multi-modal
The measurements are subject to variability
Figure: latency (ms, log scale) distributions across the deployments wc, wc+rs, wc+sol, 2wc, and 2wc+rs+sol.
The scale of measurement variability differs across deployments (heteroscedastic noise).
We can only measure the response y at points x; we here consider the problem of finding a configuration x* that minimizes f over X with as few experiments as possible:

x* = arg min_{x ∈ X} f(x)    (1)

f(·) is usually unknown or only partially known, i.e., yi = f(xi), xi ⊂ X. In practice, such measurements may contain noise, i.e., yi = f(xi) + εi. Note that since f(·) is only partially known, finding the optimum is a black-box optimization problem subject to noise. In fact, the problem is non-convex and multi-modal, and finding the global optimum is NP-hard [36]. Therefore, instead of trying to locate a global optimum, we aim for the best possible local optimum within the experimental budget.
It shows the non-convexity, multi-modality and the substantial
performance difference between different configurations.
Fig. 3: WordCount latency (ms) vs. number of counters for splitters=2 and splitters=3; a cut through Figure 2.
It demonstrates that if one tries to minimize latency by acting on just one of these parameters at a time, the resulting configuration may not be globally optimal.
BO4CO architecture
Figure: BO4CO architecture. The Configuration Optimisation Tool (GP model, performance repository) exchanges configuration parameter values, experiment time, and polling interval with an Experimental Suite (Deployment Service, Monitoring, Data Preparation, Tester, Data Broker/Kafka, Doc), which drives a Testbed hosting the System Under Test and a Workload Generator behind a Technology Interface (Storm, Cassandra, Spark).
GP for modeling the black-box response function
Figure: a 1-D GP fit showing the true function, GP mean, GP variance, the observations, the selected point, and the true minimum.
A GP model is composed by its prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

y = f(x) ~ GP(µ(x), k(x, x'))    (2)

where the covariance k(x, x') defines the distance between x and x'. Let S1:t = {(x1:t, y1:t) | yi := f(xi)} be the collection of t experimental data (observations). In this framework, we treat f(x) as a random variable, conditioned on observations S1:t, which is normally distributed with the following posterior mean and variance functions [41]:

µt(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (3)
σt²(x) = k(x, x) + σ² − k(x)ᵀ(K + σ²I)⁻¹k(x)    (4)

where y := y1:t, k(x)ᵀ = [k(x, x1) k(x, x2) . . . k(x, xt)], µ := µ(x1:t), K := k(xi, xj), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.

TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4), and (iii) selects the next configuration to measure.
Motivations:
1. Mean estimates plus variance (uncertainty) for each prediction
2. All computations are tractable linear algebra
3. Good estimates even with few data points
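The posterior equations (3)-(4) reduce to a few lines of linear algebra. Below is a minimal NumPy sketch, assuming a squared-exponential covariance and a constant prior mean; both choices are illustrative, not necessarily the kernel used in the paper:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # Squared-exponential covariance k(x, x') for 1-D inputs; k(x, x) = 1.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, sigma=0.1, mu0=0.0):
    # Eqs. (3)-(4):
    #   mu_t(x)      = mu(x) + k(x)^T (K + sigma^2 I)^{-1} (y - mu)
    #   sigma_t^2(x) = k(x, x) + sigma^2 - k(x)^T (K + sigma^2 I)^{-1} k(x)
    K = rbf(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = rbf(Xs, X)                          # rows are k(x)^T per test point
    mean = mu0 + Ks @ np.linalg.solve(K, y - mu0)
    var = 1.0 + sigma ** 2 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var

X = np.array([0.0, 1.0, 2.0])                    # observed configurations
mean, var = gp_posterior(X, np.sin(X), np.array([1.0, 10.0]))
# variance is small near an observation and large far from all of them
```

This is why the framework is attractive: one Cholesky-style solve per refit, yielding both a mean estimate and its uncertainty.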
Sparsity of Effects
- Correlation-based feature selector
- Merit is used to select subsets that are highly correlated with the response variable
- At most 2-3 parameters were strongly interacting with each other
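The merit of a correlation-based feature selector can be sketched as follows. This assumes Hall's CFS merit formula with plain Pearson correlation; the paper's exact correlation measure may differ:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def merit(subset, columns, response):
    # Hall's CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the
    # mean |parameter-response| correlation and r_ff the mean
    # |parameter-parameter| correlation within the subset.
    k = len(subset)
    r_cf = sum(abs(pearson(columns[i], response)) for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: columns are parameter settings across measured configurations.
cols = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
resp = [1.0, 2.0, 3.0, 4.0]
```

Subsets whose parameters track the response but not each other score highest, which is how the main factors in Table I are identified.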
TABLE I: Sparsity of effects on 5 experiments where we have varied different subsets of parameters and used different testbeds. Note that these are the datasets we experimentally measured on the benchmark systems and use for the evaluation; more details, including the results for 6 more experiments, are in the appendix.

Topol. | Parameters | Main factors | Merit | Size | Testbed
1 wc(6D) | 1-spouts, 2-max spout, 3-spout wait, 4-splitters, 5-counters, 6-netty min wait | {1, 2, 5} | 0.787 | 2880 | C1
2 sol(6D) | 1-spouts, 2-max spout, 3-top level, 4-netty min wait, 5-message size, 6-bolts | {1, 2, 3} | 0.447 | 2866 | C2
3 rs(6D) | 1-spouts, 2-max spout, 3-sorters, 4-emit freq, 5-chunk size, 6-message size | {3} | 0.385 | 3840 | C3
4 wc(3D) | 1-max spout, 2-splitters, 3-counters | {1, 2} | 0.480 | 756 | C4
5 wc(5D) | 1-spouts, 2-splitters, 3-counters, 4-buffer-size, 5-heap | {1} | 0.851 | 1080 | C5
Experiments on:
1. C1: OpenNebula (X)
2. C2: Amazon EC2 (Y)
3. C3: OpenNebula (3X)
4. C4: Amazon EC2 (2Y)
5. C5: Microsoft Azure (X)
Fig. 5: An example of a 1D GP model: GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance (true function, GP surrogate mean estimate, observations).
Algorithm 1: BO4CO
Input: Configuration space X, maximum budget Nmax, response function f, kernel function Kθ, hyper-parameters θ, design sample size n, learning cycle Nl
Output: Optimal configuration x* and learned model M
1: choose an initial sparse design (lhd) to find an initial design sample D = {x1, . . . , xn}
2: obtain performance measurements of the initial design, yi ← f(xi) + εi, ∀xi ∈ D
3: S1:n ← {(xi, yi)} for i = 1..n; t ← n + 1
4: M(x|S1:n, θ) ← fit a GP model to the design    ▷ Eq. (3)
5: while t ≤ Nmax do
6:   if (t mod Nl = 0) then θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration xt by optimizing the selection criterion over the estimated response surface given the data: xt ← arg max_x u(x|M, S1:t−1)    ▷ Eq. (9)
8:   obtain performance for the new configuration xt: yt ← f(xt) + εt
9:   augment the data: S1:t = {S1:t−1, (xt, yt)}
10:  M(x|S1:t, θ) ← re-fit a new GP model    ▷ Eq. (7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S1:Nmax
14: return x* and the learned model M(x)
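The loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified sketch, assuming a toy 1-D response surface, a fixed squared-exponential kernel (no hyper-parameter learning, so step 6 is skipped), and an LCB-style selection criterion:

```python
import numpy as np

def f(x):
    # Hypothetical response surface: unknown to the algorithm, only measurable.
    return np.sin(3 * x) + 0.5 * x

def rbf(a, b, length=0.3):
    # Fixed squared-exponential kernel (hyper-parameter learning skipped).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def posterior(X, y, Xs, sigma=0.05):
    # GP posterior mean/std per Eqs. (3)-(4), with a zero prior mean.
    K = rbf(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = rbf(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.sqrt(np.clip(var, 1e-12, None))

rng = np.random.default_rng(1)
space = np.linspace(0.0, 2.0, 101)        # discretized configuration space X
X = np.array([0.1, 1.0, 1.9])             # step 1: sparse initial design
y = f(X) + rng.normal(0.0, 0.05, X.size)  # step 2: noisy measurements

kappa = 2.0
for _ in range(10):                       # steps 5-12: sequential design loop
    mean, std = posterior(X, y, space)
    x_next = space[np.argmin(mean - kappa * std)]    # step 7: LCB selection
    X = np.append(X, x_next)                         # steps 8-9: measure, augment
    y = np.append(y, f(x_next) + rng.normal(0.0, 0.05))

x_star = X[np.argmin(y)]                  # step 13: best configuration found
```

Each iteration spends one measurement where the model is either promisingly low (mean) or poorly understood (variance), which is the budget-saving behavior the algorithm relies on.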
Figure: (a) Design of Experiment: an initial experiment over the configuration space (here shown exhaustively) yields an empirical model, i.e., a GP fit to the true response function. (b) Sequential Design: the selection criteria are evaluated over the configuration domain, a new point is selected and measured, and a new GP fit is produced.
Acquisition function: after obtaining the measurements, BO4CO fits a GP model to update its belief about the underlying response (step 4 of Algorithm 1). The while loop iteratively updates this belief until the budget runs out: given the observations S1:t = {(xi, yi)} for i = 1..t, a prior distribution Pr(f) and the likelihood Pr(S1:t|f) form the posterior Pr(f|S1:t) ∝ Pr(S1:t|f) Pr(f). GPs are such distributions [37], specified by their mean and covariance (see Section III-E1):

y = f(x) ~ GP(µ(x), k(x, x'))    (3)

where

µt(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (7)
σt²(x) = k(x, x) + σ² − k(x)ᵀ(K + σ²I)⁻¹k(x)    (8)

These posterior functions are used to select the next point xt+1 as detailed in Section III-C.

C. Configuration selection criteria
The selection criterion is defined as u : X → R; it selects the xt+1 ∈ X at which f(·) should be evaluated next (step 7):

xt+1 = arg max_{x ∈ X} u(x|M, S1:t)    (9)
Storm Architecture
Figure: logical view (Spout A → Bolt A → Bolt B, connected by pipes) and physical view (Workers A, B, C with in/out queues and sockets).
Word Count Architecture (CPU intensive)
Figure: File → Stream to Kafka → Kafka Topic → Kafka Spout → Splitter Bolt (sentence) → Counter Bolt (word), producing counts such as [paintings, 3], [poems, 60], [letter, 75].
Rolling Sort Architecture (memory intensive)
Figure: Twitter Stream → Twitter to Kafka → Kafka Topic → Kafka Spout (tweet) → RollingCount Bolt (hashtags) → Intermediate Ranking Bolt (hashtag, count) → Ranking Bolt (ranking) → trending topics.
Applications:
• Fraud detection
• Trending topics
Experimental results
Figure: absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift on (a) WordCount(3D) and (b) WordCount(5D).
- 30 runs; we report average performance.
- Yes, we did full factorial measurements, so we know where the global minimum is…
Experimental results
Figure: absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift on (a) SOL(6D) and (b) RollingSort(6D).
Experimental results
Figure: absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift on (a) Branin(2D) and (b) Dixon(2D).
Model accuracy (comparison with polynomial regression models)
Figure: absolute percentage error (%, log scale) of BO4CO vs. polyfit1 through polyfit5.
Prediction accuracy over time
Figure: prediction error vs. iteration (log scale) for BO4CO, polyfit1, M5Tree, RegressionTree, M5Rules, LWP(GAU), and PRIM.
Exploitation vs exploration
Figure: absolute error vs. iteration (log scale) for BO4CO(adaptive), BO4CO(µ:=0), and BO4CO with fixed κ ∈ {0.1, 1, 6, 8}; and the adaptive κ schedule over 10,000 iterations for ϵ = 1, 0.1, and 0.01.
The selection criterion decides the next configuration to measure. Intuitively, we want to select the configuration with the minimum response. This is done using a utility function u : X → R that determines which xt+1 ∈ X should be evaluated next:

xt+1 = arg max_{x ∈ X} u(x|M, S¹1:t)    (11)

The selection criterion depends on the MTGP model M through its predictive mean µt(xt) and variance σt²(xt), conditioned on observations S¹1:t. TL4CO uses the Lower Confidence Bound (LCB) [24]:

uLCB(x|M, S¹1:t) = arg min_{x ∈ X} µt(x) − κσt(x)    (12)

where κ is an exploitation-exploration parameter. For instance, if we require a near-optimal configuration, we set a low value for κ to take the most out of the predictive mean. However, if we are looking for a globally optimal one, we can set a high value in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, κ can start with a relatively lower value at the early iterations compared to BO4CO, since the former provides a better estimate of mean and variance and therefore contains more information at the early stages.

TL4CO output. Once the Nmax different configurations of the system under test are measured, the TL4CO algorithm terminates. Finally, TL4CO produces the outputs, including the optimal configuration (step 14 in Algorithm 1) as well as the learned model.
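The effect of κ in the LCB criterion can be seen on a toy posterior: a small κ trusts the mean (exploitation), while a large κ chases uncertain candidates (exploration). The means and standard deviations below are invented for illustration:

```python
import numpy as np

# Toy posterior over five candidate configurations (invented numbers):
# predictive means and standard deviations from a fitted model.
mu = np.array([1.00, 1.20, 1.10, 1.30, 1.25])
sd = np.array([0.05, 0.10, 0.08, 0.60, 0.50])

def lcb_pick(kappa):
    # Eq. (12): the next configuration minimizes mu(x) - kappa * sigma(x).
    return int(np.argmin(mu - kappa * sd))

exploit = lcb_pick(0.1)   # small kappa trusts the mean -> candidate 0
explore = lcb_pick(6.0)   # large kappa chases uncertainty -> candidate 3
```

Candidate 0 has the lowest mean, but candidate 3 has so much uncertainty that a large κ makes it worth measuring, which is exactly the local-minima-skipping behavior described above.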
Runtime overhead
Figure: elapsed time (s) per iteration for WordCount (3D, 5D, 6D), SOL (6D), and RollingSort (6D).
- The computation time is higher for larger datasets than for smaller ones.
- The computation time increases over time, since the matrix for the Cholesky inversion grows with each iteration.
Figure 5 illustrates this approach using a 1-dimensional response. The curve in blue is the unknown true response, whereas the mean is shown in yellow and the 95% confidence interval at each point in the shaded red area. The stars indicate experimental measurements (or, interchangeably, observations). Some points x ∈ X have a large confidence interval due to the lack of observations in their neighborhood, while others have a narrow confidence. The main motivation behind the choice of Bayesian Optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra. In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for prediction of the posterior distribution of response functions.
Key takeaways
- A principled way for
  - locating optimal configurations
  - carefully choosing where to sample
  - by sequentially reducing uncertainty in the response surface
  - in order to reduce the number of measurements
- GPs appeared to be more accurate than the other machine learning regression models in our experiments.
- Applications to SPSs, Batch, and NoSQL systems (see the BO4CO GitHub page for more details).
Acknowledgement: BO4CO is part of the DevOps pipeline in the H2020 DICE project.
Figure: DICE IDE and methodology (DPIM, DTSM, DDSM, TOSCA), with plugins for simulation, verification, and optimization; deployment, configuration, testing, continuous integration, and fault injection; monitoring, anomaly detection, and trace checking; and iterative enhancement of Data-Intensive Applications (DIA) on Big Data technologies and private/public clouds (WP1-WP6, demonstrators).
Code and data: https://github.com/dice-project/DICE-Configuration-BO4CO
result analysis for deep leakage from gradientsresult analysis for deep leakage from gradients
result analysis for deep leakage from gradients
 
PhysRevE.89.042911
PhysRevE.89.042911PhysRevE.89.042911
PhysRevE.89.042911
 
Metody logiczne w analizie danych
Metody logiczne w analizie danych Metody logiczne w analizie danych
Metody logiczne w analizie danych
 
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
 

More from Pooyan Jamshidi

Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery ApproachLearning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
Pooyan Jamshidi
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Pooyan Jamshidi
 
Transfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable SoftwareTransfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable Software
Pooyan Jamshidi
 

More from Pooyan Jamshidi (16)

Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery ApproachLearning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
 
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
 A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn... A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
 
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Transfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning SystemsTransfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning Systems
 
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 
Learning to Sample
Learning to SampleLearning to Sample
Learning to Sample
 
Integrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsIntegrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of Robots
 
Transfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable SoftwareTransfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable Software
 
Architectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based SoftwareArchitectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based Software
 
Production-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software ArchitectProduction-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software Architect
 
Architecting for Scale
Architecting for ScaleArchitecting for Scale
Architecting for Scale
 
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
 
Fuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringFuzzy Control meets Software Engineering
Fuzzy Control meets Software Engineering
 
Autonomic Resource Provisioning for Cloud-Based Software
Autonomic Resource Provisioning for Cloud-Based SoftwareAutonomic Resource Provisioning for Cloud-Based Software
Autonomic Resource Provisioning for Cloud-Based Software
 

Recently uploaded

Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
UK Journal
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 

An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems

  • 1. An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems Pooyan Jamshidi, Giuliano Casale Imperial College London p.jamshidi@imperial.ac.uk MASCOTS London 19 Sept 2016
  • 2. Motivation 1- Many different parameters => large state space and parameter interactions. 2- Defaults are typically used => poor performance.
  • 3. Motivation Histograms of observations vs. average read latency (µs), with the best and worst configurations marked: (a) cass-20, (b) cass-10. Experiments on Apache Cassandra: - 6 parameters, 1024 configurations - Average read latency - 10 million records (cass-10) - 20 million records (cass-20)
  • 4. Motivation (Apache Storm) Latency (ms) surface over number of counters × number of splitters (cubic interpolation over a finer grid). In our experiments we observed improvement up to 100%.
  • 5. Goal We consider the problem of finding an optimal configuration x* that globally minimizes the response function f(·) over the configuration space X = Dom(X1) × · · · × Dom(Xd); throughout, f is latency, but other response metrics may be used:

x* = arg min_{x ∈ X} f(x)   (1)

In fact, the response function f(·) is usually unknown or only partially known, i.e., yi = f(xi), xi ⊂ X, and in practice such measurements may contain noise, i.e., yi = f(xi) + ε. Finding the optimal configuration is thus a black-box optimization problem subject to noise [27, 33], which is considerably harder than deterministic optimization. A common solution is based on sampling: it starts with a number of sampled configurations; the performance associated with these initial samples delivers an understanding of f(·) and guides the generation of the next samples. If properly guided, this configuration-evaluation-feedback-regeneration process converges toward the optimum, but it still requires hundreds of experiments. In this paper, we propose to accelerate the search by reusing the knowledge learned on earlier versions of the software rather than starting the search from scratch. This idea is natural in real software engineering: (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters); (iii) different versions of a system often share a similar business logic. To the best of our knowledge, only one study [9] explores transfer learning in system configuration: the authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other, similar systems, but the learning is limited to the structure of the Bayesian network. We introduce a method that reuses not only a previously learned model but also the valuable raw data, so we are not limited by the accuracy of the learned model; moreover, we do not consider Bayesian networks and instead focus on MTGPs. A motivating example: WordCount (cf. Figure 1), a popular benchmark [12], features a three-layer architecture that counts the number of words in an incoming stream. Key problem characteristics: partially known response, measurements subject to noise, configuration space.
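The black-box setup above can be sketched numerically. Everything here is illustrative: `latency` stands in for the unknown true response f(·) (in reality it is only measurable, never known), `measure` for running one experiment, and the parameter ranges and noise level are made-up assumptions:

```python
import itertools
import random

random.seed(0)

# Hypothetical configuration space X = Dom(X1) x Dom(X2): two Storm-style
# parameters, e.g. number of splitters and number of counters.
space = list(itertools.product(range(1, 4), range(1, 19)))

def latency(x):
    """The unknown 'true' response f(x); shown only so the sketch runs."""
    splitters, counters = x
    return 100 + 5 * (counters - 10) ** 2 / (1 + splitters)

def measure(x):
    """One experiment: y = f(x) + eps (measurements are subject to noise)."""
    return latency(x) + random.gauss(0, 2.0)

# Naive black-box search: evaluate a sampled subset, keep the best observed point.
samples = random.sample(space, 12)
observations = {x: measure(x) for x in samples}
best = min(observations, key=observations.get)
print("best sampled configuration:", best)
```

The point of the slides that follow is exactly that this unguided sampling wastes experiments; a model-guided sampler needs far fewer measurements.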
  • 6. Non-linear interactions Latency (ms) vs. number of counters for splitters=2 and splitters=3, and the latency surface over counters × splitters (cubic interpolation over a finer grid). Response surface is: - Non-linear - Non-convex - Multi-modal
  • 7. The measurements are subject to variability Latency (ms, log scale) across the deployments wc, wc+rs, wc+sol, 2wc, 2wc+rs+sol: the scale of measurement variability differs between deployments (heteroscedastic noise). We seek the x* that minimizes f over X with as few experiments as possible; since f(·) is only partially known and measurements contain noise, yi = f(xi) + εi, this is a black-box optimization problem subject to noise. In fact, the problem is non-convex, multi-modal, and NP-hard [36]; therefore, rather than guaranteeing a global optimum, we aim for the best possible local optimum within the experimental budget. Fig. 3 (WordCount latency, a cut through Figure 2) shows the non-convexity, multi-modality, and the substantial performance difference between configurations, and demonstrates that minimizing latency by acting on just one parameter at a time yields suboptimal results.
  • 8. BO4CO architecture Components: Configuration Optimisation Tool (GP model) backed by a performance repository; Monitoring; Deployment Service; Data Preparation; Experimental Suite with a Testbed; Data Broker (Kafka); Tester; System Under Test; Workload Generator; and a Technology Interface for Storm, Cassandra, and Spark. Inputs: experiment time, polling interval, and configuration parameters with their values.
  • 9. GP for modeling blackbox response function (figure: true function, GP mean, GP variance, observations, selected point, true minimum.) A GP model is composed of its prior mean (µ(·) : X → R) and a covariance function (k(·,·) : X × X → R) [41]:

y = f(x) ~ GP(µ(x), k(x, x′)),   (2)

where the covariance k(x, x′) defines the distance between x and x′. Let S1:t = {(x1:t, y1:t) | yi := f(xi)} be the collection of t experimental data (observations). In this framework, we treat f(x) as a random variable, conditioned on the observations S1:t, which is normally distributed with the following posterior mean and variance functions [41]:

µt(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)   (3)
σt²(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹k(x)   (4)

where y := y1:t, k(x)ᵀ = [k(x, x1) k(x, x2) . . . k(x, xt)], µ := µ(x1:t), K := k(xi, xj), and I is the identity matrix. Motivations: 1- mean estimates + variance 2- all computations are linear algebra 3- good estimations when few data. The shortcoming of BO4CO is that it cannot exploit observations from other versions of the system and therefore cannot be applied in DevOps; TL4CO extends it with multi-task GPs (MTGPs) that exploit observations from previous versions of the system under test (Algorithm 1, Figure 4): it (i) selects the most informative past observations (Section 3.3), (ii) fits a model to existing data based on kernel learning (Section 3.4), and (iii) selects the next configuration.
  • 10. Sparsity of Effects • Correlation-based feature selector • Merit is used to select subsets that are highly correlated with the response variable • At most 2-3 parameters were strongly interacting with each other

TABLE I: Sparsity of effects on 5 experiments where we have varied different subsets of parameters and used different testbeds. Note that these are the datasets we experimentally measured on the benchmark systems and use for the evaluation; more details, including the results for 6 more experiments, are in the appendix.
1 wc(6D) — parameters: 1-spouts, 2-max spout, 3-spout wait, 4-splitters, 5-counters, 6-netty min wait; main factors {1, 2, 5}; merit 0.787; size 2880; testbed C1
2 sol(6D) — parameters: 1-spouts, 2-max spout, 3-top level, 4-netty min wait, 5-message size, 6-bolts; main factors {1, 2, 3}; merit 0.447; size 2866; testbed C2
3 rs(6D) — parameters: 1-spouts, 2-max spout, 3-sorters, 4-emit freq, 5-chunk size, 6-message size; main factors {3}; merit 0.385; size 3840; testbed C3
4 wc(3D) — parameters: 1-max spout, 2-splitters, 3-counters; main factors {1, 2}; merit 0.480; size 756; testbed C4
5 wc(5D) — parameters: 1-spouts, 2-splitters, 3-counters, 4-buffer-size, 5-heap; main factors {1}; merit 0.851; size 1080; testbed C5

Experiments on: 1. C1: OpenNebula (X) 2. C2: Amazon EC2 (Y) 3. C3: OpenNebula (3X) 4. C4: Amazon EC2 (2Y) 5. C5: Microsoft Azure (X)
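The merit score of a correlation-based feature selector can be sketched as follows. This assumes Hall's standard CFS heuristic (merit grows with feature-response correlation and shrinks with feature-feature redundancy); the toy data is made up for illustration:

```python
import math
import statistics

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

def cfs_merit(features, response):
    """CFS merit of a subset: k * r_cf / sqrt(k + k(k-1) * r_ff),
    where r_cf is mean |feature-response| correlation and
    r_ff is mean |feature-feature| correlation."""
    k = len(features)
    r_cf = statistics.mean(abs(pearson(f, response)) for f in features)
    if k == 1:
        return r_cf
    r_ff = statistics.mean(abs(pearson(features[i], features[j]))
                           for i in range(k) for j in range(i + 1, k))
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: two correlated parameters and a latency response.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
latency = [10, 12, 19, 22, 31, 33]
print(cfs_merit([x1], latency), cfs_merit([x1, x2], latency))
```

Ranking subsets by this merit, as in Table I, surfaces the few main factors out of the full parameter set.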
  • 11. Fig. 5: An example of 1D GP model: GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance.

Algorithm 1: BO4CO
Input: Configuration space X, maximum budget Nmax, response function f, kernel function Kθ, hyper-parameters θ, design sample size n, learning cycle Nl
Output: Optimal configuration x* and learned model M
1: choose an initial sparse design (lhd) to find an initial design sample D = {x1, . . . , xn}
2: obtain performance measurements of the initial design, yi ← f(xi) + εi, ∀xi ∈ D
3: S1:n ← {(xi, yi)}, i = 1..n; t ← n + 1
4: M(x|S1:n, θ) ← fit a GP model to the design (Eq. 3)
5: while t ≤ Nmax do
6:   if (t mod Nl = 0): θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration xt by optimizing the selection criteria over the estimated response surface given the data, xt ← arg max_x u(x|M, S1:t−1) (Eq. 9)
8:   obtain performance for the new configuration xt, yt ← f(xt) + εt
9:   augment the configuration data S1:t = {S1:t−1, (xt, yt)}
10:  M(x|S1:t, θ) ← re-fit a new GP model (Eq. 7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S1:Nmax
14: return M(x)

(a) Design of Experiment; (b) Sequential Design.
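Algorithm 1's loop can be sketched end to end. This is a sketch under assumptions, not the tool itself: a 1-D toy objective stands in for the system under test, an inlined squared-exponential GP for the fitted model, and step 6 (hyper-parameter re-learning) is omitted; the selection criterion of step 7 is written as an LCB minimization, anticipating slide 19:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete configuration space X (1-D for illustration) and a noisy objective.
X = np.linspace(0, 5, 60)
f = lambda x: np.sin(3 * x) + 0.3 * x            # unknown "true" response
measure = lambda x: f(x) + rng.normal(0, 0.05)   # steps 2/8: y = f(x) + eps

def k(a, b):
    """Squared-exponential kernel (illustrative choice)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / 0.5 ** 2)

def gp(xs, ys, xq, noise=0.05):
    """Posterior mean/std over query points xq given observations (xs, ys)."""
    K = k(xs, xs) + noise ** 2 * np.eye(len(xs))
    ks = k(xq, xs)
    mu = ks @ np.linalg.solve(K, ys)
    var = 1.0 - np.einsum('ij,ij->i', ks, np.linalg.solve(K, ks.T).T)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

# Steps 1-3: initial sparse design (plain random here instead of lhd).
xs = list(rng.choice(X, size=5, replace=False))
ys = [measure(x) for x in xs]
# Steps 5-12: sequential selection guided by the model.
for t in range(20):
    mu, sd = gp(np.array(xs), np.array(ys), X)
    xt = X[np.argmin(mu - 1.5 * sd)]             # step 7: LCB with kappa = 1.5
    xs.append(xt); ys.append(measure(xt))        # steps 8-9
x_star = xs[int(np.argmin(ys))]                  # step 13
print("best configuration found:", x_star)
```

With 25 measurements over a 60-point grid, the model-guided loop concentrates its samples around the latency valley instead of spreading them uniformly.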
  • 12. Acquisition function After obtaining the measurements, BO4CO fits a GP model to express its belief about the underlying response function (step 4 of Algorithm 1). The while loop updates this belief until the budget runs out: given the observations S1:t = {(xi, yi)}, i = 1..t, a prior distribution Pr(f) and the likelihood Pr(S1:t|f) form the posterior Pr(f|S1:t) ∝ Pr(S1:t|f) Pr(f). GPs are distributions over functions [37], specified by their mean and covariance (see Section III-E1): y = f(x) ~ GP(µ(x), k(x, x′)), where

µt(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)   (7)
σt²(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹k(x)   (8)

These posterior functions are used to select the next point xt+1 as detailed in Section III-C. Configuration selection criteria: the selection criterion is defined as u : X → R and selects the xt+1 ∈ X at which f(·) should be evaluated next (step 7):

xt+1 = argmax_{x ∈ X} u(x|M, S1:t)   (9)

(Figure: true response function and current GP fit; criteria evaluation and the newly selected point; the new GP fit after measuring it.)
  • 13. Storm Architecture Logical view: Spout A feeds Bolt A and Bolt B through pipes. Physical view: Workers A, B, and C, each with in/out queues, connected by sockets. Word Count Architecture (CPU intensive): Stream to Kafka writes sentences to a Kafka topic; a Kafka Spout emits sentences to a Splitter Bolt (words) and then to a Counter Bolt, e.g., [paintings, 3], [poems, 60], [letter, 75]. Rolling Sort Architecture (memory intensive): Twitter to Kafka writes tweets to a Kafka topic; a Kafka Spout feeds a RollingCount Bolt (hashtag, count), an Intermediate Ranking Bolt (ranking), and a Ranking Bolt (trending topics). Applications: • Fraud detection • Trending topics
  • 14. Experimental results Absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift: (a) WordCount(3D), (b) WordCount(5D). - 30 runs, report average performance - Yes, we did full factorial measurements and we know where the global min is…
  • 15. Experimental results Absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift: (a) SOL(6D), (b) RollingSort(6D).
  • 16. Experimental results Absolute error vs. iteration (log scale) for BO4CO, SA, GA, HILL, PS, and Drift: (a) Branin(2D), (b) Dixon(2D).
  • 17. Model accuracy (comparison with polynomial regression models) Absolute percentage error [%] of BO4CO vs. polyfit1 through polyfit5 (log scale).
  • 18. Prediction accuracy over time Prediction error vs. iteration for BO4CO, polyfit1, M5Tree, RegressionTree, M5Rules, LWP(GAU), and PRIM.
  • 19. Exploitation vs exploration
[figure: absolute error vs. iteration for BO4CO(adaptive), BO4CO(µ:=0) and fixed κ ∈ {0.1, 1, 6, 8}; second panel: adaptive κ schedules over 10,000 iterations for ε = 1, 0.1, 0.01]

The acquisition function determines the next configuration to measure. Intuitively, we want to select the configuration with the minimum response. This is done using a utility function u : X → R that determines the configuration x_{t+1} ∈ X to be evaluated next:

    x_{t+1} = argmax_{x ∈ X} u(x | M, S^1_{1:t})    (11)

The selection criterion depends on the MTGP model M through its predictive mean µ_t(x) and variance σ²_t(x), conditioned on the observations S^1_{1:t}. TL4CO uses the Lower Confidence Bound (LCB) [24]:

    u_LCB(x | M, S^1_{1:t}) = argmin_{x ∈ X} µ_t(x) − κσ_t(x)    (12)

where κ is an exploitation-exploration parameter. If we require a near-optimal configuration, we set κ to a low value to take the most out of the predictive mean; if we are looking for a globally optimal one, we can set a high value in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. The figure shows that in TL4CO, κ can start at a relatively lower value in the early iterations compared to BO4CO, since the former provides a better estimate of mean and variance and therefore carries more information at the early stages.

TL4CO output: Once the N_max different configurations of the system under test are measured, the TL4CO algorithm terminates. Finally, TL4CO produces its outputs, including the optimal configuration (step 14 in Algorithm 1).
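The LCB rule in Eq. (12) reduces to a few lines of array code. A minimal sketch, assuming a finite candidate set and an already-computed GP posterior; the helper names and the toy `mu`/`sigma` values are illustrative, not measurements or code from the paper:

```python
import numpy as np

def lcb(mu, sigma, kappa):
    """Lower Confidence Bound (Eq. 12): mu_t(x) - kappa * sigma_t(x).
    Small kappa exploits the predictive mean; large kappa rewards
    high-variance (unexplored) regions."""
    return mu - kappa * sigma

def next_config(candidates, mu, sigma, kappa=2.0):
    """Pick the candidate configuration minimising the LCB (Eq. 11)."""
    return candidates[int(np.argmin(lcb(mu, sigma, kappa)))]

# toy posterior over 5 candidate configurations
candidates = np.array([0, 1, 2, 3, 4])
mu = np.array([5.0, 4.0, 4.5, 6.0, 5.5])      # predicted latency
sigma = np.array([0.1, 0.1, 2.0, 0.1, 0.1])   # predictive std-dev

print(next_config(candidates, mu, sigma, kappa=0.0))  # pure exploitation: picks 1
print(next_config(candidates, mu, sigma, kappa=2.0))  # exploration wins: picks 2
```

With κ = 0 the rule trusts the mean and picks configuration 1; raising κ to 2 makes the high-uncertainty configuration 2 win, which is exactly the exploitation-exploration trade-off the slide discusses.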
  • 20. Runtime overhead
[figure: elapsed time (s) vs. iteration for WordCount (3D, 5D, 6D), SOL (6D) and RollingSort (6D)]
- The computation time is higher for larger datasets and higher-dimensional configuration spaces.
- The computation time also increases over the iterations, since the matrix factorised in the Cholesky inversion grows with the number of observations.

[figure: a GP fit to a 1-dimensional response; the blue curve is the unknown true response, the predictive mean is shown in yellow, and the 95% confidence interval at each point in the shaded red area; the stars indicate experimental measurements (or, interchangeably, observations)]

Some points x ∈ X have a large confidence interval due to the lack of observations in their neighbourhood, while others have a narrow confidence. The main motivation behind the choice of Bayesian Optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) to predict the posterior distribution of response functions. A GP model is composed of its prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

    y = f(x) ~ GP(µ(x), k(x, x'))    (2)

where the covariance k(x, x') defines the distance between x and x'. Let S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} be the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

    µ_t(x) = µ(x) + k(x)^T (K + σ² I)^{-1} (y − µ)    (3)
    σ²_t(x) = k(x, x) + σ² − k(x)^T (K + σ² I)^{-1} k(x)    (4)

where y := y_{1:t} and k(x)^T = [k(x, x_1) k(x, x_2) … k(x, x_t)].
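Equations (3) and (4) above can be sketched directly in NumPy. A minimal example, assuming a zero-mean prior (µ(x) = 0), an RBF covariance, and 1-D inputs; the function names are illustrative, not the BO4CO code. The Cholesky factorisation used here is the step whose cost grows with the number of observations, as noted on the runtime-overhead slide:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential covariance k(x, x') for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2, ell=1.0):
    """Posterior mean and variance of a zero-mean GP (Eqs. 3-4 with
    mu(x) = 0), computed via a Cholesky factorisation of K + s^2 I."""
    K = rbf(x_train, x_train, ell) + noise**2 * np.eye(len(x_train))
    Ks = rbf(x_train, x_test, ell)                 # k(x) per test point
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # (K+s^2 I)^-1 y
    mu = Ks.T @ alpha                              # Eq. 3
    v = np.linalg.solve(L, Ks)
    var = 1.0 + noise**2 - np.sum(v * v, axis=0)   # Eq. 4; rbf(x, x) = 1
    return mu, var

x = np.array([0.0, 1.0, 2.0])                      # measured configurations
y = np.array([1.0, 0.5, 1.5])                      # observed responses
mu, var = gp_posterior(x, y, np.array([1.0, 5.0]))
# the variance collapses at the measured point x = 1.0 and reverts to the
# prior variance far away (x = 5.0): the wide/narrow confidence intervals
# described above.
```

The wide interval at x = 5.0 and the near-zero one at x = 1.0 are exactly what drives the exploration term of the LCB criterion.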
  • 21. Key takeaways
- A principled way of locating optimal configurations: carefully choose where to sample, sequentially reducing uncertainty in the response surface, in order to reduce the number of measurements.
- GPs appeared to be more accurate than the other machine-learning regression models in our experiments.
- Applications to SPSs, batch, and NoSQL systems (see the BO4CO GitHub page for more details).
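The sequential measure-fit-acquire loop summarised in the takeaways can be sketched end to end. A minimal, self-contained illustration under stated assumptions (1-D RBF kernel with unit lengthscale, zero-mean prior, fixed κ); the helper name `bo4co_sketch` and the toy objective are hypothetical, not the authors' implementation:

```python
import numpy as np

def bo4co_sketch(f, X, n_init=3, budget=12, kappa=2.0, noise=1e-2, seed=0):
    """Measure a few random configurations, then repeatedly condition a
    zero-mean RBF GP on everything observed so far and measure the LCB
    minimiser, spending each measurement where it most reduces
    uncertainty about the location of the optimum."""
    rng = np.random.default_rng(seed)
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
    idx = [int(i) for i in rng.choice(len(X), n_init, replace=False)]
    y = [f(X[i]) for i in idx]
    for _ in range(budget - n_init):
        xs, ys = X[idx], np.array(y)
        K = k(xs, xs) + noise**2 * np.eye(len(xs))
        Ks = k(xs, X)
        mu = Ks.T @ np.linalg.solve(K, ys)                    # posterior mean
        var = np.maximum(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 0.0)
        nxt = int(np.argmin(mu - kappa * np.sqrt(var)))       # LCB, Eq. 12
        idx.append(nxt)
        y.append(f(X[nxt]))                                   # new measurement
    best = int(np.argmin(y))
    return X[idx[best]], y[best]

# toy response surface with its optimum at x = 2.5
X = np.linspace(0.0, 4.0, 41)
x_best, y_best = bo4co_sketch(lambda x: (x - 2.5) ** 2, X)
```

With a budget of 12 measurements over 41 candidate configurations, the loop homes in on the region of the optimum without sweeping the whole grid, which is the point of the first takeaway.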
  • 22. Acknowledgement: BO4CO is part of the DevOps pipeline in the H2020 DICE project.
[diagram: the DICE IDE and methodology (DPIM, DTSM, DDSM profiles; TOSCA) with simulation, verification, and optimisation plugins, plus deployment, configuration, testing, monitoring, anomaly detection, tracing, iterative enhancement, continuous integration, and fault-injection tools (WP1-WP6) for data-intensive applications (DIAs) on private/public clouds]
Code and data: https://github.com/dice-project/DICE-Configuration-BO4CO