Pooyan Jamshidi CHOOSE Talk 2016-11-01
Big data architectures have been gaining momentum in recent years. For instance, Twitter uses stream processing frameworks like Storm to analyse billions of tweets per minute and learn the trending topics. However, architectures that process big data involve many different components interconnected via semantically different connectors, which makes it difficult for software architects to refactor the initial designs. As an aid to designers and developers, we developed OSTIA (On-the-fly Static Topology Inference Analysis), which allows: (a) visualizing big data architectures for the purpose of design-time refactoring while maintaining constraints that would only be evaluated at later stages such as deployment and run-time; (b) detecting the occurrence of common anti-patterns across big data architectures; (c) exploiting software verification techniques on the elicited architectural models. In the lecture, OSTIA will be shown on three industrial-scale case studies.
See: http://www.choose.s-i.ch/events/jamshidi-2016/
3. Motivation
[Figure: number of observations vs. average read latency (µs) for (a) cass-20 (latency axis ×10^4) and (b) cass-10, with the best and worst configurations marked.]
Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
4. Motivation (Apache Storm)
[Figure: latency (ms) as a function of number of counters and number of splitters; cubic interpolation over a finer grid.]
In our experiments we
observed improvement
up to 100%
5. Goal
[Paper excerpt: problem statement. Cropped passages are marked [...].]

[...] ($X_i$). In general, $X_i$ may either indicate (i) integer variable such as level of parallelism or (ii) categorical variable such as messaging frameworks or Boolean variable such as [...] timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to possible values that can be assigned to a parameter.

We assume that each configuration $x \in X$ in the configuration space $X = Dom(X_1) \times \cdots \times Dom(X_d)$ is valid, i.e., the system accepts this configuration and the corresponding test results in a stable performance behavior. The response with configuration $x$ is denoted by $f(x)$. Throughout, we assume that $f(\cdot)$ is latency, however, other metrics for response may be used. We here consider the problem of finding an optimal configuration $x^*$ that globally minimizes $f(\cdot)$ over $X$:

$$x^* = \arg\min_{x \in X} f(x) \quad (1)$$

In fact, the response function $f(\cdot)$ is usually unknown or partially known, i.e., $y_i = f(x_i), x_i \subset X$. In practice, such [...] may contain noise, i.e., $y_i = f(x_i) + \epsilon$. The [...] of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is [...] harder than deterministic optimization. A [...] is based on sampling that starts with a [...] sampled configurations. The performance of the [...] associated to this initial samples can deliver [...] understanding of $f(\cdot)$ and guide the generation of [...] of samples. If properly guided, the process of [...] generation-evaluation-feedback-regeneration will [...] converge and the optimal configuration will be [...] a sampling-based approach of this kind can [...]

[Paper excerpt: related work and motivation.]

[...] in DevOps different versions of a system is delivered continuously, (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters), (iii) and different versions of a system often share a similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that not only reuse a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.

2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming [...] A Processing Element (PE) of type Spout reads the [...]

Partially known
Measurements subject to noise
Configuration space
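
The three callouts above (configuration space, partially known response, noisy measurements) can be illustrated with a small sketch. The parameter names, their domains, and the latency function below are made up for illustration and are not taken from the experiments; the point is only to show $X$ as a Cartesian product of domains and $y = f(x) + \epsilon$ as what is actually observed.

```python
import itertools
import random

# Hypothetical parameter domains (names and values are illustrative only).
domains = {
    "spouts": [1, 2, 3],
    "splitters": [1, 2, 3, 4],
    "counters": [1, 2, 3, 4, 5],
}

# X = Dom(X_1) x ... x Dom(X_d): every combination of options.
config_space = list(itertools.product(*domains.values()))

def f(x):
    """Made-up latency surface (ms); stands in for the unknown response."""
    spouts, splitters, counters = x
    return 100 + 20 * abs(splitters - 3) + 15 * abs(counters - 2) + 5 * spouts

def measure(x, rng, noise_sd=2.0):
    """One noisy observation y = f(x) + eps."""
    return f(x) + rng.gauss(0.0, noise_sd)

rng = random.Random(0)
# Only a subset of X is ever measured, so f is only partially known.
sampled = rng.sample(config_space, 10)
observations = {x: measure(x, rng) for x in sampled}

best_observed = min(observations, key=observations.get)  # best among measured
true_optimum = min(config_space, key=f)                  # needs all of X
```

In practice `true_optimum` is not available, because evaluating `f` on all of `config_space` is exactly the cost the approach tries to avoid.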
6. Non-linear interactions
[Figure: latency (ms) vs. number of counters (0 to 20) for splitters=2 and splitters=3.]
[Figure: latency (ms) as a function of number of counters and number of splitters; cubic interpolation over a finer grid.]
Response surface is:
- Non-linear
- Non-convex
- Multi-modal
7. The measurements are subject to variability
[Figure: latency (ms, log scale) for the deployments wc, wc+rs, wc+sol, 2wc, and 2wc+rs+sol.]
The scale of
measurement variability
is different in different
deployments
(heteroscedastic noise)
[Paper excerpt. Cropped passages are marked [...].]

[...] at points $x$ that has been [...] here consider the problem [...] $x^*$ that minimizes $f$ over [...] few experiments as possible:

$$x^* = \arg\min_{x \in X} f(x) \quad (1)$$

[...] is usually unknown or [...] $x_i \subset X$. In practice, such [...] i.e., $y_i = f(x_i) + \epsilon_i$. Note [...] only partially-known, finding [...] blackbox optimization problem [...] noise. In fact, the problem [...] non-convex and multi-modal [...] NP-hard [36]. Therefore, on [...] locate a global optimum, [...] best possible local optimum [...] budget.

It shows the non-convexity, multi-modality and the substantial performance difference between different configurations.

[Figure: latency (ms) vs. number of counters (0 to 20) for splitters=2 and splitters=3.]
Fig. 3: WordCount latency, cut through Figure 2.

[...] demonstrates that if one tries to minimize latency by acting just on one of these parameters at the time, the resulting [...]
9. GP for modeling blackbox response function
[Figure: 1D GP example showing the true function, GP mean, GP variance, observations, the selected point, and the true minimum.]

[Paper excerpt. Cropped passages are marked [...].]

In our previous work [21], we proposed BO4CO that exploits single-task GPs (no transfer learning) for prediction of posterior distribution of response functions. A GP model is composed by its prior mean ($\mu(\cdot) : X \to \mathbb{R}$) and a covariance function ($k(\cdot, \cdot) : X \times X \to \mathbb{R}$) [41]:

$$y = f(x) \sim \mathcal{GP}(\mu(x), k(x, x')), \quad (2)$$

where covariance $k(x, x')$ defines the distance between $x$ and $x'$. Let us assume $S_{1:t} = \{(x_{1:t}, y_{1:t}) \mid y_i := f(x_i)\}$ be the collection of $t$ experimental data (observations). In this framework, we treat $f(x)$ as a random variable, conditioned on observations $S_{1:t}$, which is normally distributed with the following posterior mean and variance functions [41]:

$$\mu_t(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - \mu) \quad (3)$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 I - k(x)^\top (K + \sigma^2 I)^{-1} k(x) \quad (4)$$

where $y := y_{1:t}$, $k(x)^\top = [k(x, x_1)\ k(x, x_2)\ \ldots\ k(x, x_t)]$, $\mu := \mu(x_{1:t})$, $K := k(x_i, x_j)$ and $I$ is identity matrix. The shortcoming of BO4CO is that it cannot exploit the observations regarding other versions of the system and as therefore cannot be applied in DevOps.

[...]2 TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from other previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4), and (iii) selects the next [...]

Motivations:
1- mean estimates + variance
2- all computations are linear algebra
3- good estimates even with few data
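
As a minimal sketch of the posterior mean and variance (Eq. 3 and Eq. 4), the following NumPy code evaluates both on a 1-D input. The squared-exponential kernel, its length scale, the noise variance, and the sample data are assumptions chosen for illustration; the slides do not state which kernel or hyper-parameters were used.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.5):
    """Squared-exponential covariance between 1-D arrays a and b (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise_var=1e-4, prior_mean=0.0):
    """Posterior mean and variance at x_new, following Eq. (3) and Eq. (4)."""
    K = sq_exp_kernel(x_obs, x_obs)            # K := k(x_i, x_j)
    k_star = sq_exp_kernel(x_obs, x_new)       # each column is k(x) for one new x
    A = K + noise_var * np.eye(len(x_obs))     # K + sigma^2 I
    # Eq. (3): mu_t(x) = mu(x) + k(x)^T (K + sigma^2 I)^-1 (y - mu)
    mean = prior_mean + k_star.T @ np.linalg.solve(A, y_obs - prior_mean)
    # Eq. (4): k(x, x) + sigma^2 - k(x)^T (K + sigma^2 I)^-1 k(x); k(x, x) = 1 here
    reduction = np.sum(k_star * np.linalg.solve(A, k_star), axis=0)
    var = 1.0 + noise_var - reduction
    return mean, var

# Made-up observations of a 1-D response.
x_obs = np.array([-1.0, -0.3, 0.4, 1.2])
y_obs = np.sin(3 * x_obs)
x_new = np.linspace(-1.5, 1.5, 61)
mean, var = gp_posterior(x_obs, y_obs, x_new)

# At the observed inputs the mean reproduces the data and the variance collapses.
mean_at_obs, var_at_obs = gp_posterior(x_obs, y_obs, x_obs)
```

Only linear solves against a `t x t` matrix are involved, which is the "all computations are linear algebra" point, and the variance stays large away from the four observations, which is what makes the estimate useful when data are few.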
10. Sparsity of Effects
• Correlation-based feature selector
• Merit is used to select subsets that are highly correlated with the response variable
• At most 2-3 parameters were strongly interacting with each other
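
To make the merit score concrete, here is a small sketch assuming the slide refers to the standard correlation-based feature selection (CFS) merit, $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$, where $\bar{r}_{cf}$ is the mean feature-response correlation and $\bar{r}_{ff}$ the mean feature-feature correlation. The data are synthetic and the function name `cfs_merit` is hypothetical; the exact selector configuration used in the experiments is not given in the slides.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of the feature columns in `subset` with respect to response y."""
    k = len(subset)
    # Mean absolute feature-response correlation.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute pairwise feature-feature correlation.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
# Four hypothetical integer parameters, 500 synthetic measurements.
X = rng.integers(1, 7, size=(500, 4)).astype(float)
# Synthetic response in which only parameters 0 and 2 matter.
y = 10 * X[:, 0] + 8 * X[:, 2] + rng.normal(0, 1, 500)

merit_relevant = cfs_merit(X, y, [0, 2])
merit_irrelevant = cfs_merit(X, y, [1, 3])
```

A subset scores high when its features correlate with the response but not with each other, which is how a small set of main factors can be singled out.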
TABLE I: Sparsity of effects on 5 experiments where we have varied different subsets of parameters and used different testbeds. Note that these are the datasets we experimentally measured on the benchmark systems and we use them for the evaluation, more details including the results for 6 more experiments are in the appendix.

# | Topol.  | Parameters                                                                          | Main factors | Merit | Size | Testbed
1 | wc(6D)  | 1-spouts, 2-max spout, 3-spout wait, 4-splitters, 5-counters, 6-netty min wait      | {1, 2, 5}    | 0.787 | 2880 | C1
2 | sol(6D) | 1-spouts, 2-max spout, 3-top level, 4-netty min wait, 5-message size, 6-bolts       | {1, 2, 3}    | 0.447 | 2866 | C2
3 | rs(6D)  | 1-spouts, 2-max spout, 3-sorters, 4-emit freq, 5-chunk size, 6-message size         | {3}          | 0.385 | 3840 | C3
4 | wc(3D)  | 1-max spout, 2-splitters, 3-counters                                                | {1, 2}       | 0.480 | 756  | C4
5 | wc(5D)  | 1-spouts, 2-splitters, 3-counters, 4-buffer-size, 5-heap                            | {1}          | 0.851 | 1080 | C5

Experiments on:
1. C1: OpenNebula (X)
2. C2: Amazon EC2 (Y)
3. C3: OpenNebula (3X)
4. C4: Amazon EC2 (2Y)
5. C5: Microsoft Azure (X)
11.
[Figure: 1D example over inputs x1 to x4 showing the true function, the GP surrogate, the mean estimate, and the observations.]
Fig. 5: An example of 1D GP model: GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance.
[Figure: BO4CO tool architecture. Components: Configuration Optimisation Tool (GP model), performance repository, Monitoring, Deployment Service, Data Preparation, Experimental Suite, Testbed, Doc, Data Broker, Tester, Kafka, System Under Test, Workload Generator, Technology Interface (Storm, Cassandra, Spark). Labelled data: configuration parameters and values, experiment time, polling interval.]
Algorithm 1: BO4CO
Input: Configuration space $X$, Maximum budget $N_{max}$, Response function $f$, Kernel function $K_\theta$, Hyper-parameters $\theta$, Design sample size $n$, learning cycle $N_l$
Output: Optimal configurations $x^*$ and learned model $M$
1: choose an initial sparse design (lhd) to find an initial design samples $D = \{x_1, \ldots, x_n\}$
2: obtain performance measurements of the initial design, $y_i \leftarrow f(x_i) + \epsilon_i, \forall x_i \in D$
3: $S_{1:n} \leftarrow \{(x_i, y_i)\}_{i=1}^{n}$; $t \leftarrow n + 1$
4: $M(x \mid S_{1:n}, \theta) \leftarrow$ fit a GP model to the design (Eq. 3)
5: while $t \le N_{max}$ do
6:   if ($t \bmod N_l = 0$) $\theta \leftarrow$ learn the kernel hyper-parameters by maximizing the likelihood
7:   find next configuration $x_t$ by optimizing the selection criteria over the estimated response surface given the data, $x_t \leftarrow \arg\max_x u(x \mid M, S_{1:t-1})$ (Eq. 9)
8:   obtain performance for the new configuration $x_t$, $y_t \leftarrow f(x_t) + \epsilon_t$
9:   augment the configuration $S_{1:t} = \{S_{1:t-1}, (x_t, y_t)\}$
10:  $M(x \mid S_{1:t}, \theta) \leftarrow$ re-fit a new GP model (Eq. 7)
11:  $t \leftarrow t + 1$
12: end while
13: $(x^*, y^*) = \min S_{1:N_{max}}$
14: $M(x)$
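
The loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions, not the BO4CO implementation: the initial design is random rather than a Latin hypercube, the kernel is a fixed squared-exponential with no hyper-parameter learning (step 6 is skipped), the selection criterion is a lower confidence bound with a fixed `kappa`, and the response `noisy_latency` is a made-up 1-D function.

```python
import numpy as np

def kernel(a, b, length_scale=0.3):
    """Squared-exponential covariance (an assumed kernel choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(xs, ys, grid, noise_var=1e-2):
    """Zero-mean GP posterior mean and standard deviation on `grid`."""
    A = kernel(xs, xs) + noise_var * np.eye(len(xs))
    k_star = kernel(xs, grid)
    mu = k_star.T @ np.linalg.solve(A, ys)
    var = 1.0 + noise_var - np.sum(k_star * np.linalg.solve(A, k_star), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def bo_sketch(measure, grid, n_init=4, n_max=20, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-3: initial design (random here, lhd in Algorithm 1) and its measurements.
    idx = [int(i) for i in rng.choice(len(grid), size=n_init, replace=False)]
    ys = [measure(grid[i], rng) for i in idx]
    while len(idx) < n_max:                          # step 5: until the budget is spent
        y = np.array(ys)
        y_std = (y - y.mean()) / (y.std() + 1e-9)    # put mean and std dev on one scale
        mu, sd = gp_predict(grid[idx], y_std, grid)  # steps 4 and 10: (re-)fit the GP
        u = -(mu - kappa * sd)                       # step 7: maximise the negated LCB
        u[idx] = -np.inf                             # never re-measure a configuration
        nxt = int(np.argmax(u))
        idx.append(nxt)                              # step 9: augment the data
        ys.append(measure(grid[nxt], rng))           # step 8: measure the new point
    best = int(np.argmin(ys))                        # step 13: best measured point
    return grid[idx[best]], ys[best], idx, ys

def noisy_latency(x, rng):
    """Made-up 1-D response surface with noise, y = f(x) + eps."""
    return 100 + 40 * np.sin(3 * x) + 30 * x ** 2 + rng.normal(0.0, 1.0)

grid = np.linspace(-1.5, 1.5, 61)
best_x, best_y, idx, ys = bo_sketch(noisy_latency, grid)
```

Only 20 of the 61 candidate configurations are measured; the GP decides where the next measurement is most worthwhile.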
[Figure: (a) Design of Experiment: Configuration Space, Experiment (exhaustive), Empirical Model. (b) Sequential Design: Experiment, Empirical Model, Selection Criteria, Experiment.]
12.
[Figure: four panels of response value vs. configuration domain (-1.5 to 1.5): the true response function and GP fit; the criteria evaluation; the new selected point; the new GP fit.]
Acquisition function:
[Paper excerpt. Cropped passages are marked [...].]

[...] BO4CO then fits a GP model to [...] belief about the underlying [...] (Algorithm 1). The while loop in [...] belief until the budget runs [...] $S_{1:t} = \{(x_i, y_i)\}_{i=1}^{t}$, where [...] a prior distribution $\Pr(f)$ [...] $\Pr(S_{1:t} \mid f)$ form the posterior [...] $\Pr(f)$. [...] [37], specified by its [...] covariance (see Section III-E1):

$$y = f(x) \sim \mathcal{GP}(\mu(x), k(x, x')), \quad (3)$$

where

$$\mu_t(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - \mu) \quad (7)$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 I - k(x)^\top (K + \sigma^2 I)^{-1} k(x) \quad (8)$$

These posterior functions are used to select the next point $x_{t+1}$ as detailed in Section III-C.

C. Configuration selection criteria
The selection criteria is defined as $u : X \to \mathbb{R}$ that selects $x_{t+1} \in X$, should $f(\cdot)$ be evaluated next (step 7):

$$x_{t+1} = \arg\max_{x \in X} u(x \mid M, S_{1:t}) \quad (9)$$

13.
[Figure: Storm Architecture. Logical View: Spout A, Bolt A, Bolt B connected by pipes. Physical View: Worker A, Worker B, Worker C connected by sockets, each with an out queue and an in queue.]
[Figure: Word Count Architecture. File, Stream to Kafka, Kafka Topic, Kafka Spout (sentence), Splitter Bolt (word), Counter Bolt; example output: [paintings, 3], [poems, 60], [letter, 75].]
[Figure: Rolling Sort Architecture. Twitter Stream (tweet), Twitter to Kafka, Kafka Topic, Kafka Spout, RollingCount Bolt (hashtags), Intermediate Ranking Bolt (hashtag, count), Ranking Bolt (ranking), trending topics.]

Word Count Architecture
• CPU intensive
Rolling Sort Architecture
• Memory intensive
Applications:
• Fraud detection
• Trending topics
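
The Word Count topology (spout, splitter, counter) can be mimicked with a chain of plain Python generators. This is only an illustration of the data flow: it does not use Storm or Kafka, it runs in a single process with no parallelism, and the function names and sample sentences are invented for the sketch.

```python
from collections import Counter

def kafka_spout(lines):
    """Stand-in for the Kafka Spout: emits one sentence at a time."""
    for line in lines:
        yield line

def splitter_bolt(sentences):
    """Stand-in for the Splitter Bolt: emits the words of each sentence."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word

def counter_bolt(words):
    """Stand-in for the Counter Bolt: keeps a running count per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

sentences = ["poems and letters", "paintings and poems"]
counts = counter_bolt(splitter_bolt(kafka_spout(sentences)))
```

In the real topology each stage runs as several parallel tasks, and the numbers of spouts, splitters, and counters are among the configuration parameters being tuned.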
14. Experimental results
[Figure: absolute error (log scale) vs. iteration (0 to 100) for BO4CO, SA, GA, HILL, PS, and Drift on (a) WordCount(3D) and (b) WordCount(5D).]
- 30 runs, report average performance
- Yes, we did full factorial measurements and we know where the global min is...
15. Experimental results
[Figure: absolute error (log scale) vs. iteration (0 to 200) for BO4CO, SA, GA, HILL, PS, and Drift on (a) SOL(6D) and (b) RollingSort(6D).]
16. Experimental results
[Figure: absolute error (log scale) vs. iteration (0 to 100) for BO4CO, SA, GA, HILL, PS, and Drift on (a) Branin(2D) and (b) Dixon(2D).]
18. Prediction accuracy over time
[Figure: prediction error (log scale) vs. iteration (0 to 80) for BO4CO, polyfit1, M5Tree, RegressionTree, M5Rules, LWP(GAU), and PRIM.]
19. Exploitation vs exploration
[Figure: absolute error (log scale) vs. iteration (0 to 100) for BO4CO(adaptive), BO4CO(µ:=0), BO4CO(κ:=0.1), BO4CO(κ:=1), BO4CO(κ:=6), and BO4CO(κ:=8).]
[Figure: kappa (4 to 8.5) vs. iteration (0 to 10000) for ϵ=1, ϵ=0.1, and ϵ=0.01.]
[Paper excerpt. Cropped passages are marked [...].]

[...] the next configuration to measure. Intuitively, [...] select the minimum response. This is done using [...] function $u : X \to \mathbb{R}$ that determines $x_{t+1} \in X$, [...] be evaluated next as:

$$x_{t+1} = \arg\max_{x \in X} u(x \mid M, S^1_{1:t}) \quad (11)$$

The selection criterion depends on the MTGP model $M$ [...] through its predictive mean $\mu_t(x_t)$ and variance $\sigma_t^2(x_t)$ conditioned on observations $S^1_{1:t}$. TL4CO uses the Lower Confidence Bound (LCB) [24]:

$$u_{LCB}(x \mid M, S^1_{1:t}) = \arg\min_{x \in X} \mu_t(x) - \kappa \sigma_t(x), \quad (12)$$

where $\kappa$ is a exploitation-exploration parameter. For instance, if we require to find a near optimal configuration we set a low value to $\kappa$ to take the most out of the predictive mean. However, if we are looking for a globally optimum one, we can set a high value in order to skip local minima. Furthermore, $\kappa$ can be adapted over time [22] to perform more explorations. Figure 6 shows that in TL4CO, $\kappa$ can start with a relatively [...] value at the early iterations comparing to BO4CO since the former provides a better estimate of mean and therefore contains more information at the early stages.

TL4CO output. Once the $N_{max}$ different configurations of the system under test are measured, the TL4CO algorithm terminates. Finally, TL4CO produces the outputs including the optimal configuration (step 14 in Algorithm 1) as well [...]
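
A small numeric sketch of the Lower Confidence Bound criterion, $\mu_t(x) - \kappa\,\sigma_t(x)$, shows how the exploitation-exploration parameter changes which configuration is selected. The predicted means and standard deviations below are invented for illustration, and the helper names `lcb` and `select_next` are hypothetical.

```python
import numpy as np

def lcb(mu, sigma, kappa):
    """Lower confidence bound: predictive mean minus kappa times predictive std."""
    return mu - kappa * sigma

def select_next(mu, sigma, kappa):
    """Index of the candidate with the smallest lower confidence bound."""
    return int(np.argmin(lcb(mu, sigma, kappa)))

# Invented GP predictions for five candidate configurations.
mu = np.array([120.0, 105.0, 110.0, 130.0, 125.0])   # predicted latency (ms)
sigma = np.array([1.0, 2.0, 1.0, 20.0, 3.0])         # predictive std (ms)

exploit_choice = select_next(mu, sigma, kappa=0.1)   # trusts the mean
explore_choice = select_next(mu, sigma, kappa=6.0)   # rewards uncertainty
```

With a small `kappa` the candidate with the lowest predicted latency (index 1) is chosen; with a large `kappa` the poorly explored candidate (index 3) wins because its uncertainty leaves room for a much lower latency.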
20. Runtime overhead
[Figure: elapsed time (s, 0.15 to 0.4) vs. iteration (0 to 100) for WordCount (3D), WordCount (6D), SOL (6D), RollingSort (6D), and WordCount (5D).]
- The computation time is higher for larger datasets than for those with less data.
- The computation time increases
over time since the matrix size for
Cholesky inversion gets larger.
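
The growing cost comes from the linear system $(K + \sigma^2 I)\,\alpha = y$ that has to be solved at every iteration, with one more row and column per observation. A minimal sketch of that solve through a Cholesky factorisation follows; the kernel, noise variance, and data are assumptions for illustration, and a generic solver is used for the two triangular systems to keep the example short.

```python
import numpy as np

def kernel_matrix(x, length_scale=0.5):
    """Squared-exponential covariance matrix for 1-D inputs (assumed kernel)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def solve_via_cholesky(K, y, noise_var=1e-2):
    """Solve (K + sigma^2 I) alpha = y using A = L L^T."""
    A = K + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(A)          # cost grows cubically with the number of observations
    z = np.linalg.solve(L, y)          # solve L z = y
    alpha = np.linalg.solve(L.T, z)    # solve L^T alpha = z
    return alpha, A

x = np.linspace(0.0, 10.0, 200)        # 200 made-up observation locations
y = np.sin(x)
alpha, A = solve_via_cholesky(kernel_matrix(x), y)
residual = np.max(np.abs(A @ alpha - y))
```

Each new measurement enlarges `A`, so the factorisation at iteration `t` works on a `t x t` matrix, consistent with the upward trend in the elapsed-time plot.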
[Paper excerpt. Cropped passages are marked [...].]

[...] approach using a 1-dimensional response. The [...] blue is the unknown true response, whereas the mean is shown in yellow and the 95% confidence interval at each point in the shaded red area. The stars indicate experimental measurements (or observation interchangeably). Some points $x \in X$ have a large confidence interval due to lack of observations in their neighborhood, while others have a narrow confidence. The main motivation behind the choice of Bayesian Optimization here is that it offers a framework in which reasoning can be not only based on mean estimates but also the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO that exploits single-task GPs (no transfer learning) for prediction of posterior distribution of response functions. A GP model is composed by its prior mean ($\mu(\cdot) : X \to \mathbb{R}$) and a covariance function ($k(\cdot, \cdot) : X \times X \to \mathbb{R}$) [41]:

$$y = f(x) \sim \mathcal{GP}(\mu(x), k(x, x')), \quad (2)$$

where covariance $k(x, x')$ defines the distance between $x$ and $x'$. Let us assume $S_{1:t} = \{(x_{1:t}, y_{1:t}) \mid y_i := f(x_i)\}$ be the collection of $t$ experimental data (observations). In this framework, we treat $f(x)$ as a random variable, conditioned on observations $S_{1:t}$, which is normally distributed with the following posterior mean and variance functions [41]:

$$\mu_t(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - \mu) \quad (3)$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 I - k(x)^\top (K + \sigma^2 I)^{-1} k(x) \quad (4)$$

where $y := y_{1:t}$, $k(x)^\top = [k(x, x_1)\ k(x, x_2)\ \ldots\ k(x, x_t)]$, [...]
24. The case where we learn from correlated responses
[Figure: response value vs. configuration domain (-1.5 to 1.5) in three panels. (a) 3 sample response functions (1), (2), (3) with observations. (b) GP fit for (1) ignoring observations for (2),(3): LCB not informative. (c) Multi-task GP fit for (1) by transfer learning from (2),(3): highly informative. Legend: GP prediction mean, GP prediction variance, probability distribution of the minimizers.]
25. Comparison with default and expert prescription
[Figure: average read latency (µs, ×10^4) vs. throughput (ops/sec) for TL4CO and BO4CO, with annotations for BO4CO after 20 iterations, TL4CO after 20 iterations, TL4CO after 100 iterations, the default configuration, and the configuration recommended by expert.]
[Figure: average write latency (µs) vs. throughput (ops/sec) for TL4CO and BO4CO, with annotations for TL4CO after 100 iterations, BO4CO after 100 iterations, the default configuration, and the configuration recommended by expert.]
27. Entropy of the density function of the minimizers
[Figure: entropy (0 to 10) vs. iteration (0 to 100) for T=1(BO4CO), T=2,m=100, T=2,m=200, T=2,m=300, T=2,m=400, and T=3,m=100.]
[Figure: entropy for BO4CO and TL4CO on the datasets Branin, Hartmann, WC(3D), SOL(6D), WC(5D), Dixon, WC(6D), RS(6D), and cass-20.]
[Paper excerpt. Cropped passages are marked [...].]

The knowledge about the location of optimum configuration is summarized by the approximation of the conditional probability density function of the response function minimizers, i.e., $X^* = \Pr(x^* \mid f(x))$, where $f(\cdot)$ is drawn from the MTGP model (cf. solid red line in Figure 5(b,c)). The entropy of the density functions in Figure 5(b,c) are 6.39, [...] so we know more information about the latter.

The results in Figure 19 confirm that the entropy measure of the minimizers with the models provided by TL4CO for all the datasets (synthetic and real) significantly contains more information. The results demonstrate that the main reason for finding quick convergence comparing with the baselines is that TL4CO employs a more effective model. The results in Figure 19(b) show the change of entropy of $X^*$ over time for WC(5D) dataset. First, it shows that in TL4CO, the entropy decreases sharply. However, the overall decrease of entropy for BO4CO is slow. The second observation is that [...]

Knowledge about the location of the minimizer
28. Takeaways
• Be aware of uncertainty
- Quantify the uncertainty
- Make decisions taking into account the right level of uncertainty (homoscedastic vs heteroscedastic)
- Uncertainty sometimes helps (models that provide an estimate of the uncertainty are typically more informative)
- By exploiting this knowledge you can explore only the interesting zones rather than learning the whole performance function
• You can learn from operational data
- Not only from the current version, but from previous measurements as well
- Use the learning from past measurements as prior knowledge
- Too much data can also be harmful: it can slow down or blur the learning (negative transfer)
29. Acknowledgement:
- BO4CO is part of the DevOps pipeline in H2020 DICE
- BO4CO is being acquired by TATA (TCS)
[Figure: DICE project architecture. Labels: Big Data Technologies; Cloud (Priv/Pub); DICE IDE (Profile, Plugins: Sim, Ver, Opt); DPIM, DTSM, DDSM, TOSCA; Methodology; Deploy, Config, Test; Mon, Anomaly, Trace, Iter. Enh.; Data Intensive Application (DIA); Cont. Int., Fault Inj.; work packages WP1 to WP5 and WP6 (Demonstrators).]
Code and data: https://github.com/dice-project/DICE-Configuration-BO4CO