Self-adaptive container monitoring with
performance-aware load-shedding policies
NECST Group Conference 2017 @ Sysdig
07/05/2017
Rolando Brondolin
rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
Cloud trends
• 2017 State of the cloud [1]:
– 79% of workloads run in cloud (41% public, 38% private)
– Operations focusing on:
• moving more workloads to cloud
• existing cloud usage optimization (cost reduction)
• Nowadays, Docker is becoming the de facto standard for cloud deployments
– lightweight abstraction on system resources
– fast deployment, management and maintenance
– large deployments and automatic orchestration
[1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale
[Figure: example application metrics, e.g. #requests/s, heap size, CPU usage, queue length Q(t), arrival rate λ(t), service rate μ(t), #load/s, #store/s]
Infrastructure monitoring (1)
• Container complexity demands strong monitoring capabilities
– Systematic approach for monitoring and troubleshooting
– Trade-off between data granularity and resource consumption
[Figure: the same application metrics, contrasting deep monitoring (high visibility on system state, non-negligible cost) vs. cheap monitoring (little information on system state)]
Infrastructure monitoring (2)
| High data granularity | Good data granularity | High data granularity |
| Code instrumentation  | Code instrumentation  | No instrumentation    |
| Low metrics rate      | High metrics rate     | High metrics rate     |
(three monitoring approaches compared; the last column, with no instrumentation, matches the syscall-based approach presented next)
Sysdig Cloud monitoring
http://www.sysdig.org
• Infrastructure for container monitoring
• Collects aggregated metrics and shows system state:
– “Drill-down” from cluster to single application metrics
– Dynamic network topology
– Alerting and anomaly detection
• Monitoring agent deployed on each machine in the cluster
– Traces system calls in a “streaming fashion”
– Aggregates data for Threads, FDs, applications, containers and hosts
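To make the "streaming fashion" aggregation concrete, here is a minimal Python sketch of per-second, per-container aggregation of traced events; the event tuple shape and the emit callback are illustrative assumptions, not the agent's actual API:

```python
from collections import defaultdict

def aggregate(events, emit):
    """Aggregate a stream of (timestamp, container, metric, value) tuples
    into per-container sums, emitting one record per completed second."""
    current_sec, acc = None, defaultdict(float)
    for ts, container, metric, value in events:
        sec = int(ts)
        if current_sec is not None and sec != current_sec:
            emit(current_sec, dict(acc))  # flush the completed second
            acc.clear()
        current_sec = sec
        acc[(container, metric)] += value
    if current_sec is not None:
        emit(current_sec, dict(acc))      # flush the last open second

# usage: aggregate(event_stream, lambda sec, metrics: print(sec, metrics))
```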
Problem definition
• The Sysdig Cloud agent can be modelled as a server with a finite queue
  – characterized by its arrival rate λ(t) and its service rate μ(t)
  – subject to overloading conditions
• Cause: events arrive at a very high frequency
• Effect: queues grow indefinitely
• Issues: high usage of system resources; uncontrolled loss of events; output quality degradation
[Figure: streaming system model, a server S fed by a queue Q, with arrival rate λ(t), service rate μ(t) and output flow φ(t); input stream Λ and output stream Φ]
From the model of a streaming system with queue, processing element and streaming output flow: a server S, fed by a queue Q, is in overloading when the arrival rate λ(t) is greater than the service rate μ(t). The stability condition

μ(t) ≥ λ(t)    (2.1)

states the necessary and sufficient condition to avoid overloading. A system experiencing overloading should discard part of the input to increase the service rate and match the arrival rate λ(t). The reason for formalizing this is twofold, as we are interested not only in controlling the overload, but also in maximizing the accuracy of the estimated metrics. To this end we define x, which represents the input flow at a given time t, and x̃, which represents the output flow considered in case of overloading at the same time t.
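As a quick illustration of why μ(t) ≥ λ(t) matters, a small self-contained Python sketch (with made-up rates) showing the queue length diverging when the arrival rate exceeds the service rate:

```python
def queue_length(lam, mu, steps):
    """Simulate Q(t) = max(0, Q(t-1) + lam - mu) for constant rates."""
    q = 0
    for _ in range(steps):
        q = max(0, q + lam - mu)
    return q

print(queue_length(lam=1000, mu=1200, steps=60))  # stable: stays at 0
print(queue_length(lam=1500, mu=1200, steps=60))  # overload: grows to 18000
```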
Proposed solution: FFWD
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques: a general approach that leverages domain-specific details.
• FFWD components: the Load Manager decides *when* to shed, the Policy wrapper decides *how much* (producing the shedding plan), the LS Filter decides *where*, and the aggregated metrics correction compensates for dropped events
• Goals: mitigate high usage of system resources, avoid uncontrolled loss of events, minimize output quality degradation
Utilization-based Load Manager
The system can be modeled by means of Queueing Theory: the application is a single server node fed by a queue, which provides the input jobs at a variable arrival rate λ(t); the application is able to serve jobs at a service rate μ(t). The system measures λ(t) and μ(t) in events per second, where the events are respectively the input tweets and the serviced tweets.

Starting from this, the simplest way to model the system behavior is by means of Little's law (1), which states that the number of jobs inside a system is equal to the input arrival rate times the system response time:

N(t) = λ(t) · R(t)                  (1)
Q(t) = Q(t−1) + λ(t) − μ(t)         (2)
U(t) = λ(t)/μmax + Q(t)/μmax        (3)
Q(t) = μmax · U(t) − λ(t)           (4)
e(t) = U(t) − Ū                     (5)
This leads to the final formulation of the Load Manager (4.14), where the throughput at time t + 1 is a function of the arrival rate plus the maximum available throughput times the feedback error:

Control error:          e(t) = U(t) − Ū                  (4.13)
Requested throughput:   μ(t+1) = λ(t) + μmax · e(t)      (4.14)
The system can be characterized
by its utilization and its queue size
[Figure: FFWD architecture, Load Manager, LS Filter, Policies, Shedding Plan (SP), metrics correction]
• The Load Manager computes the throughput μ(t) that ensures stability, such that:

λ(t) ≤ μ(t)    (4.7)

The actuation μ(t+1) is the sum of two contributions: as the error e(t) goes to zero, the stability condition (4.7) is met; the term μmax · e(t) ensures fast actuation in case of a significant deviation from the system equilibrium. During the lifetime of the system, the arrival rate λ(t) can vary unpredictably and can be greater than the system capacity μc(t), defined as the rate of events serviced per second. Given the control action μ(t) (i.e., the throughput of the system) and the system capacity, we can define μd(t) as the dropping rate of the LS; the current system capacity is estimated as the number of events analyzed in the last time period. Thus, for a given time t, equation (4.8) shows that the service rate is the sum of the estimated system capacity and the number of events we need to drop to achieve the required stability:

μ(t) = μc(t−1) + μd(t)    (4.8)
Terms: U(t): current CPU utilization; Ū: target utilization; λ(t): arrival rate (arrived events); Q(t): residual events; μmax: maximum theoretical throughput; e(t): control error.
The requested throughput is used by the load shedding policies to derive the LS probabilities.
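To make the control loop concrete, here is a minimal Python sketch of one step of the utilization-based Load Manager under the formulas above; the measurement inputs are assumed to come from the monitoring agent and the constants are illustrative:

```python
def load_manager_step(lam, q, mu_c_prev, mu_max, u_target):
    """One control step of the utilization-based Load Manager.

    lam       -- measured arrival rate λ(t) (events/s)
    q         -- residual events in the queue Q(t)
    mu_c_prev -- system capacity estimated in the last period μc(t-1)
    mu_max    -- maximum theoretical throughput
    u_target  -- utilization set point Ū (illustrative value below)
    """
    u = lam / mu_max + q / mu_max          # U(t), eq. (3)
    e = u - u_target                       # control error, eq. (4.13)
    mu_next = lam + mu_max * e             # requested throughput, eq. (4.14)
    mu_d = max(0.0, mu_next - mu_c_prev)   # dropping rate μd(t), from eq. (4.8)
    return mu_next, mu_d

mu_next, mu_d = load_manager_step(lam=1.5e6, q=2e5, mu_c_prev=1.2e6,
                                  mu_max=2e6, u_target=0.9)
print(mu_next, mu_d)  # requested throughput and events/s to shed
```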
Policy wrapper and policies
• The policy wrapper provides access to statistics of processes,
the requested throughput μ(t+1) and the system capacity μc(t)
Fair policy
• Assign to each process the "same" number of events
• Saves the metrics of small processes, while keeping accurate results on big ones
Priority-based policy
• Assign a static priority to each process
• Compute a weighted priority to partition
the system capacity
• Assign a partition to each process and
compute the probabilities
Baseline policy
• Compute one LS probability for all processes (with μ(t+1) and μc(t)); see the sketch below
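The following Python sketch shows how the three policies might derive per-process keep probabilities from μ(t+1) and μc(t); the process-statistics structure is a hypothetical stand-in for the policy wrapper's data, assumed for illustration:

```python
def baseline(mu_next, mu_c):
    """One keep probability shared by every process."""
    return min(1.0, mu_next / mu_c) if mu_c > 0 else 1.0

def fair(procs, mu_next):
    """Give each process the same event budget: small processes are kept
    intact, big ones are shed proportionally more."""
    budget = mu_next / len(procs)
    return {p: min(1.0, budget / rate) for p, rate in procs.items()}

def priority(procs, prios, mu_next):
    """Partition capacity by weighted priority, then derive probabilities."""
    total = sum(prios[p] for p in procs)
    return {p: min(1.0, (mu_next * prios[p] / total) / rate)
            for p, rate in procs.items()}

procs = {"nginx": 800_000, "fio": 1_300_000}   # events/s per process
print(fair(procs, mu_next=1_000_000))
print(priority(procs, {"nginx": 3, "fio": 4}, mu_next=1_000_000))
```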
Load Shedding Filter
• The Load Shedding Filter applies the probabilities 

computed by the policies to the input stream
• For each event:
• Look for load shedding probability depending on input class
• If no data is found we can drop the event
• Otherwise, apply the Load Shedding probability computed by the policy
• The dropped events are reported to the application for metrics correction
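A minimal Python sketch of the per-event decision, assuming a shedding plan keyed by input class (e.g., the process generating the syscall); the plan layout is illustrative:

```python
import random

def ls_filter(event_class, plan, dropped):
    """Return True to keep the event, False to shed it.

    plan    -- dict mapping input class -> keep probability
    dropped -- per-class counter, reported back for metrics correction
    """
    p_keep = plan.get(event_class)
    if p_keep is None or random.random() >= p_keep:
        dropped[event_class] = dropped.get(event_class, 0) + 1
        return False  # unknown class or unlucky draw: shed the event
    return True

plan, dropped = {"nginx": 0.6, "fio": 0.4}, {}
events = ["nginx", "fio", "redis"] * 1000
kept = [e for e in events if ls_filter(e, plan, dropped)]
print(len(kept), dropped)  # redis is always shed: it is not in the plan
```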
[Figure: Load Shedding Filter, events from Event Capture are matched against the drop probability in the Shedding Plan; 'ok' events go to the event buffers, 'ko' events are dropped]
Experimental setup
• We evaluated FFWD within Sysdig with 2 goals:
  • system stability (slide 13)
  • output quality (slides 14-17)
• Results compared with the reference filtering system of Sysdig
• Evaluation setup:
  • 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
  • 128 GB DDR4 RAM
• Tests selected from the Phoronix test suite
Homogeneous benchmarks (syscall-intensive, from the Phoronix test suite):

test ID | name       | priority | # evts/s
A       | nginx      | 3        | 800K
B       | postmark   | 4        | 1.2M
C       | fio        | 4        | 1.3M
D       | simplefile | 2        | 1.5M
E       | apache     | 2        | 1.9M

Heterogeneous benchmarks:

test ID | instances                      | # evts/s
F       | 3x nginx, 1x fio               | 1.3M
G       | 1x nginx, 1x simplefile        | 1.3M
H       | 1x apache, 2x postmark, 1x fio | 1.8M
System stability
• We evaluated the Load Manager with all the tests (A-H)
• With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity)
• Measuring the CPU load of the sysdig agent with:
  • the reference implementation
  • FFWD with the fair and priority policies
• We compared the actual CPU load with the QoS requirement (Ut)
• Error measured with MAPE (lower is better), obtained by running each benchmark 20 times
• 3.51x average MAPE improvement; average MAPE below 5%
MAPE at Ut = 1.1%:

Test | reference | fair  | priority
A    | 7.12%     | 1.78% | 3.78%
B    | 34.06%    | 4.37% | 4.46%
C    | 28.03%    | 2.27% | 2.24%
D    | 11.52%    | 1.41% | 1.54%
E    | 26.02%    | 8.51% | 8.99%
F    | 22.67%    | 8.11% | 3.74%
G    | 16.42%    | 3.37% | 2.73%
H    | 19.92%    | 8.41% | 8.01%
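For reference, the error metric used throughout is the standard mean absolute percentage error between the exact metric y_t and its approximation ŷ_t over n samples:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
```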
Output quality - heterogeneous
• We mixed the homogeneous tests to:
  • simulate a co-located environment
  • add OS scheduling uncertainty and noise
• QoS requirement Ut = 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Metrics compared across reference, FFWD fair, FFWD priority
• Three tests with different syscall mixes:
  • network-based, mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
  • mixed, mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
  • mixed, high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
1x Fio, 3x Nginx, 1.3M evt/s
[Figure: MAPE (%, log scale) of latency metrics (latency-file, latency-net) and volume metrics (byte r/w: volume-file, volume-net) per process (fio, nginx-1, nginx-2, nginx-3), comparing reference (kernel-drop), fair, and priority; MAPE lower is better]
1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
[Figure: MAPE (%, log scale) of latency metrics and volume metrics (byte r/w) per process (apache, fio, postmark-1, postmark-2), comparing reference (kernel-drop), fair, and priority; MAPE lower is better]
[Figure: MAPE (%, log scale) of latency and volume metrics per process (apache, fio, postmark-1, postmark-2) for test H, as on the previous slide]
Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
• The Fair policy outperforms the reference in almost all cases:
  • the LS Filter works at the single-event level
  • the reference drops events in batches
• The Priority policy improves on the Fair policy results in most cases:
  • prioritized processes are privileged
  • other processes are treated as "best-effort"
Conclusion
• We saw the main challenges of Load Shedding for container monitoring
– Low overhead monitoring
– High quality and granularity of metrics
• Fast Forward With Degradation (FFWD)
– Heuristic controller for bounded CPU usage
– Pluggable policies for domain-specific load shedding
– Accurate computation of output metrics
– Load Shedding Filter for fast drop of events
Questions?
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D. Santambrogio. In Proceedings of the 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
BACKUP SLIDES
Output quality - homogeneous
• QoS requirement Ut = 1.1%, the standard set point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• FFWD fair policy achieves similar or better results w.r.t. the reference
  • FFWD stays accurate even though it drops more events
  • predictable, repetitive behavior of nginx, fio and apache
[Figure: MAPE (%, log scale) of latency-file, latency-net, volume-file, volume-net for reference (kernel-drop) vs. fair, one panel per benchmark: nginx (800K evt/s), postmark (1.2M evt/s), fio (1.3M evt/s), simplefile (1.5M evt/s), apache (1.9M evt/s)]
1x simplefile, 1x nginx, 1.3M evt/s
[Figure: MAPE (%, log scale) of latency metrics and volume metrics (byte r/w) for simplefile and nginx, comparing reference (kernel-drop), fair, and priority; MAPE lower is better]
Response time Load Manager
• The system can be characterized by its response time and the jobs in the system (Little's Law: N(t) = λ(t) · R(t))
• Control error: computed from the old response time and the target response time
• Requested throughput: computed from the arrival rate and the control error
• The requested throughput is used by the load shedding policies to derive the LS probabilities
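The controller formulas on these backup slides are not legible in the extraction; a plausible reconstruction, assuming the controller simply inverts Little's law toward the target response time R̄ (an assumption, not the slide's verbatim math), is:

```latex
\begin{aligned}
N(t) &= \lambda(t)\,R(t) && \text{(Little's law)}\\
e(t) &= R(t) - \bar{R} && \text{(control error: old vs. target response time)}\\
\mu(t+1) &= \frac{N(t)}{\bar{R}} = \lambda(t)\,\frac{R(t)}{\bar{R}} && \text{(throughput needed to reach } \bar{R}\text{)}
\end{aligned}
```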
Case studies

System monitoring [2]
• Goal: distributed monitoring of systems and applications w/ syscalls
• Constraint: CPU utilization
• Based on: Sysdig monitoring agent
• Output: aggregated performance metrics for applications, containers, hosts
• FFWD ensures low CPU overhead
  • policies based on the processes in the system

Sentiment analysis [1]
• Goal: perform real-time analysis on tweets
• Constraint: latency
• Based on: Stanford NLP toolkit
• Output: aggregated sentiment score for each keyword and hashtag
• FFWD keeps the response time bounded
  • policies on tweet keywords and #hashtags

[1] http://nlp.stanford.edu [2] http://www.sysdig.org
Real-time sentiment analysis
• Real-time sentiment analysis makes it possible to:
  – track the sentiment of a topic over time
  – correlate real-world events and the related sentiment, e.g.
    • Toyota crisis (2010) [1]
    • 2012 US Presidential Election Cycle [2]
  – track the online evolution of companies' reputation, derive social profiling and enable enhanced social marketing strategies
[1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series, 2011.
[2] Wang, Hao, et al. "A system for real-time Twitter sentiment analysis of the 2012 US presidential election cycle." Proceedings of the ACL 2012 System Demonstrations.
Sentiment analysis: case study
• Simple Twitter streaming sentiment analyzer built with Stanford NLP
• System components:
  – event producer
  – RabbitMQ queue
  – event consumer
• Consumer components:
  – Event Capture
  – Sentiment Analyzer
  – Sentiment Aggregator
• Real-time queue consumption; aggregated metrics emitted each second (keyword and hashtag sentiment), as sketched below
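A minimal Python sketch of the aggregator's per-second emission of average sentiment per keyword; the event shape (timestamp, keyword, score) is an assumption for illustration:

```python
from collections import defaultdict

class SentimentAggregator:
    """Accumulates (keyword, score) pairs and emits the per-keyword
    average sentiment for each one-second window."""
    def __init__(self, emit):
        self.emit = emit
        self.acc = defaultdict(lambda: [0.0, 0])  # keyword -> [sum, count]
        self.window = None

    def add(self, ts, keyword, score):
        sec = int(ts)
        if self.window is not None and sec != self.window:
            self.flush()  # close the previous one-second window
        self.window = sec
        self.acc[keyword][0] += score
        self.acc[keyword][1] += 1

    def flush(self):
        self.emit(self.window, {k: s / n for k, (s, n) in self.acc.items()})
        self.acc.clear()

agg = SentimentAggregator(emit=print)
for ts, kw, sc in [(0.1, "#PL", 0.8), (0.7, "#PL", 0.2), (1.2, "#PL", 0.5)]:
    agg.add(ts, kw, sc)
agg.flush()  # prints window 0 (avg 0.5), then window 1 (avg 0.5)
```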
FFWD: Sentiment analysis
• FFWD adds four components:
  – a load shedding filter at the beginning of the pipeline
  – the shedding plan used by the filter
  – a domain-specific policy wrapper
  – an application controller (the Load Manager) to detect load peaks
[Figure: FFWD sentiment pipeline, Producer → real-time queue (plus a batch queue) → Load Shedding Filter (ok/ko, drop probability from the Shedding Plan) → Event Capture → Sentiment Analyzer → Sentiment Aggregator; the Policy Wrapper updates the plan using stream stats and the requested throughput μ(t+1) from the Load Manager, which observes λ(t), R(t) and the ko count; dropped events are accounted for metrics correction]
Sentiment - experimental setup
• Separate tests to understand FFWD behavior:
  – system stability
  – output quality
• Dataset: 900K tweets from the 35th week of the Premier League
• Performed tests:
  – controller: synthetic and real tweets at various λ(t)
  – policy: real tweets at various λ(t)
• Evaluation setup:
  – Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
  – 8 GB RAM @ 1600 MHz
System stability
• λ(t) estimation (see the sketch below):
  – case A: λ(t) = λ(t−1) (last observed value)
  – case B: λ(t) = avg(λ(t)) (running average)
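A small Python sketch of the two estimators, assuming the arrival rate is sampled once per control period (illustrative, not the exact implementation):

```python
def estimate_last(samples):
    """Case A: predict the next arrival rate as the last observed one."""
    return samples[-1]

def estimate_avg(samples):
    """Case B: predict the next arrival rate as the running average."""
    return sum(samples) / len(samples)

observed = [100, 120, 400, 390]  # events/s per control period
print(estimate_last(observed))   # 390
print(estimate_avg(observed))    # 252.5
```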
Load Manager showcase (1)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– response time:
[Figure: controller performance, response time (s) over time (s) for 300 s, with the QoS = 5 s set point marked]
Load Manager showcase (2)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– throughput:
[Figure: actuation, #events over time (s): lambda, dropped, computed, mu]
Output Quality
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies:
  – baseline
  – fair
  – priority
• R = 5 s; λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage Error (MAPE %, lower is better)
[Figure: MAPE (%) per group (A-D) for the baseline, fair and priority policies, one panel per arrival rate: λ(t) = 100 evt/s, 200 evt/s, 400 evt/s]
More Related Content

What's hot

06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 

What's hot (20)

06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Big Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open WorkshopBig Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open Workshop
 
FFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationFFWD - Fast Forward With Degradation
FFWD - Fast Forward With Degradation
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream Processing
 
Exploiting a Synergy between Greedy Approach and NSGA for Scheduling in Compu...
Exploiting a Synergy between Greedy Approach and NSGA for Scheduling in Compu...Exploiting a Synergy between Greedy Approach and NSGA for Scheduling in Compu...
Exploiting a Synergy between Greedy Approach and NSGA for Scheduling in Compu...
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
 
BREEZE 3D Analyst for the Advanced AERMOD Modeler
BREEZE 3D Analyst for the Advanced AERMOD ModelerBREEZE 3D Analyst for the Advanced AERMOD Modeler
BREEZE 3D Analyst for the Advanced AERMOD Modeler
 
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing SystemsLatency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
 
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
 
Minimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data WarehousesMinimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data Warehouses
 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Daamen r 2010scwr-cpaper
Daamen r 2010scwr-cpaperDaamen r 2010scwr-cpaper
Daamen r 2010scwr-cpaper
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
 
The data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesThe data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architectures
 
Continental division of load and balanced ant
Continental division of load and balanced antContinental division of load and balanced ant
Continental division of load and balanced ant
 
Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
 

Similar to Self-adaptive container monitoring with performance-aware Load-Shedding policies

Design Of A PI Rate Controller For Mitigating SIP Overload
Design Of A PI Rate Controller For Mitigating SIP OverloadDesign Of A PI Rate Controller For Mitigating SIP Overload
Design Of A PI Rate Controller For Mitigating SIP Overload
Yang Hong
 
Mitigating SIP Overload Using a Control-Theoretic Approach
Mitigating SIP Overload Using a Control-Theoretic ApproachMitigating SIP Overload Using a Control-Theoretic Approach
Mitigating SIP Overload Using a Control-Theoretic Approach
Yang Hong
 
Queuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depthQueuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depth
IdcIdk1
 
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Andrea Tino
 

Similar to Self-adaptive container monitoring with performance-aware Load-Shedding policies (20)

Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa...
[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa...[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa...
[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa...
 
Design Of A PI Rate Controller For Mitigating SIP Overload
Design Of A PI Rate Controller For Mitigating SIP OverloadDesign Of A PI Rate Controller For Mitigating SIP Overload
Design Of A PI Rate Controller For Mitigating SIP Overload
 
REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D...
REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D...REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D...
REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D...
 
Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...
 
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
Fairness in Transfer Control Protocol for Congestion Control in Multiplicativ...
 
Mitigating SIP Overload Using a Control-Theoretic Approach
Mitigating SIP Overload Using a Control-Theoretic ApproachMitigating SIP Overload Using a Control-Theoretic Approach
Mitigating SIP Overload Using a Control-Theoretic Approach
 
Proportional-integral genetic algorithm controller for stability of TCP network
Proportional-integral genetic algorithm controller for stability of TCP network Proportional-integral genetic algorithm controller for stability of TCP network
Proportional-integral genetic algorithm controller for stability of TCP network
 
Robust PID Controller Design for Non-Minimum Phase Systems using Magnitude Op...
Robust PID Controller Design for Non-Minimum Phase Systems using Magnitude Op...Robust PID Controller Design for Non-Minimum Phase Systems using Magnitude Op...
Robust PID Controller Design for Non-Minimum Phase Systems using Magnitude Op...
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Linux capacity planning
Linux capacity planningLinux capacity planning
Linux capacity planning
 
Automated Parameterization of Performance Models from Measurements
Automated Parameterization of Performance Models from MeasurementsAutomated Parameterization of Performance Models from Measurements
Automated Parameterization of Performance Models from Measurements
 
Pdcs2010 balman-presentation
Pdcs2010 balman-presentationPdcs2010 balman-presentation
Pdcs2010 balman-presentation
 
Queuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depthQueuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depth
 
Size measurement and estimation
Size measurement and estimationSize measurement and estimation
Size measurement and estimation
 
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
 
Fpga implementation of optimal step size nlms algorithm and its performance a...
Fpga implementation of optimal step size nlms algorithm and its performance a...Fpga implementation of optimal step size nlms algorithm and its performance a...
Fpga implementation of optimal step size nlms algorithm and its performance a...
 
Fpga implementation of optimal step size nlms algorithm and its performance a...
Fpga implementation of optimal step size nlms algorithm and its performance a...Fpga implementation of optimal step size nlms algorithm and its performance a...
Fpga implementation of optimal step size nlms algorithm and its performance a...
 

More from NECST Lab @ Politecnico di Milano

Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
NECST Lab @ Politecnico di Milano
 

More from NECST Lab @ Politecnico di Milano (20)

Mesticheria Team - WiiReflex
Mesticheria Team - WiiReflexMesticheria Team - WiiReflex
Mesticheria Team - WiiReflex
 
Punto e virgola Team - Stressometro
Punto e virgola Team - StressometroPunto e virgola Team - Stressometro
Punto e virgola Team - Stressometro
 
BitIt Team - Stay.straight
BitIt Team - Stay.straight BitIt Team - Stay.straight
BitIt Team - Stay.straight
 
BabYodini Team - Talking Gloves
BabYodini Team - Talking GlovesBabYodini Team - Talking Gloves
BabYodini Team - Talking Gloves
 
printf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTonprintf("Nome Squadra"); Team - NeoTon
printf("Nome Squadra"); Team - NeoTon
 
BlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking PlatformBlackBoard Team - Motion Tracking Platform
BlackBoard Team - Motion Tracking Platform
 
#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome#include<brain.h> Team - HomeBeatHome
#include<brain.h> Team - HomeBeatHome
 
Flipflops Team - Wave U
Flipflops Team - Wave UFlipflops Team - Wave U
Flipflops Team - Wave U
 
Bug(atta) Team - Little Brother
Bug(atta) Team - Little BrotherBug(atta) Team - Little Brother
Bug(atta) Team - Little Brother
 
#NECSTCamp: come partecipare
#NECSTCamp: come partecipare#NECSTCamp: come partecipare
#NECSTCamp: come partecipare
 
NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1NECSTCamp101@2020.10.1
NECSTCamp101@2020.10.1
 
NECSTLab101 2020.2021
NECSTLab101 2020.2021NECSTLab101 2020.2021
NECSTLab101 2020.2021
 
TreeHouse, nourish your community
TreeHouse, nourish your communityTreeHouse, nourish your community
TreeHouse, nourish your community
 
TiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architectureTiReX: Tiled Regular eXpressionsmatching architecture
TiReX: Tiled Regular eXpressionsmatching architecture
 
Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposing
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification System
 
Luns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural networkLuns - Automatic lungs segmentation through neural network
Luns - Automatic lungs segmentation through neural network
 
BlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAsBlastFunction: How to combine Serverless and FPGAs
BlastFunction: How to combine Serverless and FPGAs
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matching
 

Recently uploaded

Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
Kamal Acharya
 
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsRS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
Atif Razi
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 

Recently uploaded (20)

KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and ClusteringKIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
 
Peek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfPeek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdf
 
Furniture showroom management system project.pdf
Furniture showroom management system project.pdfFurniture showroom management system project.pdf
Furniture showroom management system project.pdf
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsRS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES  INTRODUCTION UNIT-IENERGY STORAGE DEVICES  INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
 
Pharmacy management system project report..pdf
Pharmacy management system project report..pdfPharmacy management system project report..pdf
Pharmacy management system project report..pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 

Self-adaptive container monitoring with performance-aware Load-Shedding policies

  • 1. 1 Self-adaptive container monitoring with performance-aware load-shedding policies NECST Group Conference 2017 @ Sysdig 07/05/2017 Rolando Brondolin rolando.brondolin@polimi.it DEIB, Politecnico di Milano
  • 2. Cloud trends • 2017 State of the cloud [1]: – 79% of workloads run in cloud (41% public, 38% private) – Operations focusing on: • moving more workloads to cloud • existing cloud usage optimization (cost reduction) 2 • Nowadays Docker is becoming the de-facto standard for Cloud deployments – lightweight abstraction on system resources – fast deployment, management and maintenance – large deployments and automatic orchestration [1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale
  • 3. 3 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s#load/s
  • 4. Infrastructure monitoring (1) • Container complexity demands strong monitoring capabilities – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption 4 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 5. • Container complexity demands strong monitoring capabilities – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption few information on system state cheap monitoring high visibility on system state non negligible cost Infrastructure monitoring (2) 5 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s VS High data granularity Good data granularity High data granularity Code instrumentation Code instrumentation No instrumentation Low metrics rate High metrics rate High metrics rate
  • 6. Sysdig Cloud monitoring 6 http://www.sysdig.org • Infrastructure for container monitoring • Collects aggregated metrics and shows system state: – “Drill-down” from cluster to single application metrics – Dynamic network topology – Alerting and anomaly detection • Monitoring agent deployed on each machine in the cluster – Traces system calls in a “streaming fashion” – Aggregates data for Threads, FDs, applications, containers and hosts
  • 7. IssuesEffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 7 Events arrives at really high frequency Queues grow indefinitely High usage of system resources Uncontrolled 
 loss of events S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If Output quality degradation
  • 8. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 8 Load Manager *when* aggregated metrics correction LS Filter *where* Policy wrapper *how much* shedding plan Mitigate high usage of system resources Avoid uncontrolled 
 loss of events minimize output quality degradation
  • 9. Utilization-based Load Manager The system in Figure 1 can be modeled by means of Queuing Theory: the application is a single server node fed by a queue, which provides the input jobs at a variable arrival rate (t); the application is able to serve jobs at a service rate µ(t). The system measures (t) and µ(t) in events per second, where the events are respectively the input tweets and the serviced tweets. Starting from this, the simplest way to model the system behavior is by means of the Little’s law (1), which states that the number of jobs inside a system is equal to the input arrival rate times the system response time: N(t) = (t) · R(t) (1) Q(t) = Q(t 1) + (t) µ(t) (2) U(t) = (t) µmax + Q(t) µmax (3) Q(t) = µmax · U(t) (t) (4) e(t) = U(t) U(t 1) (5) he system in Figure 1 can be modeled by means of euing Theory: the application is a single server node fed a queue, which provides the input jobs at a variable arrival (t); the application is able to serve jobs at a service µ(t). The system measures (t) and µ(t) in events per ond, where the events are respectively the input tweets and serviced tweets. tarting from this, the simplest way to model the system avior is by means of the Little’s law (1), which states that number of jobs inside a system is equal to the input arrival times the system response time: N(t) = (t) · R(t) (1) Q(t) = Q(t 1) + (t) µ(t) (2) U(t) = (t) µmax + Q(t) µmax (3) Q(t) = µmax · U(t) (t) (4) e(t) = U(t) U(t 1) (5) S: Control error: 4.3. Policy wrapper and equation (4.13). This leads to the final formulation of the Loa (4.14), where the throughput at time t + 1 is a function of th the maximum available throughput times the feedback error. e(t) = U(t) ¯U µ(t + 1) = (t) + µmax · e(t) The Load Manager formulation just obtained is compose the one hand, when the contribution of the feedback error e( Requested throughput: 4.3. Policy wrapper and L equation (4.13). This leads to the final formulation of the Load (4.14), where the throughput at time t + 1 is a function of the the maximum available throughput times the feedback error. e(t) = U(t) ¯U µ(t + 1) = (t) + µmax · e(t) The Load Manager formulation just obtained is composed the one hand, when the contribution of the feedback error e(t condition of equation (4.15) is met; on the other hand, the secon The system can be characterized by its utilization and its queue size Load Manager 9 Load Manager LS Filter Policies SP Metrics correction • The Load Manager computes the throughput μ(t) that ensures stability such that: we analyze the formulation for the Load Manager’s actuation µ(t+1) just obtained, ice that it is a sum of two different contributions. On the one hand, as the error e(t) to zero, the stability condition (4.7) is met. On the other hand, the contribution: (t) ensures a fast actuation in case of a significant deviation from the actual system rium. (t)  µ(t) (4.7) course, during the lifetime of the system, the arrival rate (t) can vary unpre- ly and can be greater than the system capacity µc(t), defined as the rate of events ted per second. Given the control action µ(t) (i.e., the throughput of the system) e system capacity, we can define µd(t) as the dropping rate of the LS. As we did ), we can estimate the current system capacity as the number of events analyzed last time period. 
Thus, for a given time t, equation (4.8) shows that the service the sum of the system capacity estimated and the number of events that we need p to achieve the required stability: µ(t) = µc(t 1) + µd(t) (4.8) Utilization s section we describe the Utilization based Load Manager, which becomes of use e of streaming applications which should operate with a limited overhead. The tion based Load Manager, which is showed in Figure 4.4, resorts to queuing theory CPU utilization Arrived events Residual events Current utilization Target utilization Arrival rate Max theoretical throughput Control errorThe requested throughput is used by the load shedding policies to derive the LS probabilities
  • 10. Policy wrapper and policies • The policy wrapper provides access to statistics of processes, the requested throughput μ(t+1) and the system capacity μc(t) 10 Fair policy • Assign to each process the “same" number 
 of events • Save metrics of small processes, still accurate results on big ones Priority-based policy • Assign a static priority to each process • Compute a weighted priority to partition the system capacity • Assign a partition to each process and compute the probabilities Load Manager LS Filter Policies SP Metrics correction Baseline policy • Compute one LS probability for all processes (with μ(t+1) and μc(t))
  • 11. Load Shedding Filter • The Load Shedding Filter applies the probabilities 
 computed by the policies to the input stream • For each event: • Look for load shedding probability depending on input class • If no data is found we can drop the event • Otherwise, apply the Load Shedding probability computed by the policy • The dropped events are reported to the application for metrics correction 11 Load Manager LS Filter Policies SP Metrics correction Load Shedding Filter Shedding Plan event buffers ok drop probability Event Capture ko
  • 12. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 13) • Output quality (slides 14 15 16 17) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 12 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 13. System stability 13 • We evaluated the Load Manager with all the tests (A, B, C, D, E, F, G) • With 3 different set points (Ut 1.0%, 1.1%, 1.2% w.r.t. system capacity) • Measuring the CPU load of the sysdig agent with: • reference implementation • FFWD with fair and priority policy • We compared the actual CPU load
 with the QoS requirement (Ut) • Error measured with MAPE (lower 
 is better) obtained running 20 times 
 each benchmark • 3.51x average MAPE improvement,
 average MAPE below 5% Test Ut = 1.1% reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 14. Output quality - heterogeneous • We tried to mix the homogeneous tests • simulate co-located environment • add OS scheduling uncertainty and noise • QoS requirement Ut 1.1% • MAPE (lower is better) between exact and approximated metrics • Compare metrics from reference, FFWD fair, FFWD priority • Three tests with different syscall mix: • Network based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s • Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s • Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 14
  • 15. 1x Fio, 3x Nginx, 1.3M evt/s 15 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio reference 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority nginx-3nginx-2nginx-1fio referenceVolume metrics (byte r/w) Latency metrics MAPE lower is better
  • 16. 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 16 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net volum e-filevolum e-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache reference 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache 0.1 1 10 100 1000 10000 100000 latency-filelatency-net latency-filelatency-net latency-filelatency-net latency-filelatency-net MAPE(%)log kernel-drop fair priority postmark-2postmark-1fioapache reference Volume metrics (byte r/w) Latency metrics MAPE lower is better
  • 17. Test H, mixed workloads: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s [Plots: MAPE (%) on a log scale, lower is better, for volume metrics (byte r/w) and latency metrics per process (apache, fio, postmark-1, postmark-2), comparing reference (kernel-drop), fair and priority policies] • The Fair policy outperforms the reference in almost all cases • the LS Filter works at the single-event level, while the reference drops events in batches (a per-event sketch follows this slide) • The Priority policy improves on the Fair policy results in most cases • the prioritized processes are privileged • the other processes are treated as “best-effort”
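Since the slides contrast per-event shedding with the kernel's batch drops, here is a minimal sketch of what a per-event, plan-driven filter can look like. `SheddingPlan`, `LoadSheddingFilter` and the per-process keying are hypothetical names for illustration, not FFWD's actual code:

```cpp
// Minimal sketch of a per-event load-shedding filter (illustrative only:
// names and structure are hypothetical, not FFWD's actual implementation).
#include <random>
#include <string>
#include <unordered_map>

// Shedding plan: per-process drop probability, filled in by the policy.
using SheddingPlan = std::unordered_map<std::string, double>;

class LoadSheddingFilter {
public:
    explicit LoadSheddingFilter(const SheddingPlan& plan) : plan_(plan) {}

    // Decide event by event, instead of dropping whole batches like the
    // in-kernel drop mode: each event is kept with probability
    // 1 - p(process), so every process keeps a representative sample.
    bool should_drop(const std::string& process) {
        auto it = plan_.find(process);
        double p = (it != plan_.end()) ? it->second : 0.0;
        return uniform_(rng_) < p;
    }

private:
    const SheddingPlan& plan_;
    std::mt19937 rng_{std::random_device{}()};
    std::uniform_real_distribution<double> uniform_{0.0, 1.0};
};
```

Deciding per event is what keeps the approximated metrics close to the exact ones: every process still contributes a proportional sample, whereas a batch drop can erase entire time windows for some processes.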
  • 18. Conclusion • We saw the main challenges of Load Shedding for container monitoring – Low overhead monitoring – High quality and granularity of metrics • Fast Forward With Degradation (FFWD) – Heuristic controller for bounded CPU usage – Pluggable policies for domain-specific load shedding – Accurate computation of output metrics – Load Shedding Filter for fast drop of events
  • 19. Questions? Rolando Brondolin, rolando.brondolin@polimi.it DEIB, Politecnico di Milano NGC VIII 2017 @ SF R. Brondolin, M. Ferroni, M. D. Santambrogio. “FFWD: Latency-aware event stream processing via domain-specific load-shedding policies.” In Proceedings of the 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016).
  • 22. Output quality - homogeneous • QoS requirement: Ut = 1.1%, the standard set-point for the agent • MAPE (lower is better) between exact and approximated metrics • Output metrics on latency and volume for file and network operations • FFWD's fair policy achieves similar or better results w.r.t. the reference • FFWD stays accurate even though it drops more events • nginx, fio and apache show predictable, repetitive behavior • [Plots: MAPE (%) on a log scale for latency-file, latency-net, volume-file, volume-net, comparing reference (kernel-drop) and fair policy, for apache 1.9M evt/s, postmark 1.2M evt/s, simplefile 1.5M evt/s, fio 1.3M evt/s, nginx 800K evt/s]
  • 23. 1x Simplefile, 1x Nginx, 1.3M evt/s [Plots: MAPE (%) on a log scale, lower is better, for latency and volume metrics (file and net) per process (simplefile, nginx), comparing reference (kernel-drop), fair and priority policies]
  • 24. Load Manager - Response time • The system S can be characterized by its response time and by the jobs in the system, related by Little's Law: jobs in the system = arrival rate × response time • Control error: computed from the old (measured) response time and the target response time • Requested throughput: derived from the arrival rate and the control error • The requested throughput is used by the load-shedding policies to derive the LS probabilities (a hedged sketch follows this slide)
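The deck shows the actual FFWD equations only as figures, so the control law below is an assumed, illustrative reconstruction built from the quantities the slides name (Little's Law, old/target response time, control error, arrival rate, requested throughput), not the published formulas:

```cpp
// Hedged sketch of one Load Manager control step (assumed control law).
struct LoadManager {
    double r_target;  // target response time R (QoS set-point)

    // One step per control period: from the measured arrival rate
    // lambda(t) and the old (measured) response time r_old, compute
    // mu(t+1), the requested throughput handed to the LS policies.
    double step(double lambda, double r_old) const {
        double error = r_old - r_target;   // control error e(t)
        if (error <= 0.0) return lambda;   // on target: admit everything
        // Little's Law (N = lambda * R): shrinking the admitted rate by
        // r_target / r_old pulls the response time back toward target.
        return lambda * (r_target / r_old);
    }
};

// The policies then turn mu(t+1) into an overall drop probability:
// p = max(0, 1 - mu_next / lambda).
```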
  • 27. Case studies • System monitoring [2] • Goal: distributed monitoring of systems and applications w/ syscalls • Constraint: CPU utilization • Based on: Sysdig monitoring agent • Output: aggregated performance metrics for applications, containers, hosts • FFWD ensures low CPU overhead • policies based on the processes in the system • Sentiment analysis [1] • Goal: perform real-time analysis on tweets • Constraint: latency • Based on: Stanford NLP toolkit • Output: aggregated sentiment score for each keyword and hashtag • FFWD keeps the response time bounded • policies on tweet keywords and #hashtags • [1] http://nlp.stanford.edu [2] http://www.sysdig.org
  • 29. Real-time sentiment analysis • Real-time sentiment analysis allows us to: – Track the sentiment of a topic over time – Correlate real-world events with the related sentiment, e.g. • Toyota crisis (2010) [1] • 2012 US Presidential Election Cycle [2] – Track the online evolution of a company's reputation, derive social profiles and enable enhanced social marketing strategies • [1] Bifet Figuerol, Albert Carles, et al. “Detecting sentiment change in Twitter streaming data.” Journal of Machine Learning Research: Workshop and Conference Proceedings Series, 2011. [2] Wang, Hao, et al. “A system for real-time Twitter sentiment analysis of the 2012 US presidential election cycle.” Proceedings of the ACL 2012 System Demonstrations.
  • 30. Sentiment analysis: case study • Simple Twitter streaming sentiment analyzer built with Stanford NLP • System components: – Event producer – RabbitMQ queue – Event consumer • Consumer components: – Event Capture – Sentiment Analyzer – Sentiment Aggregator • The queue is consumed in real time; aggregated metrics (keyword and hashtag sentiment) are emitted every second
  • 31. FFWD: Sentiment analysis • FFWD adds four components (see the sketch after this slide): – Load shedding filter at the beginning of the pipeline – Shedding plan used by the filter – Domain-specific policy wrapper – Load manager (application-level controller) to detect load peaks • [Diagram: input tweets flow from the Producer through the Load Shedding Filter, which drops events with the probability given by the Shedding Plan; kept events pass through Event Capture, Sentiment Analyzer and Sentiment Aggregator to the output metrics; the Policy Wrapper updates the plan and the Load Manager closes the loop using λ(t), R(t), stream stats and μ(t+1)]
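To make the "how much" step concrete, below is a hedged sketch of how a fair and a priority policy could fill the shedding plan for the sentiment pipeline. The function names, the per-key rates and the budget split are assumptions for illustration, not the published policy definitions:

```cpp
// Hedged sketch of "fair" vs. "priority" policy wrappers filling the
// shedding plan (per keyword/hashtag); names and logic are illustrative.
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using SheddingPlan = std::unordered_map<std::string, double>;

// Fair: every key sheds the same fraction of its traffic.
SheddingPlan fair_policy(const std::vector<std::string>& keys,
                         double lambda, double mu_requested) {
    double p = (lambda > 0.0)
                   ? std::max(0.0, 1.0 - mu_requested / lambda)
                   : 0.0;
    SheddingPlan plan;
    for (const auto& k : keys) plan[k] = p;
    return plan;
}

// Priority: privileged keys are served first; whatever budget remains
// is split fairly among the best-effort keys.
SheddingPlan priority_policy(
        const std::vector<std::string>& privileged,
        const std::vector<std::string>& best_effort,
        const std::unordered_map<std::string, double>& rate_per_key,
        double mu_requested) {
    SheddingPlan plan;
    double budget = mu_requested;

    double priv_rate = 0.0;
    for (const auto& k : privileged) priv_rate += rate_per_key.at(k);
    // Privileged keys are not shed as long as they fit in the budget.
    double p_priv = (priv_rate <= budget) ? 0.0 : 1.0 - budget / priv_rate;
    for (const auto& k : privileged) plan[k] = p_priv;
    budget = std::max(0.0, budget - priv_rate);

    double be_rate = 0.0;
    for (const auto& k : best_effort) be_rate += rate_per_key.at(k);
    double p_be = (be_rate <= budget) ? 0.0 : 1.0 - budget / be_rate;
    for (const auto& k : best_effort) plan[k] = p_be;
    return plan;
}
```

This mirrors the results slides: under priority, the privileged keys keep near-exact metrics while the best-effort keys absorb the shedding.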
  • 32. Sentiment - experimental setup • Separate tests to understand FFWD behavior: – System stability – Output quality • Dataset: 900K tweets from the 35th week of the Premier League • Performed tests: – Controller: synthetic and real tweets at various λ(t) – Policy: real tweets at various λ(t) • Evaluation setup: – Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC – 8 GB RAM @ 1600 MHz
  • 33. System stability • λ(t) estimation (sketch below): – case A: λ(t) = λ(t-1) – case B: λ(t) = avg(λ(t))
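A minimal sketch of the two estimators, assuming case A means "reuse the last observed rate" and case B means "use the running average" (both interpretations inferred from the formulas on the slide):

```cpp
// Hedged sketch of the two arrival-rate estimators named on the slide.
struct LambdaEstimator {
    double last = 0.0;  // last observed arrival rate
    double sum = 0.0;   // running sum for the average
    long   n = 0;       // number of observations

    void observe(double lambda) { last = lambda; sum += lambda; ++n; }

    double case_a() const { return last; }               // lambda(t) = lambda(t-1)
    double case_b() const { return n ? sum / n : 0.0; }  // lambda(t) = avg(lambda)
};
```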
  • 34. Load Manager showcase (1) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – response time: [Plot “Controller performance”: response time (s) vs. time (s) over 300 s, with the QoS = 5 s set-point R marked]
  • 35. Load Manager showcase (2) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – throughput: [Plot “Actuation”: events per second vs. time (s) over 300 s, showing lambda, dropped events and computed mu]
  • 36. Output Quality • Real tweets, μc(t) ≃ 40 evt/s • Evaluated policies: – Baseline – Fair – Priority • R = 5s; λ(t) = 100 evt/s, 200 evt/s, 400 evt/s • Error metric: Mean Absolute Percentage Error (MAPE %), lower is better • [Plots: MAPE (%) per group (A, B, C, D) for baseline, fair and priority policies, one plot each for λ(t) = 100, 200 and 400 evt/s]