Self-adaptive container monitoring with performance-aware Load-Shedding policies

Self-adaptive container monitoring with
performance-aware load-shedding policies
NECST Group Conference 2017 @ Oracle Labs
07/06/2017
Rolando Brondolin
rolando.brondolin@polimi.it
DEIB, Politecnico di Milano

Cloud trends
• 2017 State of the cloud [1]:
– 79% of workloads run in cloud (41% public, 38% private)
– Operations focusing on:
• moving more workloads to cloud
• existing cloud usage optimization (cost reduction)
• Nowadays Docker is becoming the de-facto standard for Cloud deployments
– lightweight abstraction on system resources
– fast deployment, management and maintenance
– large deployments and automatic orchestration
[1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale

[Figure: metrics exposed by a monitored system: #requests/s, heap size, CPU usage, Q(t), λ(t), μ(t), #store/s, #load/s]

Infrastructure monitoring (1)
• Container complexity demands strong monitoring capabilities
– Systematic approach for monitoring and troubleshooting
– Tradeoff between data granularity and resource consumption
• Little information on system state, cheap monitoring
VS
• High visibility on system state (#requests/s, heap size, CPU usage, Q(t), λ(t), μ(t), #store/s, #load/s), non-negligible cost

Infrastructure monitoring (2)
• Three monitoring approaches compared on the same tradeoff:
High data granularity    Good data granularity    High data granularity
Code instrumentation     Code instrumentation     No instrumentation
Low metrics rate         High metrics rate        High metrics rate

Sysdig Cloud monitoring
http://www.sysdig.org
• Infrastructure for container monitoring
• Collects aggregated metrics and shows system state:
– “Drill-down” from cluster to single application metrics
– Dynamic network topology
– Alerting and anomaly detection
• Monitoring agent deployed on each machine in the cluster
– Traces system calls in a “streaming fashion”
– Aggregates data for Threads, FDs, applications, containers and hosts

Problem definition
• The Sysdig Cloud agent can be modelled as a server S with a finite queue Q
• Characterized by its arrival rate λ(t) and its service rate μ(t)
• Subject to overloading conditions: S is overloaded when the arrival rate λ(t) is greater than the service rate μ(t)
• Stability condition, necessary and sufficient to avoid overloading: $\mu(t) \ge \lambda(t)$
• A system experiencing overloading should discard part of the input to increase the service rate to match the arrival rate λ(t)
• The goal is twofold: control overloading and maximize the accuracy of the estimated metrics
[Figure: streaming system with queue Q, server S, input flow λ(t), service rate μ(t) and output flow φ(t)]
• Cause: events arrive at a really high frequency
• Effect: queues grow indefinitely
• Issues:
– High usage of system resources
– Uncontrolled loss of events
– Output quality degradation
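
The cause, effect and issue chain above can be reproduced with a small simulation. This is a minimal sketch, not the agent's code: the function name `simulate_queue`, the discrete time steps and the sample rates are all assumptions made for illustration.

```python
def simulate_queue(arrivals, service_rate, capacity):
    """Simulate a finite queue fed with arrivals[t] events per time step.

    Returns the queue length after each step and the number of events
    lost because the queue was full (the uncontrolled loss).
    """
    queue = 0
    lengths = []
    lost = 0
    for lam in arrivals:
        # Queue balance Q(t) = Q(t-1) + lambda(t) - mu(t), never below zero.
        queue = max(0, queue + lam - service_rate)
        if queue > capacity:
            # Finite queue: whatever does not fit is lost without control.
            lost += queue - capacity
            queue = capacity
        lengths.append(queue)
    return lengths, lost


# Hypothetical load peak: arrivals exceed the service rate for three steps.
lengths, lost = simulate_queue([80, 120, 150, 150, 90], service_rate=100, capacity=60)
print(lengths, lost)  # [0, 20, 60, 60, 50] 60
```

While λ(t) stays above μ(t) the queue saturates and events are lost at arbitrary points of the stream, which is the behaviour a load-shedding scheme tries to replace with a controlled drop.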

Proposed solution: FFWD
• Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques
• General approach that leverages domain-specific details
• Goals:
– Mitigate high usage of system resources
– Avoid uncontrolled loss of events
– Minimize output quality degradation
• Components:
– Load Manager (*when*)
– LS Filter (*where*)
– Policy wrapper and shedding plan
– Correction of the aggregated metrics

Utilization-based Load Manager
• The system can be modeled by means of queuing theory: the application is a single server node fed by a queue, which provides the input jobs at a variable arrival rate λ(t) and serves them at a service rate μ(t), both measured in events per second
• Little's law: the number of jobs inside a system is equal to the input arrival rate times the system response time
$N(t) = \lambda(t) \cdot R(t)$
• Queue evolution:
$Q(t) = Q(t-1) + \lambda(t) - \mu(t)$
• The Load Manager computes the throughput μ(t) that ensures stability, such that:
$\lambda(t) \le \mu(t)$
• The system can be characterized by its utilization and its queue size
S: $U(t) = \frac{\lambda(t)}{\mu_{max}} + \frac{Q(t)}{\mu_{max}}$, hence $Q(t) = \mu_{max} \cdot U(t) - \lambda(t)$
(U(t): CPU utilization; λ(t): arrived events; Q(t): residual events)
• Control error (current utilization minus target utilization):
$e(t) = U(t) - \bar{U}$
• Requested throughput (arrival rate plus max theoretical throughput times control error):
$\mu(t+1) = \lambda(t) + \mu_{max} \cdot e(t)$
– As the error e(t) goes to zero, the stability condition is met
– The contribution $\mu_{max} \cdot e(t)$ ensures a fast actuation in case of a significant deviation from the system equilibrium
• The arrival rate λ(t) can vary unpredictably and can be greater than the system capacity μc(t), estimated as the number of events analyzed in the last time period; with μd(t) the dropping rate of the load shedder:
$\mu(t) = \mu_c(t-1) + \mu_d(t)$
• The requested throughput is used by the load shedding policies to derive the LS probabilities
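
The control step of the utilization-based Load Manager can be written in a few lines. This is a sketch under stated assumptions: the function names and sample values are hypothetical. The formulas are the slide's own: U(t) = λ(t)/μmax + Q(t)/μmax, e(t) = U(t) − Ū, μ(t+1) = λ(t) + μmax · e(t), and μ(t) = μc(t−1) + μd(t).

```python
def utilization(arrival_rate, queue_len, mu_max):
    """U(t) = lambda(t) / mu_max + Q(t) / mu_max."""
    return (arrival_rate + queue_len) / mu_max


def requested_throughput(arrival_rate, queue_len, mu_max, target_utilization):
    """mu(t+1) = lambda(t) + mu_max * e(t), with e(t) = U(t) - target."""
    error = utilization(arrival_rate, queue_len, mu_max) - target_utilization
    return arrival_rate + mu_max * error


def dropping_rate(requested, capacity):
    """mu_d = mu - mu_c, clamped at zero when capacity is sufficient."""
    return max(0.0, requested - capacity)


# Hypothetical values: 1200 evt/s arriving, 300 residual events,
# 1000 evt/s max theoretical throughput, target utilization 1.0.
mu_next = requested_throughput(1200, 300, mu_max=1000, target_utilization=1.0)
print(mu_next)                       # 1700.0
print(dropping_rate(mu_next, 1000))  # 700.0
```

With the error at zero the requested throughput equals the arrival rate, so nothing beyond the incoming flow has to be absorbed. A positive error raises the request above λ(t), and the excess over the capacity is what the load shedder drops.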

Policy wrapper and policies
• The policy wrapper provides access to statistics of processes, the
requested throughput μ(t+1) and the system capacity μc(t)
Fair policy
• Assign to each process the "same" number of events
• Preserve the metrics of small processes while keeping accurate results on big ones
Priority-based policy
• Assign a static priority to each process
• Compute a weighted priority to partition the system capacity
• Assign a partition to each process and compute the probabilities
Baseline policy
• Compute one LS probability for all processes (with μ(t+1) and
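
One possible reading of the three policies is sketched below. The slides only name the policies, so every formula here is an assumption: the baseline drops the share of the requested throughput that exceeds the capacity, the fair policy uses a max-min fair split, and the priority policy splits the capacity proportionally to the static priorities. All function names are hypothetical.

```python
def baseline_policy(requested, capacity):
    """One drop probability for all processes (assumed: 1 - mu_c / mu(t+1))."""
    if requested <= 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - capacity / requested))


def fair_policy(rates, capacity):
    """Max-min fair split of the capacity (an assumed notion of 'same').

    rates maps a process name to its arrival rate in events/s.
    Returns a map from process name to drop probability.
    """
    remaining = capacity
    pending = dict(rates)
    share = {}
    while pending:
        quota = remaining / len(pending)
        small = {p: r for p, r in pending.items() if r <= quota}
        if not small:
            # Every remaining process wants more than the quota.
            for p in pending:
                share[p] = quota
            break
        for p, r in small.items():
            # Small processes keep all their events.
            share[p] = r
            remaining -= r
            del pending[p]
    return {p: (1.0 - share[p] / r if r > 0 else 0.0) for p, r in rates.items()}


def priority_policy(rates, priorities, capacity):
    """Partition the capacity proportionally to the static priorities."""
    total = sum(priorities[p] for p in rates)
    probs = {}
    for p, r in rates.items():
        partition = capacity * priorities[p] / total
        probs[p] = max(0.0, 1.0 - partition / r) if r > 0 else 0.0
    return probs


print(baseline_policy(1600, 1000))                     # 0.375
print(fair_policy({"small": 100, "big": 900}, 500))
print(priority_policy({"a": 600, "b": 600}, {"a": 3, "b": 1}, 800))
```

In the fair example the small process keeps all of its events while the big one is sampled, which matches the stated intent of preserving the metrics of small processes.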

Load Shedding Filter
• The Load Shedding Filter applies the probabilities computed by the policies to the input stream
• For each event:
• Look for load shedding probability depending on input class
• If no data is found we can drop the event
• Otherwise, apply the Load Shedding probability computed by the policy
• The dropped events are reported to the application for metrics correction
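[Figure: Event Capture feeds the Load Shedding Filter, which reads the drop probability from the Shedding Plan; kept events (ok) go to the event buffers, dropped events (ko) are reported to Metrics]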
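
The per-event steps listed above can be sketched as follows. The class name `LoadSheddingFilter`, the dictionary-based shedding plan and the drop counters are assumptions for illustration; the actual filter runs inside the Sysdig agent.

```python
import random
from collections import Counter


class LoadSheddingFilter:
    """Applies per-class drop probabilities to a stream of events."""

    def __init__(self, shedding_plan, rng=None):
        # shedding_plan maps an input class to a drop probability in [0, 1].
        self.plan = shedding_plan
        self.rng = rng or random.Random()
        # Dropped events per class, reported for metrics correction.
        self.dropped = Counter()

    def accept(self, event_class):
        """Return True to keep the event (ok), False to drop it (ko)."""
        p_drop = self.plan.get(event_class)
        # No entry in the plan: drop. Otherwise drop with probability p_drop.
        if p_drop is None or self.rng.random() < p_drop:
            self.dropped[event_class] += 1
            return False
        return True


ls_filter = LoadSheddingFilter({"nginx": 0.0, "fio": 1.0})
print(ls_filter.accept("nginx"))    # True: probability 0.0 never drops
print(ls_filter.accept("fio"))      # False: probability 1.0 always drops
print(ls_filter.accept("unknown"))  # False: class missing from the plan
print(dict(ls_filter.dropped))      # {'fio': 1, 'unknown': 1}
```

A single dictionary lookup and one random draw per event keep the filter cheap, and the counters are what the application would read to correct the aggregated metrics.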

Experimental setup
• We evaluated FFWD within Sysdig with 2 goals:
– System stability (slide 13)
– Output quality (slides 14 to 17)
• Results compared with the reference filtering system of Sysdig
• Evaluation setup
– 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
– 128 GB DDR4 RAM
– Tests selected from the Phoronix test suite (syscall-intensive benchmarks)
Homogeneous benchmarks
test ID   name         priority   # evts/s
A         nginx        3          800K
B         postmark     4          1.2M
C         fio          4          1.3M
D         simplefile   2          1.5M
E         apache       2          1.9M
Heterogeneous benchmarks
test ID   instances                         # evts/s
F         3x nginx, 1x fio                  1.3M
G         1x nginx, 1x simplefile           1.3M
H         1x apache, 2x postmark, 1x fio    1.8M

System stability
• We evaluated the Load Manager with all the tests (A, B, C, D, E, F, G, H)
– With 3 different set points (Ut 1.0%, 1.1%, 1.2% w.r.t. system capacity)
– Measuring the CPU load of the sysdig agent with the reference implementation and with FFWD (fair and priority policy)
• We compared the actual CPU load with the QoS requirement (Ut)
– Error measured with MAPE (lower is better), obtained running each benchmark 20 times
– 3.51x average MAPE improvement, average MAPE below 5%
MAPE with Ut = 1.1%
Test   reference   fair    priority
A      7.12%       1.78%   3.78%
B      34.06%      4.37%   4.46%
C      28.03%      2.27%   2.24%
D      11.52%      1.41%   1.54%
E      26.02%      8.51%   8.99%
F      22.67%      8.11%   3.74%
G      16.42%      3.37%   2.73%
H      19.92%      8.41%   8.01%
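
For reference, MAPE is the mean of the absolute relative errors, expressed as a percentage. The sketch below uses the standard definition; the function name and the sample CPU loads are illustrative assumptions, not measurements from the evaluation.

```python
def mape(reference, measured):
    """Mean Absolute Percentage Error between two equal-length sequences.

    Reference values must be non-zero, since each error is relative to them.
    """
    if len(reference) != len(measured) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((r - m) / r) for r, m in zip(reference, measured))
    return 100.0 * total / len(reference)


# Hypothetical CPU loads (%) sampled against a set point Ut = 1.1%.
target = [1.1, 1.1, 1.1]
actual = [1.0, 1.2, 1.1]
print(round(mape(target, actual), 2))
```

For the stability test the reference sequence is the set point repeated for each sample; for the output quality tests it is the exact metric computed without dropping events.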

Output quality - heterogeneous
• We tried to mix the homogeneous tests
• simulate co-located environment
• add OS scheduling uncertainty and noise
• QoS requirement Ut 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Compare metrics from reference, FFWD fair, FFWD priority
• Three tests with different syscall mix:
• Network based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
• Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
• Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s

1x Fio, 3x Nginx, 1.3M evt/s
[Figure: MAPE (%, log scale, lower is better) of kernel-drop (reference), fair and priority for fio, nginx-1, nginx-2 and nginx-3; panels for Volume metrics (byte r/w: volume-file, volume-net) and Latency metrics (latency-file, latency-net)]

1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
[Figure: MAPE (%, log scale) of kernel-drop (reference), fair and priority for apache, fio, postmark-1 and postmark-2; panels for Volume metrics (byte r/w: volume-file, volume-net) and Latency metrics]

Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
[Figure: MAPE (%, log scale) of kernel-drop (reference), fair and priority per process; panels for Volume metrics (byte r/w: volume-file, volume-net) and Latency metrics]
• Fair policy outperforms reference in almost all cases
– the LS Filter works at the single event level
– reference drops events in batches
• Priority policy improves the Fair policy results in most cases
– the prioritized processes are privileged
– other processes are treated as "best-effort"

Conclusion
• We saw the main challenges of Load Shedding for container monitoring
– Low overhead monitoring
– High quality and granularity of metrics
• Fast Forward With Degradation (FFWD)
– Heuristic controller for bounded CPU usage
– Pluggable policies for domain-specific load shedding
– Accurate computation of output metrics
– Load Shedding Filter for fast drop of events

Questions?
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D.
Santambrogio. In Proceedings of 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)

Output quality - homogeneous
• QoS requirement Ut 1.1%, standard set-point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• FFWD fair policy gives similar or better results than the reference
• FFWD stays accurate even when it drops more events
• Predictable and repetitive behavior of nginx, fio and apache
[Figure: MAPE (%, log scale) of kernel-drop (reference) and fair on latency-file, latency-net, volume-file and volume-net; one panel per benchmark: nginx (800K evt/s), postmark (1.2M evt/s), fio (1.3M evt/s), simplefile (1.5M evt/s), apache (1.9M evt/s)]

1x simplefile, 1x nginx, 1.3M evt/s
[Figure: MAPE (%, log scale) of kernel-drop (reference), fair and priority for simplefile and nginx; panels for Volume metrics (byte r/w: volume-file, volume-net) and Latency metrics]

Response time Load Manager
• The system can be characterized by its response time and the jobs in the system
• S: described by Little's Law and by the number of jobs in the system
• Control error: derived from the old response time and the target response time
• Requested throughput: derived from the arrival rate and the control error
• The requested throughput is used by the load shedding policies to derive the LS probabilities
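
Little's Law, on which this Load Manager relies, states that the number of jobs in a system equals the arrival rate times the response time. A small worked example, with hypothetical function names and illustrative numbers only:

```python
def jobs_in_system(arrival_rate, response_time):
    """Little's Law: N = lambda * R."""
    return arrival_rate * response_time


def response_time(jobs, arrival_rate):
    """Little's Law solved for the response time: R = N / lambda."""
    return jobs / arrival_rate


# 100 evt/s with a 5 s response time means 500 events in the system.
print(jobs_in_system(100, 5))    # 500
# 800 events in the system at 100 evt/s means an 8 s response time.
print(response_time(800, 100))   # 8.0
```

Read this way, bounding the response time at a given arrival rate amounts to bounding the number of jobs allowed in the system.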

Case studies
Sentiment analysis [1]
• Goal: perform real-time analysis on tweets
• Constraint: Latency
• Based on: Stanford NLP toolkit
• Output: aggregated sentiment score for each keyword and hashtag
• FFWD keeps the response time bounded
• policies on tweet keyword and #hashtag
System monitoring [2]
• Goal: Distributed monitoring of systems and applications w/syscalls
• Constraint: CPU utilization
• Based on: Sysdig monitoring agent
• Output: aggregated performance metrics for applications, containers, hosts
• FFWD ensures low CPU overhead
• policies based on processes in the system
[1] http://nlp.stanford.edu [2] http://www.sysdig.org

Real-time sentiment analysis
• Real-time sentiment analysis makes it possible to:
– Track the sentiment of a topic over time
– Correlate real world events and related sentiment, e.g.
• Toyota crisis (2010) [1]
• 2012 US Presidential Election Cycle [2]
– Track the online evolution of companies' reputation, derive social profiling and enable enhanced social marketing strategies
[1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research:
Workshop and Conference Proceedings Series. 2011.
[2] Wang, Hao, et al. "A system for real-time twitter sentiment analysis of 2012 us presidential election cycle." Proceedings of the ACL
2012 System Demonstrations.

Sentiment analysis: case study
• Simple Twitter streaming sentiment analyzer with Stanford NLP
• System components:
– Event producer
– RabbitMQ queue
– Event consumer
• Consumer components:
– Event Capture
– Sentiment Analyzer
– Sentiment Aggregator
• Real-time queue consumption, aggregated metrics emission each second
(keywords and hashtag sentiment)

FFWD: Sentiment analysis
• FFWD adds four components:
– Load shedding filter at the beginning of the pipeline
– Shedding plan used by the filter
– Domain-specific policy wrapper
– Application controller manager to detect load peaks
[Figure: FFWD pipeline for sentiment analysis. Components: Producer, Load Shedding Filter, Event Capture, Sentiment Analyzer, Sentiment Aggregator, Policy Wrapper, Load Manager. Data structure: Shedding Plan. Queues: real-time queue, batch queue. Flows: input tweets, λ(t), drop probability, ok, ko, ko count, analyze event, account metrics, R(t), Rt, stream stats, updated plan, μ(t+1), output metrics]

Sentiment - experimental setup
• Separate tests to understand FFWD behavior:
– System stability
– Output quality
• Dataset: 900K tweets of 35th week of Premier League
• Performed tests:
– Controller: synthetic and real tweets at various λ(t)
– Policy: real tweets at various λ(t)
• Evaluation setup
– Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
– 8 GB RAM @ 1600 MHz

System stability
• λ(t) estimation:
– case A: λ(t) = λ(t-1)
– case B: λ(t) = avg(λ(t))
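
The two estimation strategies can be sketched as small estimator classes. The class names are hypothetical, and the sliding window used for case B is an assumption: the slide does not say over which samples the average is taken.

```python
from collections import deque


class LastValueEstimator:
    """Case A: estimate lambda(t) with the previous observation lambda(t-1)."""

    def __init__(self):
        self.last = 0.0

    def update(self, observed):
        self.last = observed

    def estimate(self):
        return self.last


class AverageEstimator:
    """Case B: estimate lambda(t) with the mean of recent observations."""

    def __init__(self, window=10):
        # Assumed sliding window; the oldest samples are discarded.
        self.samples = deque(maxlen=window)

    def update(self, observed):
        self.samples.append(observed)

    def estimate(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)


case_a, case_b = LastValueEstimator(), AverageEstimator()
for rate in [100, 200, 300]:
    case_a.update(rate)
    case_b.update(rate)
print(case_a.estimate(), case_b.estimate())  # 300 200.0
```

The last-value estimator reacts immediately to a load change, while the averaged one smooths out spikes at the cost of a slower reaction.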

Load Manager showcase (1)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– response time:
[Figure: Controller performance. Response time (s) over time (s), 0 to 300 s: measured R against QoS = 5s]

Load Manager showcase (2)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– throughput:
[Figure: Actuation. #Events over time (s), 0 to 300 s: lambda, dropped, computed, mu]

Output Quality
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies:
• Baseline
• Fair
• Priority
• R = 5s, λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage
Error (MAPE %) (lower is better)
[Figure: MAPE (%) per group (A, B, C, D) for baseline_error, fair_error and priority_error; one panel each for λ(t) = 100 evt/s, 200 evt/s and 400 evt/s]
