1
Self-adaptive container monitoring with
performance-aware load-shedding policies
NECST Group Conference 2017 @ Pinterest
06/05/2017
Rolando Brondolin
rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
Cloud trends
• 2017 State of the cloud [1]:
– 79% of workloads run in cloud (41% public, 38% private)
– Operations focusing on:
• moving more workloads to cloud
• existing cloud usage optimization (cost reduction)
2
• Nowadays Docker is becoming the de-facto standard for Cloud deployments
– lightweight abstraction on system resources
– fast deployment, management and maintenance
– large deployments and automatic orchestration
[1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale
3
[Figure: example monitored metrics: #requests/s, heap size, CPU usage; queue model Q(t), λ(t), μ(t); #load/s, #store/s]
Infrastructure monitoring (1)
• Container complexity demands strong monitoring capabilities
– Systematic approach for monitoring and troubleshooting
– Tradeoff between data granularity and resource consumption
4
[Figure: application metrics (#requests/s, heap size), queue model (Q(t), λ(t), μ(t)), system events (CPU usage, #load/s, #store/s)]
little information on system state, cheap monitoring
VS
high visibility on system state, non-negligible cost
Infrastructure monitoring (2) 5
• Container complexity demands strong monitoring capabilities
– Systematic approach for monitoring and troubleshooting
– Tradeoff between data granularity and resource consumption
little information on system state, cheap monitoring VS high visibility on system state, non-negligible cost
[Figure: application metrics (#requests/s, heap size), queue model (Q(t), λ(t), μ(t)), system events (CPU usage, #load/s, #store/s)]

High data granularity | Good data granularity | High data granularity
Code instrumentation  | Code instrumentation  | No instrumentation
Low metrics rate      | High metrics rate     | High metrics rate
Sysdig Cloud monitoring 6
http://www.sysdig.org
• Infrastructure for container monitoring
• Collects aggregated metrics and shows system state:
– “Drill-down” from cluster to single application metrics
– Dynamic network topology
– Alerting and anomaly detection
• Monitoring agent deployed on each machine in the cluster
– Traces system calls in a “streaming fashion”
– Aggregates data for threads, file descriptors (FDs), applications, containers and hosts
Problem definition
• The Sysdig Cloud agent can be modelled as a server with a finite queue
– characterized by its arrival rate λ(t) and its service rate μ(t)
– subject to overloading conditions
7
Cause: events arrive at a really high frequency
Effect: queues grow indefinitely
Issues: high usage of system resources, uncontrolled loss of events, output quality degradation

[Diagram: a server S fed by a queue Q, with arrival rate λ(t), service rate μ(t) and output rate φ(t)]

Consider a streaming system with a queue, a processing element and a streaming output flow φ(t). A server S, fed by a queue Q, is in overloading when the arrival rate λ(t) is greater than the service rate μ(t). The stability condition

μ(t) ≥ λ(t)  (2.1)

is the necessary and sufficient condition to avoid overloading. A system experiencing overloading should discard part of the input to increase μ(t) to match the arrival rate λ(t).
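The overloading condition above can be illustrated with a tiny simulation of the queue evolution Q(t) = Q(t−1) + λ(t) − μ(t). This is a minimal sketch; the rates are hypothetical round numbers, not measurements from the Sysdig agent.

```python
# A tiny simulation of the queue evolution Q(t) = Q(t-1) + lambda(t) - mu(t)
# for the server model above; the rates are hypothetical round numbers, not
# measurements from the Sysdig agent.

def queue_evolution(arrival_rate, service_rate, steps):
    """Queue length over time for constant lambda(t) and mu(t);
    the queue length cannot go below zero."""
    q, history = 0, []
    for _ in range(steps):
        q = max(0, q + arrival_rate - service_rate)
        history.append(q)
    return history

# mu(t) >= lambda(t): stability condition (2.1) holds, the queue stays empty.
print(queue_evolution(800_000, 1_000_000, 10)[-1])    # 0
# lambda(t) > mu(t): overloading, the queue grows by 900K events per step.
print(queue_evolution(1_900_000, 1_000_000, 10)[-1])  # 9000000
```

When λ(t) exceeds μ(t), no amount of buffering helps: the backlog grows linearly at λ(t) − μ(t) events per step, which is exactly why part of the input must be discarded.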
Proposed solution: FFWD
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks
in streaming applications via load-shedding techniques
a general approach that leverages domain-specific details
8

Load Manager (*when*): mitigate the high usage of system resources
Policy wrapper (*how much*): minimize the output quality degradation
LS Filter (*where*): avoid the uncontrolled loss of events

The policies produce a shedding plan for the LS Filter, and the aggregated metrics receive a correction for the dropped events.
Load Manager 9
[Diagram: Load Manager, Policies, LS Filter, shedding plan (SP), metrics correction]
• The Load Manager computes the throughput μ(t) that ensures stability such that:

μ(t) ≥ λ(t)

• The service rate is the sum of the estimated system capacity μc and the dropping rate μd of the load shedder:

μ(t) = μc(t−1) + μd(t)

• Starting from Little's law N(t) = λ(t) · R(t), the system can be characterized by its utilization and its queue size:

Q(t) = Q(t−1) + λ(t) − μ(t)
U(t) = λ(t)/μmax + Q(t)/μmax
(λ(t): arrived events, Q(t): residual events, U(t): CPU utilization, μmax: max theoretical throughput)

• Control error, current utilization vs. target utilization:

e(t) = U(t) − Ū

• Requested throughput, arrival rate plus max theoretical throughput times the control error:

μ(t+1) = λ(t) + μmax · e(t)
The requested throughput is used by the load shedding policies to derive the LS probabilities
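The utilization-based update above can be sketched in a few lines. The target utilization Ū (`u_target`) and all numeric values in the example are illustrative assumptions, not parameters from FFWD's evaluation.

```python
# Sketch of the utilization-based Load Manager update: U(t), e(t) and
# mu(t+1) as on the slide. u_target and the rate values are assumptions.

def utilization(lam, queue, mu_max):
    """U(t) = lambda(t)/mu_max + Q(t)/mu_max: arrived plus residual events
    over the max theoretical throughput."""
    return (lam + queue) / mu_max

def requested_throughput(lam, queue, mu_max, u_target):
    """mu(t+1) = lambda(t) + mu_max * e(t), with e(t) = U(t) - U_target."""
    e = utilization(lam, queue, mu_max) - u_target
    return lam + mu_max * e

# Example: 1M evts/s max throughput, 90% target utilization, 1.2M arrivals
# and a 100K-event backlog give U(t) = 1.3, so the controller requests a
# throughput above the arrival rate to drain the queue; the excess over the
# measured capacity mu_c(t) becomes the dropping rate mu_d(t).
mu_next = requested_throughput(lam=1_200_000, queue=100_000,
                               mu_max=1_000_000, u_target=0.9)
print(round(mu_next))  # 1600000
```

Note the two contributions from the slide's formulation: λ(t) tracks the load directly, while the μmax · e(t) term is the feedback correction that reacts to deviations from the target utilization.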
Policy wrapper and policies
• The policy wrapper provides the policies with access to per-process statistics, the requested throughput μ(t+1) and the system capacity μc(t)
10
[Diagram: Load Manager, Policies, LS Filter, shedding plan (SP), metrics correction]

Baseline policy
• Compute one LS probability for all processes (from μ(t+1) and μc(t))

Fair policy
• Assign to each process the "same" number of events
• Saves the metrics of small processes while still producing accurate results for big ones

Priority-based policy
• Assign a static priority to each process
• Compute a weighted priority to partition the system capacity
• Assign a partition to each process and compute the probabilities
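The three policies can be sketched as drop-probability computations. The slides fix only the inputs (μ(t+1), μc(t), per-process event rates and priorities); the exact formulas, function names and numbers below are assumptions for illustration.

```python
# Sketch of the three load-shedding policies as per-process drop
# probabilities. Formulas, names and numbers are illustrative assumptions.

def baseline(mu_next, mu_c):
    """One LS probability for all processes: drop the fraction of the
    requested throughput that exceeds the measured capacity."""
    return max(0.0, 1.0 - mu_c / mu_next)

def fair(rates, capacity):
    """Give every process the 'same' event budget: small processes are kept
    intact, big ones absorb most of the shedding (rates must be > 0)."""
    share = capacity / len(rates)
    return {proc: max(0.0, 1.0 - share / rate) for proc, rate in rates.items()}

def priority_based(rates, priorities, capacity):
    """Partition the capacity proportionally to the weighted priorities,
    then derive one drop probability per partition."""
    total = sum(priorities.values())
    return {proc: max(0.0, 1.0 - capacity * priorities[proc] / total / rates[proc])
            for proc in rates}

print(baseline(1_600_000, 1_000_000))                          # 0.375
print(fair({"nginx": 800_000, "postmark": 200_000}, 600_000))  # only nginx sheds
print(priority_based({"nginx": 800_000, "fio": 400_000},
                     {"nginx": 3, "fio": 1}, 800_000))
```

In the fair example the small process (postmark) fits entirely inside its share and gets probability 0, while nginx absorbs all the shedding, matching the "save metrics of small processes" goal.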
Load Shedding Filter
• The Load Shedding Filter applies the probabilities 

computed by the policies to the input stream
• For each event:
• Look for load shedding probability depending on input class
• If no data is found we can drop the event
• Otherwise, apply the Load Shedding probability computed by the policy
• The dropped events are reported to the application for metrics correction
11
[Figure: Event Capture feeds the Load Shedding Filter, which reads per-class drop probabilities from the Shedding Plan; accepted events (ok) go to the event buffers, rejected ones (ko) are counted for metrics correction]
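The per-event decision described above can be sketched as follows; the names are hypothetical and the real filter runs inside the agent's capture path:

```python
import random
from collections import defaultdict

# Per-class drop counts, reported back to the application for metrics correction.
dropped = defaultdict(int)

def keep_event(event_cls, plan, rng=random.random):
    """Return True if an event of input class `event_cls` survives shedding.
    `plan` maps input classes to drop probabilities (the shedding plan)."""
    p_drop = plan.get(event_cls)
    if p_drop is None:
        dropped[event_cls] += 1   # no probability for this class yet: drop
        return False
    if rng() < p_drop:
        dropped[event_cls] += 1   # shed with the policy's probability
        return False
    return True
```

The `dropped` counts are what make the metrics correction possible: the application can rescale the surviving measurements (e.g. multiply an observed volume by 1/(1 − p)) before emitting them.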
Experimental setup 12
• We evaluated FFWD within Sysdig with 2 goals:
• System stability (slide 13)
• Output quality (slides 14–17)
• Results compared with Sysdig's reference filtering system
• Evaluation setup:
• 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
• 128 GB DDR4 RAM
• Tests selected from the Phoronix test suite
Syscall-intensive benchmarks from the Phoronix test suite:

Homogeneous benchmarks
test ID | name       | priority | # evts/s
A       | nginx      | 3        | 800K
B       | postmark   | 4        | 1.2M
C       | fio        | 4        | 1.3M
D       | simplefile | 2        | 1.5M
E       | apache     | 2        | 1.9M

Heterogeneous benchmarks
test ID | instances                       | # evts/s
F       | 3x nginx, 1x fio                | 1.3M
G       | 1x nginx, 1x simplefile         | 1.3M
H       | 1x apache, 2x postmark, 1x fio  | 1.8M
System stability 13
• We evaluated the Load Manager with all the tests (A–H)
• With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity)
• Measuring the CPU load of the Sysdig agent with:
• the reference implementation
• FFWD with the fair and priority policies
• We compared the actual CPU load with the QoS requirement (Ut)
• Error measured with MAPE (lower is better), obtained by running each benchmark 20 times
• 3.51x average MAPE improvement, with average MAPE below 5%
Ut = 1.1%
Test | reference | fair  | priority
A    | 7.12%     | 1.78% | 3.78%
B    | 34.06%    | 4.37% | 4.46%
C    | 28.03%    | 2.27% | 2.24%
D    | 11.52%    | 1.41% | 1.54%
E    | 26.02%    | 8.51% | 8.99%
F    | 22.67%    | 8.11% | 3.74%
G    | 16.42%    | 3.37% | 2.73%
H    | 19.92%    | 8.41% | 8.01%
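The MAPE score used throughout these slides can be computed as follows (a straightforward sketch, not taken from the original tooling):

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error between an exact metric series and
    its approximated counterpart; lower is better."""
    if len(exact) != len(approx) or not exact:
        raise ValueError("series must be non-empty and the same length")
    return 100.0 * sum(abs((e - a) / e) for e, a in zip(exact, approx)) / len(exact)
```

For instance, `mape([100, 200], [90, 220])` evaluates to 10.0 (two 10% errors).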
Output quality - heterogeneous
• We mixed the homogeneous tests to:
• simulate a co-located environment
• add OS scheduling uncertainty and noise
• QoS requirement Ut = 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Metrics compared across reference, FFWD fair, FFWD priority
• Three tests with different syscall mixes:
• Network-based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
• Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
• Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
14
1x Fio, 3x Nginx, 1.3M evt/s 15
[Figure: latency metrics and volume metrics (bytes r/w), file and network, for fio and the three nginx instances; MAPE(%) on a log scale, lower is better, comparing reference (kernel-drop), fair and priority]
1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 16
[Figure: latency metrics and volume metrics (bytes r/w), file and network, for apache, fio and the two postmark instances; MAPE(%) on a log scale, lower is better, comparing reference (kernel-drop), fair and priority]
17
[Figure: latency and volume MAPE charts for test H (apache, fio, 2x postmark); MAPE(%) on a log scale, lower is better, comparing reference (kernel-drop), fair and priority]
Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
• The Fair policy outperforms the reference in almost all cases
• the LS Filter works at the single-event level
• the reference drops events in batches
• The Priority policy improves on the Fair policy in most cases
• prioritized processes are privileged
• the other processes are treated as “best-effort”
Conclusion
• We saw the main challenges of Load Shedding for container monitoring
– Low overhead monitoring
– High quality and granularity of metrics
• Fast Forward With Degradation (FFWD)
– Heuristic controller for bounded CPU usage
– Pluggable policies for domain-specific load shedding
– Accurate computation of output metrics
– Load Shedding Filter for fast drop of events
18
19
Questions?
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D.
Santambrogio. In Proceedings of 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
20
BACKUP SLIDES
21
Output quality - homogeneous
• QoS requirement Ut = 1.1%, the standard set point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• FFWD fair policy achieves similar or better results w.r.t. the reference
• FFWD stays accurate even though it drops more events
• Predictable and repetitive behavior of nginx, fio and apache
22
[Figure: MAPE(%) (log scale, lower is better) for latency-file, latency-net, volume-file and volume-net, comparing reference (kernel-drop) and fair, for apache (1.9M evt/s), postmark (1.2M evt/s), simplefile (1.5M evt/s), fio (1.3M evt/s) and nginx (800K evt/s)]
1x simplefile, 1x nginx, 1.3M evt/s 23
[Figure: latency metrics and volume metrics (bytes r/w) for simplefile and nginx; MAPE(%) on a log scale, lower is better, comparing reference (kernel-drop), fair and priority]
Response time Load Manager 24
• The system can be characterized by its response time and the jobs in the system S (Little's Law)
• Control error: derived from the old response time and the target response time
• Requested throughput: derived from the arrival rate and the control error
• The requested throughput is used by the load-shedding policies to derive the LS probabilities
[Equations on slides 24–26 were rendered as figures in the original]
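The equations on these slides were images in the original deck; the following is one plausible reconstruction, consistent with the labels above (Little's Law, jobs in the system, old vs. target response time, arrival rate), offered as an assumption rather than the original formulation:

```latex
\begin{align*}
S(t)     &= \lambda(t)\,R(t)              && \text{jobs in the system (Little's Law)}\\
e(t)     &= S(t) - \lambda(t)\,R_t        && \text{control error: old vs.\ target response time}\\
\mu(t+1) &= \lambda(t) + \frac{e(t)}{R_t} && \text{requested throughput, fed to the LS policies}
\end{align*}
```

Intuitively, when the measured response time $R(t)$ exceeds the target $R_t$, the excess jobs in the system inflate $\mu(t+1)$ above the arrival rate, and the policies shed enough events to drain the backlog within the target response time.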
Case studies 27
System monitoring [2]
• Goal: distributed monitoring of systems and applications via syscalls
• Constraint: CPU utilization
• Based on: the Sysdig monitoring agent
• Output: aggregated performance metrics for applications, containers and hosts
• FFWD ensures low CPU overhead
• policies based on the processes in the system
Sentiment analysis [1]
• Goal: perform real-time analysis on tweets
• Constraint: latency
• Based on: the Stanford NLP toolkit
• Output: aggregated sentiment score for each keyword and hashtag
• FFWD keeps the response time bounded
• policies on tweet keywords and #hashtags
[1] http://nlp.stanford.edu [2] http://www.sysdig.org
Real-time sentiment analysis 29
• Real-time sentiment analysis makes it possible to:
– Track the sentiment of a topic over time
– Correlate real-world events and the related sentiment, e.g.
• Toyota crisis (2010) [1]
• 2012 US Presidential Election Cycle [2]
– Track the online evolution of companies' reputation, derive social
profiling and enable enhanced social marketing strategies
[1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research:
Workshop and Conference Proceedings Series. 2011.
[2] Wang, Hao, et al. "A system for real-time twitter sentiment analysis of 2012 us presidential election cycle." Proceedings of the ACL
2012 System Demonstrations.
Sentiment analysis: case study 30
• Simple Twitter streaming sentiment analyzer with Stanford NLP
• System components:
– Event producer
– RabbitMQ queue
– Event consumer
• Consumer components:
– Event Capture
– Sentiment Analyzer
– Sentiment Aggregator
• Real-time queue consumption, aggregated metrics emission each second
(keywords and hashtag sentiment)
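A minimal sketch of the consumer's aggregation step is shown below; the class name and interface are illustrative, since the slide does not show the actual Sentiment Aggregator code:

```python
from collections import defaultdict

class SentimentAggregator:
    """Accumulate per-keyword/hashtag sentiment scores and emit the
    aggregated (average) score once per second."""
    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def add(self, keyword, score):
        """Record the sentiment score of one analyzed tweet."""
        self._sum[keyword] += score
        self._count[keyword] += 1

    def emit(self):
        """Return the per-keyword averages and reset the one-second window."""
        out = {k: self._sum[k] / self._count[k] for k in self._count}
        self._sum.clear()
        self._count.clear()
        return out
```

A driver would call `add()` for every tweet that survives the load-shedding filter and `emit()` on a one-second timer.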
FFWD: Sentiment analysis 31
• FFWD adds four components:
– A load-shedding filter at the beginning of the pipeline
– The shedding plan used by the filter
– A domain-specific policy wrapper
– An application-level controller (the Load Manager) to detect load peaks
[Figure: the Producer feeds input tweets into a real-time queue; the Load Shedding Filter consults the Shedding Plan (drop probability) and passes accepted events (ok) to Event Capture, Sentiment Analyzer and Sentiment Aggregator, counting rejected ones (ko); the Policy Wrapper updates the plan from stream stats and μ(t+1), while the Load Manager closes the loop on R(t), Rt and λ(t)]
Sentiment - experimental setup 32
• Separate tests to understand FFWD behavior:
– System stability
– Output quality
• Dataset: 900K tweets from the 35th week of the Premier League
• Performed tests:
– Controller: synthetic and real tweets at various λ(t)
– Policy: real tweets at various λ(t)
• Evaluation setup:
– Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
– 8 GB RAM @ 1600 MHz
System stability 33
λ(t) estimation:
• case A: λ(t) = λ(t−1)
• case B: λ(t) = avg(λ(t))
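The two arrival-rate estimators can be sketched as follows (illustrative; the original slide only gives the two formulas):

```python
def lambda_case_a(rates):
    """Case A: estimate the next arrival rate as the last observed one."""
    return rates[-1]

def lambda_case_b(rates):
    """Case B: estimate the next arrival rate as the running average
    of all observed rates."""
    return sum(rates) / len(rates)
```

Case A reacts immediately to load spikes but is noisy; case B smooths noise at the cost of lagging behind sudden changes.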
Load Manager showcase (1)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– response time:
34
[Figure: controller performance, response time R over 300 s against the QoS = 5 s target]
Load Manager showcase (2)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– throughput:
35
[Figure: actuation over 300 s, showing λ (lambda), dropped, computed and μ (mu) events per second]
Output Quality 36
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies:
• Baseline
• Fair
• Priority
• R = 5s; λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage Error (MAPE %) (lower is better)
[Figure: MAPE(%) for keyword groups A–D under the baseline, fair and priority policies, at λ(t) = 100, 200 and 400 evt/s]
Self-adaptive container monitoring with performance-aware Load-Shedding policies

  • 1. 1 Self-adaptive container monitoring with performance-aware load-shedding policies NECST Group Conference 2017 @ Pinterest 06/05/2017 Rolando Brondolin rolando.brondolin@polimi.it DEIB, Politecnico di Milano
  • 2. Cloud trends • 2017 State of the cloud [1]: – 79% of workloads run in cloud (41% public, 38% private) – Operations focusing on: • moving more workloads to cloud • existing cloud usage optimization (cost reduction) 2 [1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale
  • 3. Cloud trends • 2017 State of the cloud [1]: – 79% of workloads run in cloud (41% public, 38% private) – Operations focusing on: • moving more workloads to cloud • existing cloud usage optimization (cost reduction) 2 • Nowadays Docker is becoming the de-facto standard for Cloud deployments – lightweight abstraction on system resources – fast deployment, management and maintenance – large deployments and automatic orchestration [1] Cloud Computing Trends: 2017 State of the Cloud Survey, Kim Weins, Rightscale
  • 5. 3
  • 6. 3 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s
  • 7. Infrastructure monitoring (1) • Container complexity demands strong monitoring capabilities – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption 4 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s
  • 8. Infrastructure monitoring (1) • Container complexity demands strong monitoring capabilities – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption 4 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 9. • Container complexity demands strong monitoring capabilities – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption few information on system state cheap monitoring high visibility on system state non negligible cost Infrastructure monitoring (2) 5 #requests/s heap size CPU usage Q(t) λ(t) μ(t) #store/s #load/s VS High data granularity Good data granularity High data granularity Code instrumentation Code instrumentation No instrumentation Low metrics rate High metrics rate High metrics rate
  • 11. Sysdig Cloud monitoring 6 http://www.sysdig.org • Infrastructure for container monitoring • Collects aggregated metrics and shows system state: – “Drill-down” from cluster to single application metrics – Dynamic network topology – Alerting and anomaly detection • Monitoring agent deployed on each machine in the cluster – Traces system calls in a “streaming fashion” – Aggregates data for Threads, FDs, applications, containers and hosts
  • 12. Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 7 S λ(t) φ(t) μ(t) Λ Φ Q
  • 13. Cause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 7 Events arrive at really high frequency S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q [Excerpt: a streaming system with queue, processing element and streaming output flow. A server S, fed by a queue Q, is in overloading when the arrival rate λ(t) is greater than the service rate μ(t). The stability condition μ(t) ≥ λ(t) (2.1) states the necessary and sufficient condition to avoid overloading. A system experiencing overloading should discard part of the input so that the service rate matches the arrival rate λ(t). The aim is twofold, as we are interested not only in controlling overloading but also in maximizing the accuracy of the estimated metrics, comparing the input flow x at a given time t with the reduced input flow x̃ considered in case of overloading at the same time t.]
  • 14. EffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 7 Events arrives at really high frequency Queues grow indefinitely S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If
  • 15. Problem definition 7
• The Sysdig Cloud agent can be modelled as a server S with a finite queue Q
• characterized by its arrival rate λ(t) and its service rate μ(t)
• Subject to overloading conditions
– Cause: events arrive at a very high frequency
– Effect: queues grow indefinitely
– Issues: high usage of system resources, uncontrolled loss of events, output quality degradation
• A server S, fed by a queue Q, is in overloading when the arrival rate λ(t) is greater than the service rate μ(t); the stability condition μ(t) ≥ λ(t) (2.1) is the necessary and sufficient condition to avoid overloading, so a system experiencing overloading should discard part of its input to match the arrival rate λ(t)
• The goal is twofold: controlling the load, but also maximizing the accuracy of the estimated metrics
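The effect of the stability condition can be sketched with a toy queue simulation (illustrative Python, not part of the Sysdig agent): when μ(t) ≥ λ(t) the backlog stays bounded, otherwise it grows without limit.

```python
# Toy simulation (not Sysdig code) of the queue evolution Q(t) = Q(t-1) + lambda - mu:
# when the service rate covers the arrivals the backlog stays bounded; when the
# arrival rate exceeds capacity the queue grows indefinitely.

def queue_evolution(arrivals, capacity, steps):
    """Backlog at each step, for a constant arrival rate and service capacity."""
    q, history = 0, []
    for _ in range(steps):
        q = max(0, q + arrivals - capacity)  # residual events, never negative
        history.append(q)
    return history

print(queue_evolution(arrivals=900, capacity=1000, steps=5))   # [0, 0, 0, 0, 0]
print(queue_evolution(arrivals=1500, capacity=1000, steps=5))  # [500, 1000, 1500, 2000, 2500]
```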
  • 21. Proposed solution: FFWD 8
• Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques: a general approach that leverages domain-specific details
– Load Manager (*when*): mitigates high usage of system resources
– Policy wrapper (*how much*): minimizes output quality degradation via the shedding plan
– LS Filter (*where*): avoids uncontrolled loss of events
– Aggregated metrics correction compensates for the dropped events
  • 22. Load Manager 9 (components: Load Manager, Policies, LS Filter, SP, metrics correction)
• The Load Manager computes the throughput μ(t) that ensures the stability condition λ(t) ≤ μ(t) (4.7)
• Its actuation μ(t+1) is the sum of two contributions: as the error e(t) goes to zero the stability condition (4.7) is met, while the term λ(t) ensures fast actuation on significant deviations from equilibrium
• The arrival rate λ(t) can vary unpredictably and exceed the system capacity μc(t), defined as the rate of events processed per second and estimated as the number of events analyzed in the last time period
• Given the control action μ(t), the dropping rate μd(t) of the load shedder satisfies μ(t) = μc(t−1) + μd(t) (4.8): the service rate is the estimated capacity plus the events that must be dropped to achieve stability
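Equation (4.8) rearranged can be sketched as follows (illustrative names, not the agent's actual API): the dropping rate is the part of the requested service rate that the measured capacity cannot cover.

```python
# Sketch of equation (4.8) rearranged (illustrative names): the dropping rate
# mu_d(t) is the part of the requested service rate mu(t) that the capacity
# measured in the last period, mu_c(t-1), cannot cover.

def required_drop_rate(mu_requested, capacity_last_period):
    """mu_d(t) = mu(t) - mu_c(t-1), clamped at zero when capacity suffices."""
    return max(0, mu_requested - capacity_last_period)

print(required_drop_rate(1_200_000, 1_000_000))  # 200000 events/s must be shed
print(required_drop_rate(800_000, 1_000_000))    # 0: no shedding needed
```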
  • 27. Utilization-based Load Manager
• The system is modeled by means of queuing theory: a single server node fed by a queue, with a variable arrival rate λ(t) and a service rate μ(t), both measured in events per second (input and serviced events)
• The simplest model of the system behavior is Little’s law (1), which states that the number of jobs inside the system equals the arrival rate times the response time:
– N(t) = λ(t) · R(t) (1)
– Q(t) = Q(t−1) + λ(t) − μ(t) (2)
– U(t) = λ(t)/μmax + Q(t)/μmax (3)
– Q(t) = μmax · U(t) − λ(t) (4)
• Control error: e(t) = U(t) − Ū (5)
• Requested throughput: μ(t+1) = λ(t) + μmax · e(t) (6)
• where λ(t) counts the arrived events, Q(t) the residual events, U(t) is the current CPU utilization, Ū the target utilization and μmax the maximum theoretical throughput
• The system can be characterized by its utilization and its queue size; the requested throughput is used by the load-shedding policies to derive the LS probabilities
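A minimal sketch of the utilization-based control law, with illustrative names (`u_target` stands for Ū, `mu_max` for μmax):

```python
# Minimal sketch of the utilization-based control law (illustrative names):
#   U(t)    = lambda(t)/mu_max + Q(t)/mu_max   (arrived plus residual events)
#   e(t)    = U(t) - U_target                  (control error)
#   mu(t+1) = lambda(t) + mu_max * e(t)        (requested throughput)

def next_throughput(lam, queue, mu_max, u_target):
    u = (lam + queue) / mu_max  # current utilization U(t)
    e = u - u_target            # control error e(t)
    return lam + mu_max * e     # requested throughput mu(t+1)

# Overloaded case: arrivals plus backlog exceed the utilization budget, so the
# requested throughput rises above the arrival rate in order to drain the queue.
print(next_throughput(lam=1000.0, queue=500.0, mu_max=1000.0, u_target=1.0))  # 1500.0
```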
  • 31. Policy wrapper and policies 10
• The policy wrapper provides access to per-process statistics, the requested throughput μ(t+1) and the system capacity μc(t)
• Baseline policy
– Compute one LS probability for all processes (from μ(t+1) and μc(t))
• Fair policy
– Assign to each process the “same” number of events
– Saves the metrics of small processes while keeping accurate results on big ones
• Priority-based policy
– Assign a static priority to each process
– Compute a weighted priority to partition the system capacity
– Assign a partition to each process and compute the probabilities
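The three policies can be sketched as follows (an illustrative approximation of the descriptions above, not the FFWD sources): each policy turns per-process arrival rates and an event budget into per-process keep probabilities, with drop probability 1 − keep.

```python
# Illustrative sketch of the three shedding policies (approximating the slide's
# wording, not the FFWD source code): map per-process arrival rates and an event
# budget to per-process keep probabilities (drop probability = 1 - keep).

def baseline(rates, budget):
    """One LS probability shared by all processes."""
    keep = min(1.0, budget / sum(rates.values()))
    return {p: keep for p in rates}

def fair(rates, budget):
    """Same event budget per process: small processes are fully preserved."""
    share = budget / len(rates)
    return {p: min(1.0, share / r) for p, r in rates.items()}

def priority(rates, prios, budget):
    """Partition the budget proportionally to static per-process priorities."""
    total_w = sum(prios.values())
    return {p: min(1.0, budget * prios[p] / total_w / rates[p]) for p in rates}

rates = {"nginx": 800, "fio": 200}  # hypothetical events/s per process
print(baseline(rates, budget=500))                   # both processes kept at 0.5
print(fair(rates, budget=500))                       # fio fully preserved
print(priority(rates, {"nginx": 3, "fio": 1}, 500))  # budget skewed to nginx
```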
  • 34. Load Shedding Filter 11
• The Load Shedding Filter applies the probabilities computed by the policies to the input stream
• For each event:
– look up the load-shedding probability for the event’s input class in the shedding plan
– if no entry is found, the event can be dropped
– otherwise, apply the load-shedding probability computed by the policy
• Dropped events are reported to the application for metrics correction
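The per-event filter step can be sketched like this (illustrative Python; the real filter runs inside the agent's capture path):

```python
# Sketch of the per-event filter step (illustrative structure, not the agent code):
# look up the keep probability for the event's input class in the shedding plan;
# classes without an entry are dropped, and drop counts are reported back so the
# application can correct the aggregated metrics.
import random

def filter_events(events, plan, rng):
    kept, dropped = [], {}
    for event_class, payload in events:
        keep_p = plan.get(event_class)  # probability computed by the policy
        if keep_p is not None and rng.random() < keep_p:
            kept.append((event_class, payload))
        else:  # unknown class, or shed by probability
            dropped[event_class] = dropped.get(event_class, 0) + 1
    return kept, dropped

plan = {"read": 1.0, "write": 0.5}  # hypothetical shedding plan
events = [("read", 1), ("write", 2), ("unknown", 3)]
kept, dropped = filter_events(events, plan, random.Random(42))
print(kept)     # every "read" is kept; "write" survives with probability 0.5
print(dropped)  # "unknown" has no plan entry, so it is always dropped
```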
  • 36. Experimental setup 12
• We evaluated FFWD within Sysdig with two goals:
– system stability (slide 13)
– output quality (slides 14, 15, 16, 17)
• Results compared with the reference filtering system of Sysdig
• Evaluation setup:
– 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
– 128 GB DDR4 RAM
– syscall-intensive benchmarks selected from the Phoronix test suite
• Homogeneous benchmarks (test ID, name, priority, #evts/s):
– A: nginx, priority 3, 800K; B: postmark, priority 4, 1.2M; C: fio, priority 4, 1.3M; D: simplefile, priority 2, 1.5M; E: apache, priority 2, 1.9M
• Heterogeneous benchmarks (test ID, instances, #evts/s):
– F: 3x nginx + 1x fio, 1.3M; G: 1x nginx + 1x simplefile, 1.3M; H: 1x apache + 2x postmark + 1x fio, 1.8M
  • 39. System stability 13
• We evaluated the Load Manager with all the tests (A–H)
– with 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity)
– measuring the CPU load of the Sysdig agent with the reference implementation and with FFWD under the fair and priority policies
• We compared the actual CPU load with the QoS requirement (Ut)
• Error measured with MAPE (lower is better), obtained by running each benchmark 20 times
• 3.51x average MAPE improvement; average MAPE below 5%
• Results for Ut = 1.1% (MAPE, reference / fair / priority):
– A: 7.12% / 1.78% / 3.78%; B: 34.06% / 4.37% / 4.46%; C: 28.03% / 2.27% / 2.24%; D: 11.52% / 1.41% / 1.54%; E: 26.02% / 8.51% / 8.99%; F: 22.67% / 8.11% / 3.74%; G: 16.42% / 3.37% / 2.73%; H: 19.92% / 8.41% / 8.01%
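MAPE, the error measure used throughout the evaluation, can be computed as follows (a minimal sketch; the samples and set point are illustrative, not measured data):

```python
# MAPE (mean absolute percentage error) against a set point; lower is better.
# The samples below are illustrative agent CPU-load measurements, not real data.

def mape(actual, target):
    """Average percentage deviation of the measurements from the target."""
    return 100.0 * sum(abs(a - target) / target for a in actual) / len(actual)

samples = [1.1, 1.2, 1.0, 1.15]             # CPU load (%) over four periods
print(round(mape(samples, target=1.1), 2))  # -> 5.68
```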
  • 42. Output quality - heterogeneous 14
• We mixed the homogeneous tests to:
– simulate a co-located environment
– add OS scheduling uncertainty and noise
• QoS requirement Ut = 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Metrics compared across reference, FFWD fair and FFWD priority
• Three tests with different syscall mixes:
– network-based, mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
– mixed, mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
– mixed, high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
 • 44. 1x Fio, 3x Nginx, 1.3M evt/s 15
[Figure: MAPE (%, log scale, lower is better) of latency and volume (byte r/w) metrics, file and network, for fio and nginx-1/2/3; kernel-drop reference vs. fair and priority policies]
 • 47. 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 16
[Figure: MAPE (%, log scale, lower is better) of latency and volume (byte r/w) metrics, file and network, for apache, fio and postmark-1/2; kernel-drop reference vs. fair and priority policies]
 • 51. 17
Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
[Figure: MAPE (%, log scale, lower is better) of latency and volume (byte r/w) metrics per workload; kernel-drop reference vs. fair and priority policies]
• The fair policy outperforms the reference in almost all cases
  • the LS Filter works at the single-event level
  • the reference drops events in batches
• The priority policy improves on the fair policy results in most cases
  • prioritized processes are privileged
  • the other processes are treated as "best-effort"
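The fair/priority behavior summarized above can be sketched as a per-process drop-probability plan. This is a hedged illustration, not the agent's implementation: the function names, the equal-split rule and the drop-probability form are my assumptions; only the overall idea (fair share for everyone vs. full service for prioritized processes, best-effort for the rest) comes from the slides.

```python
def fair_plan(rates, mu_req):
    """Fair policy sketch: split the requested throughput equally among
    processes, then turn each share into a per-process drop probability."""
    share = mu_req / len(rates)
    return {p: max(0.0, 1.0 - share / lam) for p, lam in rates.items()}

def priority_plan(rates, mu_req, prioritized):
    """Priority policy sketch: serve prioritized processes in full first,
    then split what is left among the best-effort processes."""
    plan, left = {}, mu_req
    for p in prioritized:
        served = min(rates[p], left)      # privileged: as much as possible
        plan[p] = 1.0 - served / rates[p]
        left -= served
    rest = [p for p in rates if p not in prioritized]
    if rest:
        share = max(0.0, left) / len(rest)  # best-effort share
        for p in rest:
            plan[p] = max(0.0, 1.0 - share / rates[p])
    return plan
```

For example, with nginx at 100 evt/s and fio at 300 evt/s and a requested throughput of 200 evt/s, the fair plan drops nothing from nginx and about two thirds of fio's events, while prioritizing fio shifts almost all drops onto nginx.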
 • 53. Conclusion 18
• We saw the main challenges of load shedding for container monitoring
  – low-overhead monitoring
  – high quality and granularity of metrics
• Fast Forward With Degradation (FFWD)
  – heuristic controller for bounded CPU usage
  – pluggable policies for domain-specific load shedding
  – accurate computation of output metrics
  – Load Shedding Filter for fast dropping of events
 • 54. Questions? 19
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D. Santambrogio. In Proceedings of the 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016).
 • 55. Backup slides 20
 • 58. Output quality - homogeneous 22
• QoS requirement Ut = 1.1%, the standard set point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• FFWD's fair policy achieves similar or better results w.r.t. the reference
  • FFWD stays accurate even though it drops more events
  • predictable and repetitive behavior of nginx, fio and apache
[Figure: per-workload MAPE (%, log scale) of latency/volume, file/network metrics, kernel-drop reference vs. fair policy — apache 1.9M evt/s, postmark 1.2M evt/s, simplefile 1.5M evt/s, fio 1.3M evt/s, nginx 800K evt/s]
 • 62. 1x simplefile, 1x nginx, 1.3M evt/s 23
[Figure: MAPE (%, log scale, lower is better) of latency and volume (byte r/w) metrics, file and network, for simplefile and nginx; kernel-drop reference vs. fair and priority policies]
 • 66. Load Manager 24
• The system is characterized by its response time R(t) and by the jobs in the system Q(t), related by Little's Law
• The control error compares the measured response time against the target response time Rt
• The requested throughput is derived from the arrival rate λ(t) and the control error, and is used by the load-shedding policies to derive the LS probabilities
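The quantities on this slide can be tied together in a small sketch. The deck's exact formulas are images lost in this transcript, so the relations below (Little's Law plus a "drain the queue within the target" heuristic) are my reconstruction of the idea, not the paper's actual controller:

```python
def load_manager_step(q_jobs, lam, r_target):
    """One Load Manager step (reconstructed sketch, not the exact FFWD
    controller).  Returns the estimated response time, the control error
    and the throughput to request from the load-shedding policies."""
    # Little's Law: Q(t) = lambda(t) * R(t)  =>  R(t) = Q(t) / lambda(t)
    r_now = q_jobs / lam if lam > 0 else 0.0
    # control error between the QoS target and the measured response time
    error = r_target - r_now
    # requested throughput: the service rate that keeps Q(t) jobs within
    # the target response time (Q = mu * R  =>  mu = Q / R_target)
    mu_next = q_jobs / r_target
    return r_now, error, mu_next
```

For example, 1000 queued events at an arrival rate of 100 evt/s give a 10 s response time; with a 5 s target the sketch asks the policies for 200 evt/s of effective throughput, which they realize by dropping events.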
 • 69. Case studies 27
System monitoring [2]
• Goal: distributed monitoring of systems and applications via syscalls
• Constraint: CPU utilization
• Based on: Sysdig monitoring agent
• Output: aggregated performance metrics for applications, containers, hosts
• FFWD ensures low CPU overhead
  • policies based on the processes in the system
Sentiment analysis [1]
• Goal: perform real-time analysis on tweets
• Constraint: latency
• Based on: Stanford NLP toolkit
• Output: aggregated sentiment score for each keyword and hashtag
• FFWD keeps the response time bounded
  • policies on tweet keywords and #hashtags
[1] http://nlp.stanford.edu [2] http://www.sysdig.org
 • 71. Real-time sentiment analysis 29
• Real-time sentiment analysis makes it possible to:
  – track the sentiment of a topic over time
  – correlate real-world events with the related sentiment, e.g.
    • Toyota crisis (2010) [1]
    • 2012 US Presidential Election Cycle [2]
  – track the online evolution of companies' reputation, derive social profiling and enable enhanced social-marketing strategies
[1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series, 2011.
[2] Wang, Hao, et al. "A system for real-time Twitter sentiment analysis of the 2012 US presidential election cycle." Proceedings of the ACL 2012 System Demonstrations.
 • 72. Sentiment analysis: case study 30
• Simple Twitter streaming sentiment analyzer built with Stanford NLP
• System components:
  – event producer
  – RabbitMQ queue
  – event consumer
• Consumer components:
  – Event Capture
  – Sentiment Analyzer
  – Sentiment Aggregator
• Real-time queue consumption, aggregated metrics emitted every second (keyword and hashtag sentiment)
 • 73. FFWD: Sentiment analysis 31
• FFWD adds four components:
  – load shedding filter at the beginning of the pipeline
  – shedding plan used by the filter
  – domain-specific policy wrapper
  – application controller (Load Manager) to detect load peaks
[Diagram: the producer feeds a real-time queue; the Load Shedding Filter consults the shedding plan to pass (ok) or drop (ko) each event before Event Capture, Sentiment Analyzer and Sentiment Aggregator; the Policy Wrapper and Load Manager update the plan's drop probabilities from the stream stats, R(t), λ(t) and μ(t+1)]
 • 74. Sentiment - experimental setup 32
• Separate tests to understand FFWD behavior:
  – system stability
  – output quality
• Dataset: 900K tweets from the 35th week of the Premier League
• Performed tests:
  – controller: synthetic and real tweets at various λ(t)
  – policy: real tweets at various λ(t)
• Evaluation setup:
  – Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
  – 8 GB RAM @ 1600 MHz
 • 75. System stability 33
λ(t) estimation:
  case A: λ(t) = λ(t-1)
  case B: λ(t) = avg(λ(t))
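The two estimation cases read as two simple predictors for the next arrival rate; a sketch (the averaging-window length for case B is an assumption, the deck does not state it):

```python
def estimate_lambda(history, case="A", window=5):
    """lambda(t) predictors from the stability tests: case A reuses the
    last observed arrival rate, case B averages recent observations."""
    if case == "A":
        return history[-1]            # lambda(t) = lambda(t-1)
    recent = history[-window:]        # lambda(t) = avg of recent samples
    return sum(recent) / len(recent)
```

Case A reacts instantly to load changes but passes measurement noise straight through; case B smooths the noise at the cost of lagging behind sudden load peaks.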
 • 76. Load Manager showcase (1) 34
• Load Manager demo (Rt = 5s):
  – λ(t) increased after 60s and 240s
  – response time:
[Figure "Controller performance": response time (s) over 300 s against the QoS target R = 5 s]
 • 77. Load Manager showcase (2) 35
• Load Manager demo (Rt = 5s):
  – λ(t) increased after 60s and 240s
  – throughput:
[Figure "Actuation": events per second over 300 s — lambda, dropped, computed mu]
 • 78. Output Quality 36
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies: baseline, fair, priority
• R = 5s; λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage Error (MAPE %, lower is better)
[Figure: MAPE (%) per keyword group A–D for the baseline, fair and priority policies at λ(t) = 100, 200 and 400 evt/s]