Vladimir Podolskiy*, Michael Mayo**, Abigail Koay**,
Michael Gerndt*, Panos Patros**
*Technical University of Munich (TUM), Germany
**University of Waikato, New Zealand
IEEE SASO 2019
Umeå, Sweden, June 18th 2019
Full Paper
Cloud-based Adaptation
Maintaining SLOs of Cloud-native Applications
via Self-Adaptive Resource Sharing
2Panos Patros & Vladimir Podolskiy | Maintaining SLOs of Cloud-native Applications via Self-Adaptive…
We are…
Vladimir Podolskiy
3rd-year PhD, TUM
Predictive
autoscaling and
anomaly detection
Panos Patros
Lecturer in
Software
Engineering
Head of ORCA lab
• ORCA Lab at Waikato, NZ
• Started in Jan 2019
• 1 Research Assistant
• 9 Research Students
• 7 Faculty
• 2 Interns
• 8 International Collaborators (Canada, Germany, USA)
• New graduate-level course (70% Project Component)
• COMPX529 Engineering Self-Adaptive Systems
Oceania Researchers in Cloud and Adaptive-systems
Ohu Rangahau Kapua Aunoa
• Background:
 Containerization & Cloud Computing
 Resource Sharing & Container Orchestration via Kubernetes
 Machine Learning & Lasso Regression
• Motivation of the Study
• Research Problem
• Data Collection
• Proposed Approach:
 Anomalies Removal
 Prediction of the Service Level Indicators
 SLO-compliant Resource Allocation
• Evaluation:
 Method
 Results
 Limitations
• Usage Scenario: Augmenting Vertical Pods Autoscaler (VPA) of Kubernetes
• Conclusions & Future Work
Contents
Introduction:
Background
Motivation
Research Problem
• Cloud
• Abstracts Computing Resources
• Containers
• Low-overhead OS-level virtualization
• Multitenancy
• Containerization of apps
• Kubernetes
• Orchestration of app containers
• Resource Management
• Soft (e.g. CPU shares) and hard (e.g. CFS quota) limits
• Machine Learning
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Linear regression with an L1 penalty; shrinks coefficients towards zero
Background
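For reference, the LASSO estimate is usually written as the penalized least-squares problem below. This is the standard textbook form; the symbols $x_k$, $y_k$, $\beta_0$, $\beta$, $\lambda$ and $N$ are notation introduced here, not taken from the slides.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \;
  \frac{1}{2N} \sum_{k=1}^{N} \left( y_k - \beta_0 - x_k^{\top}\beta \right)^2
  + \lambda \lVert \beta \rVert_1
```

A larger $\lambda$ pushes more coefficients to exactly zero, which is why LASSO also acts as a feature selector.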
• Satisfying requirements at varying loads
• Engineering is for the people
• Business, Environment and Society Sustainability
• Cloud Service Level Agreements (SLAs)
• Service Level Objectives (SLOs)
• Financial penalties
• Service Level Indicators (SLIs)
• Tradeoffs
• Performance
• Resource Consumption and Cost
• Isolation and Security
• State of the art: scaling out (adding containers)
Motivation
• Consider a saturated container cloud
• Little/no benefit from scaling out
• Or up (finite pie)
• Instead, change resource limits
• However, CPU utilization SLIs do not mean much to end-users
• Instead, use response time and throughput
• However, hard to autonomously map to limits
• Therefore, main contribution
• Collect dataset from multitenant deployments
• Detect and remove anomalies (GC, etc.)
• Machine Learn RT/Thru performance models
• Resize containers based on loads and target SLOs
Research Problem
Data Collection:
Testbed
Applications
Load Driving
• Machine provided by STRATUS (cybersecurity) project
• Availability is a key cybersecurity requirement!
• 24-CPU Intel Xeon
• 256GB RAM
• Private and local cloud
• Performance isolation is imperative
• Can't rely on public clouds
• 8 VMs (4 CPUs + 4GB RAM)
• 1 master, 8 workers
• Kubernetes using Oracle's Vagrant Script
• Load was driven from a separate machine
• Performance isolation is imperative!
Testbed
1. NGINX (app1)
• Single container image
• Replicated 7 times (Load balancer Kubernetes service
exposed)
2. IBM WebSphere Liberty Profile (app2)
• Single container image: runs IBM JVM + Liberty Profile
• Replicated 7 times (Load balancer Kubernetes service
exposed)
3. Redis + PHP Guestbook
• PHP x7, Redis-Master x1, Redis-Slave x7
Deployed Applications
• Separate Machine provided by University of Waikato
• 8-CPU Intel Xeon
• 16GB RAM
• Stress-Testing Script (dataset creation):
1. Select random workloads
2. Select random CPU limits (soft and hard)
3. Redeploy apps
4. Fire requests using Apache ab x16
5. Collect RT and Thru SLIs
6. Repeat x500
Load Driving
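A minimal sketch of how the random trial plan behind the stress-testing steps could be generated. The value ranges, the dictionary keys, and the mapping of soft/hard limits to CPU requests/limits are assumptions made for illustration; the slides do not give them. Only the `ab` flags `-n` (number of requests) and `-c` (concurrency) are real options of Apache ab.

```python
import random

def sample_trial(rng, apps=("app1", "app2", "app3")):
    """Draw one random configuration: a workload and CPU limits per app."""
    trial = {}
    for app in apps:
        soft = rng.randrange(100, 1001, 50)   # soft limit in millicores (assumed range)
        hard = rng.randrange(soft, 1001, 50)  # hard limit, never below the soft limit
        trial[app] = {
            "requests": rng.randrange(1000, 10001, 1000),  # assumed workload range
            "cpu_request_m": soft,
            "cpu_limit_m": hard,
        }
    return trial

def build_plan(n_trials=500, seed=0):
    """Build the list of trials (the slides repeat the loop 500 times)."""
    rng = random.Random(seed)
    return [sample_trial(rng) for _ in range(n_trials)]

def ab_command(url, requests, concurrency=10):
    """Command line for one Apache ab run; the concurrency default is an assumption."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]
```

Each trial would then be applied by redeploying the apps with the sampled limits, running the `ab` command, and recording the RT and Thru SLIs; those steps depend on the cluster and are not shown.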
Approach to Maintain SLOs of Cloud-native
Applications via Self-Adaptive Resource Sharing:
Anomalies Removal
Prediction of the Service Level Indicators
SLO-compliant Resource Allocation
Limitations of the Approach
MAPE-K Inspired Architecture of the Solution
Approach Overview
Collecting SLI Values for
Various Resource Limits and
Workload Rates
Anomalies Identification and
Removal
Learning Prediction Models
𝑆𝐿𝐼 = 𝑓(𝑊𝑜𝑟𝑘𝑙𝑜𝑎𝑑, 𝑅𝑒𝑠𝐿𝑖𝑚)
Deriving the Resource Limits for
Applications via Optimization
• Any data-based method, including prediction, is only as good as the data
given to it (garbage in, garbage out)
 If the input does not contain the information needed to describe the output,
then the model produced by any approach won't be accurate
• Solution: add the missing information to the input (collect anew?) OR
remove the intractable data from the output
• GOAL: remove the intractable data from the output to leave only the
explainable data (~normal distribution)
Anomalies Removal: Motivation
1. Expectation-Maximization (EM) clustering with 10-fold cross-validation to get the clusters of similar observations
2. Find the cluster that represents the anomalies ("too high" and "too low" SLI values + high standard deviation)
3. Remove the data grouped into this cluster from the dataset and try to fit the model to see whether the R² score improves
~13% of the data is identified as anomalies to be removed
Anomalies Removal: Approach
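The clustering step can be illustrated with a small, dependency-free sketch. It is not the setup used in the study: it fits a two-component Gaussian mixture with EM on a single SLI (one-dimensional data), fixes the number of clusters at two instead of selecting it by cross-validation, and treats the cluster with the larger standard deviation as anomalous. The function names are hypothetical.

```python
import math
import statistics

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM; return a label (0 or 1) per value."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                                  # start the components at the extremes
    variances = [((hi - lo) / 4) ** 2 or 1.0] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in values:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)             # floor to avoid a collapsed component
    return [0 if r[0] >= r[1] else 1 for r in resp]

def remove_anomalies(values):
    """Drop the cluster with the larger spread (assumed to hold the anomalies)."""
    labels = em_two_clusters(values)

    def spread(k):
        members = [x for x, label in zip(values, labels) if label == k]
        return statistics.pstdev(members) if len(members) > 1 else 0.0

    anomalous = max((0, 1), key=spread)
    return [x for x, label in zip(values, labels) if label != anomalous]
```

After the removal, the model would be refitted and kept only if the R² score improves, as in step 3.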
The alternative Isolation Forest approach gives a similar result: ~11%
Anomalies Removal: Result
• A performance model allows us to answer the following question:
 What performance can be achieved for the given configuration?
… without testing all the possible options.
• In our context, the question becomes:
 What service level indicator values (throughput and 99th-percentile
response time) can be achieved for the given workload rate and
resource limits (CPU in millicores) of our co-located
containerized applications?
• Possible answers: Analytical (Expert Modeling) OR Black Box (Machine Learning)
• Black Box (Machine Learning) … BUT WHY?
 HYPE? FUNDING? TO GET PAPER ACCEPTED? LAZINESS? TO GET CITATIONS?
 TOO MANY APPS TO GENERALIZE WITH FIXED MODELS
Learning the Performance Model: Motivation
• Challenge – many ML approaches (linear regression, lasso regression,
neural networks). How to select?
• Via R-squared (R²)! It is a statistical measure that represents the
proportion of the variance of a dependent variable that is explained by
the independent variables in a regression model.
Pre-Study I – ML Approach Selection
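As a quick illustration of the selection criterion, R² can be computed from observed and predicted values as one minus the ratio of the residual to the total sum of squares. The function name is ours; the formula is the standard definition.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1.0, and a model that always predicts the mean of the observations gives 0.0.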
• Unresolved Questions:
 What should be predicted? All SLIs at once? The SLIs of a given application?
A specific SLI for all apps?
 What degree of the polynomial should be selected for the model?
Pre-Study II – Model & Parameters Selection
• Option I: Independent Models (single output variable):
 A) Without target variables as predictors
 B) With target variables as predictors
• Option II: Application-wise Models (two output variables):
 A) Without target variables as predictors
 B) With target variables as predictors
• Option III: SLI-wise Models (three output variables):
 A) Without target variables as predictors
 B) With target variables as predictors
• Option IV: All-targets Models (six output variables):
Pre-Study II – Model & Parameters Selection
• Model of choice – Application-wise model of degree 1 with target
variables as predictors
Learning the Performance Model: Results
Performance Model for App 1: inputs are workload rates, resource limits, and SLIs (App 2, 3); output is SLIs (App 1)
Performance Model for App 2: inputs are workload rates, resource limits, and SLIs (App 1, 3); output is SLIs (App 2)
• Reasons:
 Consistent resource limits per app for the later stages
 Well-balanced (High R-squared and small fitting time)
 Scales well with increase in the number of apps
Learning the Performance Model: Results
• Service Level Objectives (SLOs) are thresholds placed on Service
Level Indicators, such as throughput or response time, that characterize
the appropriate behavior of the system
• Example: the user should receive the response in under 800 ms for 99%
of the requests, and the system should process at least 30
requests per second
• If there are not enough resources (CPU, memory…), requests could
end up being dropped or served in more than 800 ms
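The example SLOs above can be checked against measured SLIs with a few lines of code. This is a minimal sketch: the nearest-rank percentile definition and the function names are our assumptions, and only the two thresholds (800 ms, 30 requests per second) come from the slide.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

def meets_slos(response_times_ms, requests_served, duration_s,
               rt_slo_ms=800, throughput_slo_rps=30):
    """True if the 99th-percentile response time and the throughput both satisfy their SLOs."""
    p99 = percentile(response_times_ms, 99)
    throughput = requests_served / duration_s
    return p99 < rt_slo_ms and throughput >= throughput_slo_rps
```

A run violates the SLOs if either condition fails, which matches how response-time and throughput violations are counted separately in the evaluation.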
SLO-compliant Resource Allocation: Motivation
→ The system should have
enough capacity to minimize SLO violations
under the changing workload
Seems like an OPTIMIZATION PROBLEM!
• SLOs for ith app:
 on throughput:
 on response time:
• Per app-SLI pair cost functions:
 for response time:
 for throughput:
• Application-wise cost function:
SLO-compliant Resource Allocation: Formalism
Predicted SLIs
• Formulation of constrained optimization problem:
SLO-compliant Resource Allocation: Formalism
• Constraint: NP-hard nonlinear integer programming formulation
• Workaround: solving as continuous constrained optimization problem
• Selected optimization method: trust region-based for nonlinear
constrained optimization
• Alternatives and augmentations:
 pure fine-grained brute force with step size 10 by 10 by 10 (BF-10)
 pure coarse-grained brute force with step size 50 by 50 by 50 (BF-50)
 pure trust region-based continuous optimization (CO)
 continuous optimization with coarse-grained brute force
(Hyb = CO + BF-50)
SLO-compliant Resource Allocation: Design
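The brute-force alternatives (BF-10, BF-50) amount to a grid search over the CPU limits of the three apps. The sketch below shows that half only. The continuous trust-region step and the way the hybrid method combines the two are not spelled out in the slides, so they are left out. `toy_cost` is a hypothetical stand-in for the SLO-violation cost, whose actual formula is not reproduced here.

```python
from itertools import product

CPU_BUDGET_M = 3000  # total CPU limit in millicores

def brute_force(cost, n_apps=3, step=50, budget=CPU_BUDGET_M):
    """Return the grid allocation with the lowest cost whose total stays within the budget."""
    grid = range(step, budget + 1, step)
    best, best_cost = None, float("inf")
    for alloc in product(grid, repeat=n_apps):
        if sum(alloc) > budget:
            continue  # infeasible: exceeds the shared CPU budget
        c = cost(alloc)
        if c < best_cost:
            best, best_cost = alloc, c
    return best, best_cost

def toy_cost(alloc, needed=(900, 1200, 600)):
    # Hypothetical cost: total CPU shortfall (millicores) against made-up per-app needs.
    return sum(max(0, n - a) for n, a in zip(needed, alloc))
```

With step 50 the grid has 60 values per app, so 60³ = 216,000 candidate allocations are visited. The count grows exponentially with the number of apps, which illustrates why brute force scales badly.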
• Parameters:
 For all apps - ms and RPS
 Limit on CPU: 3000 mCPUs
 Number of tests: 10 for both soft and hard limits
• Evaluation results:
• Conclusions on method selection for resource allocation:
 The hybrid method is more accurate than pure continuous optimization
 Fine-grained brute force is not an option due to its high execution time
 Coarse-grained brute force has good accuracy and low execution time
but scales poorly
 The hybrid method strikes a good balance between execution time and the
quality of the solution
SLO-compliant Resource Allocation: Design
Hence, we will allocate the resources with the hybrid approach
Evaluation:
Method
Results
Limitations
• The validation test consists of two parts:
 Preliminary Validation Test (PVT) to acquire the SLI values used as inputs to the
application-wise performance model (16 times with ab); no optimization
 Evaluation Validation Test (EVT) to conduct the actual evaluation based
on the values from the PVT (16 times with ab); optimization is done with the hybrid
approach
Evaluation Method
• Test settings:
 PVT:
 EVT:
Evaluation Method
Result of PVT
• The approach proved to be appropriate for the installation and SLOs:
 at most 2 SLO violations out of 16 trials for the 99th-percentile response time
 at most 1 SLO violation out of 16 trials for the throughput
Evaluation Results
• Simplistic dataset that does not allow us to prove whether the approach is
feasible for more complex applications
• Performance models are susceptible to the influence of events that are not
reflected by the input variables (such as garbage collection in Java apps)
• Cost of the resources is not taken into account
• Focus on CPU
Limitations of the Approach
Usage Scenario:
Kubernetes’ Vertical Pods Autoscaler (VPA)
Augmenting VPA
• Known Kubernetes autoscaling options:
 Horizontal Pod Autoscaler (HPA) // production-ready; changes the number of pods
 Cluster Autoscaler (CA) // production-ready; changes the number of instances for
various cloud providers, overriding native autoscalers
 Vertical Pod Autoscaler (VPA) // beta; changes the amount of resources (CPU,
memory) allocated to a pod, but requires a pod restart
 Addon Resizer (AR) // beta; simplified VPA that modifies resource requests based on the
number of nodes
• All options use a reactive approach
• HPA supports arbitrary scaling metrics
• VPA and HPA are currently incompatible when scaling on memory/CPU
Autoscaling in Kubernetes
Vertical Pod Autoscaler (VPA) sets the resource requests automatically based on usage,
thus allowing proper scheduling onto nodes so that an appropriate amount of resources is
available for each pod. 1)
Core components of VPA – Recommender and Updater
Vertical Pod Autoscaler
1) https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
Recommender computes the recommended resource requests for pods based on current and
historical usage of resources.
First, it equally shares the minimal amount of resources among the containers in the given pod (by
default, 250 MB of memory and 25 mCPUs):
$r_i = \frac{1}{n_j} \cdot R_j$
where, for the given resource type, $r_i$ is the minimal amount of the resource for the ith
container of the jth pod, $n_j$ is the number of containers in the jth pod, and $R_j$ is the minimal
amount of the resource for the jth pod.
VPA: Recommender (1)
Second, it utilizes three chains of estimators to produce the target estimate, the lower bound
estimate, and the upper bound estimate of the resources to allocate to the pod. Each chain
contains a percentile estimator (90%, 50%, and 95%, respectively) followed by a margin estimator
(15% overhead). For the lower and upper bounds, a confidence multiplier is added
($k = \left(1 + \frac{m}{d}\right)^{e}$). The maximum of each estimate and the minimal amount of the
resource from before is selected. The input is resource usage data collected over $d$ days.
So, for either of the two resource types (CPU, memory) we have:
Target estimate: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,90\%}\right)$
Lower bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,50\%} \cdot \left(1 + \frac{0.001}{d}\right)^{-2}\right)$
Upper bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,95\%} \cdot \left(1 + \frac{1}{d}\right)\right)$
VPA: Recommender (2)
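The three estimates can be traced with a short sketch that follows the formulas on this slide. It is an illustration only, not the VPA source code: the function names and the `p50`/`p90`/`p95` dictionary keys are hypothetical, and the usage percentiles are assumed to be already computed for each container.

```python
def confidence_multiplier(m, d, e):
    """k = (1 + m / d) ** e, where d is the number of days of usage history."""
    return (1.0 + m / d) ** e

def pod_estimates(containers, d, min_pod_resource, margin=0.15):
    """Return (lower, target, upper) estimates for one resource type of one pod.

    containers: list of dicts with per-container usage percentiles 'p50', 'p90', 'p95'.
    """
    r_min = min_pod_resource / len(containers)   # equal share of the pod minimum
    k_low = confidence_multiplier(0.001, d, -2)  # lower-bound multiplier
    k_up = confidence_multiplier(1.0, d, 1)      # upper-bound multiplier
    target = sum(max(r_min, (1 + margin) * c["p90"]) for c in containers)
    lower = sum(max(r_min, (1 + margin) * c["p50"] * k_low) for c in containers)
    upper = sum(max(r_min, (1 + margin) * c["p95"] * k_up) for c in containers)
    return lower, target, upper
```

With one day of history (d = 1) the upper bound is twice the margin-adjusted 95th percentile; as d grows, both multipliers approach 1 and the bounds tighten around the percentile estimates.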
Updater runs in the Kubernetes cluster and decides which pods should be restarted based on
the resource allocation recommendation calculated by Recommender. Practically speaking,
Updater evicts the pods to be updated, whereas the actual recreation of pods with new
resource requests is left to the controller that owns the pods (e.g. Deployment/ReplicaSet).
The only notable thing about the Updater is that pods are updated in order of priority.
Update priority is proportional to the fraction by which resources should be increased / decreased.
Hence, the update priority for the jth pod is computed as follows:
$p_j = \frac{\left|\sum_{i=1}^{n_j} CPU_i^{Req} - \sum_{i=1}^{n_j} CPU_i^{Rec}\right|}{\sum_{i=1}^{n_j} CPU_i^{Req}} + \frac{\left|\sum_{i=1}^{n_j} mem_i^{Req} - \sum_{i=1}^{n_j} mem_i^{Rec}\right|}{\sum_{i=1}^{n_j} mem_i^{Req}}$
Currently the only supported update strategy is based on pod restarts.
VPA: Updater
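The update priority can be sketched as the sum, over CPU and memory, of the relative difference between the requested and the recommended totals of a pod. The dictionary keys below are hypothetical, and taking the absolute value of each difference is our reading of "increased / decreased".

```python
def update_priority(containers):
    """containers: list of dicts with requested and recommended CPU and memory per container."""
    priority = 0.0
    for res in ("cpu", "mem"):
        requested = sum(c[res + "_req"] for c in containers)
        recommended = sum(c[res + "_rec"] for c in containers)
        priority += abs(requested - recommended) / requested
    return priority
```

For example, a pod that requests 100 mCPUs and 200 MB but is recommended 150 mCPUs and 100 MB gets priority 0.5 + 0.5 = 1.0.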
• Add relevant VPA settings, e.g. SLOs, number of trials to get data
• Add VPA Performance Data Collector
• Add VPA Recommender option to use the presented approach
for SLO-compliant resource allocation
• Augment VPA Updater and runtimes to avoid restart of pods
on vertical scaling
Augmenting VPA: the Proposal
Conclusions & Future Work
• Major contribution:
 approach to the SLO-compliant resource allocation problem for co-
located containerized applications with the following steps:
1. Collecting SLI values for various resource limits and workload rates
2. Removing anomalies that cannot be explained through available
features via clustering
3. Learning prediction models relating SLIs to parameters of workload
and resource limits
4. Deriving the resource limits for applications deployment via
continuous optimization and limited brute force search for known
SLOs
 the approach is validated with at most 2 SLO violations among 16 trials
Conclusions
• Additional findings:
 an approach to select the model features and parameters thereof in
order to increase the accuracy of the SLI prediction model
 lasso regression-based models of degree 2 seem to suffice for
predicting SLIs
Conclusions
• Augment the approach and repeat the study for larger and more realistic
set of apps
• Evaluation of the impact of runtime/technology-specific behaviors like
garbage collection on SLIs and search for predictors for such behaviors
• Evaluation of artificial neural networks for predicting SLIs
• SLO-compliant resource allocation for individual microservices of
compound applications
• Derivation of models for allocation of other resources like RAM
Future Work
Contacts
Vladimir Podolskiy
v.podolskiy@tum.de
/vladimirpodolskiy
/Vladimir_Podolskiy
Panos Patros
panos.patros@waikato.ac.nz
/panos-patros
/Panos_Patros

More Related Content

What's hot

Cncf checkov and bridgecrew
Cncf checkov and bridgecrewCncf checkov and bridgecrew
Cncf checkov and bridgecrewLibbySchulze
 
Eseguire Applicazioni Cloud-Native con Pivotal Cloud Foundry su Google Cloud ...
Eseguire Applicazioni Cloud-Native con Pivotal Cloud Foundry su Google Cloud ...Eseguire Applicazioni Cloud-Native con Pivotal Cloud Foundry su Google Cloud ...
Eseguire Applicazioni Cloud-Native con Pivotal Cloud Foundry su Google Cloud ...VMware Tanzu
 
Accelerate Digital Transformation with Pivotal Cloud Foundry on Azure
Accelerate Digital Transformation with Pivotal Cloud Foundry on AzureAccelerate Digital Transformation with Pivotal Cloud Foundry on Azure
Accelerate Digital Transformation with Pivotal Cloud Foundry on AzureVMware Tanzu
 
Faster, more Secure Application Modernization and Replatforming with PKS - Ku...
Faster, more Secure Application Modernization and Replatforming with PKS - Ku...Faster, more Secure Application Modernization and Replatforming with PKS - Ku...
Faster, more Secure Application Modernization and Replatforming with PKS - Ku...VMware Tanzu
 
Pivotal Container Service il modo più semplice per gestire Kubernetes in azie...
Pivotal Container Service il modo più semplice per gestire Kubernetes in azie...Pivotal Container Service il modo più semplice per gestire Kubernetes in azie...
Pivotal Container Service il modo più semplice per gestire Kubernetes in azie...VMware Tanzu
 
Java Application Modernization Patterns and Stories from the IBM Garage
Java Application Modernization Patterns and Stories from the IBM GarageJava Application Modernization Patterns and Stories from the IBM Garage
Maintaining SLOs of Cloud-native Applications via Self-Adaptive Resource Sharing

  • 1. Maintaining SLOs of Cloud-native Applications via Self-Adaptive Resource Sharing
      • Vladimir Podolskiy*, Michael Mayo**, Abigail Koay**, Michael Gerndt*, Panos Patros**
      • *Technical University of Munich (TUM), Germany
      • **University of Waikato, New Zealand
      • IEEE SASO 2019, Umeå, Sweden, June 18th 2019
      • Full Paper. Cloud-based Adaptation.
  • 2. We are…
      • Vladimir Podolskiy: 3rd-year PhD, TUM. Predictive autoscaling and anomaly detection
      • Panos Patros: Lecturer in Software Engineering. Head of ORCA lab
      Panos Patros & Vladimir Podolskiy | Maintaining SLOs of Cloud-native Applications via Self-Adaptive…
  • 3. Oceania Researchers in Cloud and Adaptive-systems / Ohu Rangahau Kapua Aunoa
      • ORCA Lab at Waikato, NZ
      • Started in Jan 2019
      • 1 Research Assistant
      • 9 Research Students
      • 7 Faculty
      • 2 Interns
      • 8 International Collaborators (Canada, Germany, USA)
      • New graduate-level course (70% Project Component)
      • COMPX529 Engineering Self-Adaptive Systems
  • 4. Contents
      • Background:
          • Containerization & Cloud Computing
          • Resource Sharing & Container Orchestration via Kubernetes
          • Machine Learning & Lasso Regression
      • Motivation of the Study
      • Research Problem
      • Data Collection
      • Proposed Approach:
          • Anomaly Removal
          • Prediction of the Service Level Indicators
          • SLO-compliant Resource Allocation
      • Evaluation:
          • Method
          • Results
          • Limitations
      • Usage Scenario: Augmenting the Vertical Pod Autoscaler (VPA) of Kubernetes
      • Conclusions & Future Work
  • 5. Introduction: Background, Motivation, Research Problem
  • 6. Background
      • Cloud
      • Abstracts computing resources
      • Containers
      • Low-overhead OS-level virtualization
      • Multitenancy
      • Containerization of apps
      • Kubernetes
      • Orchestration of app containers
      • Resource Management
      • Soft (e.g. CPU shares) and hard (e.g. CFS quota) limits
      • Machine Learning
      • Least Absolute Shrinkage and Selection Operator (LASSO)
      • L1-regularized linear regression that shrinks coefficients towards zero
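To make the soft/hard distinction concrete, here is a minimal sketch of how a CPU request and a CPU limit, given in millicores, could translate into Linux cgroup v1 settings: `cpu.shares` for the soft limit and a CFS quota for the hard limit. The constants and both function names are assumptions chosen for illustration (they follow commonly cited kubelet defaults); the slides do not specify them.

```python
# Illustrative only: constants are assumed defaults, not taken from the slides.
SHARES_PER_CPU = 1024      # cpu.shares granted per full CPU
MIN_SHARES = 2             # smallest cpu.shares value assumed here
CFS_PERIOD_US = 100_000    # assumed CFS enforcement period (100 ms)
MIN_QUOTA_US = 1_000       # smallest quota assumed here (1 ms)


def cpu_shares(request_millicores: int) -> int:
    """Soft limit: a relative weight that only matters under CPU contention."""
    if request_millicores <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, request_millicores * SHARES_PER_CPU // 1000)


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Hard limit: CPU time in microseconds the container may use per period."""
    if limit_millicores <= 0:
        return -1  # -1 means "no quota" in cgroup v1
    return max(MIN_QUOTA_US, limit_millicores * period_us // 1000)


if __name__ == "__main__":
    # Half a CPU as request and limit
    print(cpu_shares(500), cfs_quota_us(500))
```

A container with a soft limit can exceed its share while the node is idle, whereas the quota throttles it regardless of spare capacity. Both knobs appear later in the deck as the resource limits being tuned.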
  • 7. Motivation
      • Satisfying requirements at varying loads
      • Engineering is for the people
      • Business, Environment and Society Sustainability
      • Cloud Service Level Agreements (SLAs)
      • Service Level Objectives (SLOs)
      • Financial penalties
      • Service Level Indicators (SLIs)
      • Tradeoffs
      • Performance
      • Resource Consumption and Cost
      • Isolation and Security
      • State of the art: scaling out (adding containers)
  • 8. Research Problem
      • Consider a saturated container cloud
      • Little/no benefit from scaling out
      • Or up (finite pie)
      • Instead, change resource limits
      • However, CPU utilization SLIs do not mean much to end-users
      • Instead, use response time (RT) and throughput (Thru)
      • However, these are hard to map to resource limits autonomously
      • Therefore, main contribution:
      • Collect dataset from multitenant deployments
      • Detect and remove anomalies (GC, etc.)
      • Learn RT/Thru performance models via machine learning
      • Resize containers based on loads and target SLOs
  • 9. Data Collection: Testbed, Applications, Load Driving
  • 10. Testbed
      • Machine provided by STRATUS (cybersecurity) project
      • Availability is a key cybersecurity requirement!
      • 24-CPU Intel Xeon
      • 256GB RAM
      • Private and local cloud
      • Performance isolation is imperative
      • Can't rely on public clouds
      • 8 VMs (4 CPUs + 4GB RAM)
      • 1 master, 8 workers
      • Kubernetes using Oracle's Vagrant Script
      • Load was driven from a separate machine
      • Performance isolation is imperative!
  • 11. Deployed Applications
      1. NGINX (app1)
          • Single container image
          • Replicated 7 times (Load balancer Kubernetes service exposed)
      2. IBM WebSphere Liberty Profile (app2)
          • Single container image: runs IBM JVM + Liberty Profile
          • Replicated 7 times (Load balancer Kubernetes service exposed)
      3. Redis + PHP Guestbook
          • PHP x7, Redis-Master x1, Redis-Slave x7
  • 12. Load Driving
      • Separate machine provided by University of Waikato
      • 8-CPU Intel Xeon
      • 16GB RAM
      • Stress-testing script (dataset creation):
          1. Select random workloads
          2. Select random CPU limits (soft and hard)
          3. Redeploy apps
          4. Fire requests using Apache ab x16
          5. Collect RT and Thru SLIs
          6. Repeat x500
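The stress-testing loop on this slide can be sketched roughly as follows. This is a hypothetical outline, not the authors' script: the application labels, value ranges, and the `redeploy` and `measure` callbacks are placeholders introduced here, and the "x16" detail of the `ab` step is not modeled. The only real tool interface used is Apache `ab` with its `-n` (total requests) and `-c` (concurrency) options.

```python
import random
from typing import Callable, Dict, List

# Placeholder labels for the three deployed applications.
APPS = ["app1", "app2", "guestbook"]

Trial = Dict[str, Dict[str, int]]


def sample_trial(rng: random.Random) -> Trial:
    """Steps 1-2: draw a random workload and random CPU limits per app."""
    trial: Trial = {}
    for app in APPS:
        soft = rng.randint(100, 2000)      # CPU request, millicores (assumed range)
        hard = rng.randint(soft, 4000)     # CPU limit, never below the request
        trial[app] = {
            "requests": rng.randint(1_000, 50_000),  # assumed range
            "concurrency": rng.randint(1, 64),       # assumed range
            "cpu_request_m": soft,
            "cpu_limit_m": hard,
        }
    return trial


def ab_command(url: str, requests: int, concurrency: int) -> List[str]:
    """Step 4: build an Apache ab invocation (not executed here)."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def run_campaign(
    trials: int,
    redeploy: Callable[[Trial], None],
    measure: Callable[[Trial], Dict[str, float]],
    seed: int = 0,
) -> List[dict]:
    """Steps 3-6: redeploy, drive load, record SLIs, repeat."""
    rng = random.Random(seed)
    rows = []
    for _ in range(trials):
        trial = sample_trial(rng)
        redeploy(trial)              # step 3: apply the sampled limits
        slis = measure(trial)        # steps 4-5: fire requests, collect RT/Thru
        rows.append({"config": trial, "slis": slis})
    return rows
```

Passing `redeploy` and `measure` as callbacks keeps the sampling logic testable without a cluster; in a real run they would wrap the deployment tooling and the `ab` invocations.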
  • 13. Approach to Maintain SLOs of Cloud-native Applications via Self-Adaptive Resource Sharing
      • Anomaly Removal
      • Prediction of the Service Level Indicators
      • SLO-compliant Resource Allocation
      • Limitations of the Approach
  • 14. MAPE-K Inspired Architecture of the Solution
  • 15. Approach Overview
      1. Collecting SLI values for various resource limits and workload rates
      2. Anomaly identification and removal
      3. Learning prediction models SLI = f(Workload, ResLim)
      4. Deriving the resource limits for applications via optimization
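The prediction step, SLI = f(Workload, ResLim), can be written out using the LASSO regression named in the Background slide. The notation is introduced here for illustration and is not from the slides: $x_i$ is the feature vector of observation $i$ (workload rate and resource limits), $y_i$ the measured SLI, $n$ the number of observations, and $\lambda \ge 0$ the regularization strength.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda \lVert \beta \rVert_1,
\qquad
\widehat{\mathrm{SLI}}(x) = \hat{\beta}_0 + x^{\top}\hat{\beta}
```

The $\ell_1$ penalty drives some coefficients exactly to zero, so features that do not help predict the SLI drop out of the fitted model.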
  • 17. Anomalies Removal: Motivation
      • Any data-based method, including prediction, is only as good as the data given to it (garbage-in, garbage-out principle)
          • If the input does not contain the information needed to describe the output, then the model produced by any approach won't be accurate
  • 18. Anomalies Removal: Motivation • Solution: add missing information to the input (collect anew?) OR remove the intractable data from the output
  • 19. Anomalies Removal: Motivation • Remove the intractable data from the output • Goal: to leave only the explainable data (~normal distribution)
  • 20. Anomalies Removal: Motivation
  • 21. Anomalies Removal: Approach 1. Expectation-Maximization (EM) clustering with 10-fold cross-validation to get the clusters of similar observations 2. Find the cluster that represents the anomalies (“too high” and “too low” SLI values + high standard deviation) 3. Remove the data grouped into this cluster from the dataset and refit the model to see whether the R² score improves • ~13% of the observations are flagged as anomalies to be removed • An alternative Isolation Forest approach gives a similar ~11%
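To make the clustering step more concrete, here is a minimal sketch of EM-based anomaly removal on a single SLI column. It is a simplification under stated assumptions: the mixture is one-dimensional with exactly two Gaussian components (the slides select clusters with 10-fold cross-validation, which is omitted here), and the component with the larger standard deviation is treated as the anomaly cluster. The function names are hypothetical.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_em(values, iters=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    mus = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = (hi - lo) / 4 or 1.0
    sigmas = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            p = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mus[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(values)
    return weights, mus, sigmas


def remove_anomalies(values):
    """Drop observations assigned to the high-variance (anomaly) component."""
    weights, mus, sigmas = fit_em(values)
    anomalous = 0 if sigmas[0] > sigmas[1] else 1
    normal = 1 - anomalous
    kept = []
    for x in values:
        p = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        if p[anomalous] <= p[normal]:
            kept.append(x)
    return kept
```

In the workflow described on the slide, the filtered data would then be used to refit the prediction model and compare its R² score against the fit on the unfiltered data.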
  • 22. Anomalies Removal: Result
  • 25. Learning the Performance Model: Motivation • A performance model allows us to answer the following question:  What performance can be achieved for a given configuration? … without testing all the possible options. • In our context, the question becomes:  What service level indicator values (throughput and 99th-percentile response time) can be achieved for the given workload rate and resource limits (CPU in millicores) of our co-located containerized applications?
  • 26. Learning the Performance Model: Motivation • Possible answers: analytical (expert modeling) OR black box (machine learning)
  • 27. Learning the Performance Model: Motivation • Black box (machine learning) … but why?
  • 28. Learning the Performance Model: Motivation • Black box (machine learning): Hype? Funding? To get the paper accepted? Laziness? To get citations?
  • 30. Learning the Performance Model: Motivation • Black box (machine learning): too many apps to generalize with fixed models
  • 32. Pre-Study I – ML Approach Selection • Challenge: there are many ML approaches (linear regression, lasso regression, neural networks). How to select one? • Via R-squared (R²): a statistical measure of the proportion of the variance of a dependent variable that is explained by the variables in a regression model.
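For reference, R² is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a small sketch of that standard definition; the function name is ours, and the slides do not state which implementation was used.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1.0 - ss_res / ss_tot
```

A model that predicts every observation exactly scores 1, while a model that always predicts the mean of the observations scores 0, so candidate ML approaches can be ranked by how close their score is to 1.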
  • 33. Pre-Study I – ML Approach Selection
  • 34. Pre-Study II – Model & Parameters Selection • Unresolved questions:  What should be predicted? All SLIs? The SLIs of a given application? A specific SLI for all apps?  What degree of the polynomial should be selected for the model?
  • 35. Pre-Study II – Model & Parameters Selection • Option I: Independent Models (single output variable):  A) Without target variables as predictors  B) With target variables as predictors
  • 36. Pre-Study II – Model & Parameters Selection • Option II: Application-wise Models (two output variables):  A) Without target variables as predictors  B) With target variables as predictors
  • 37. Pre-Study II – Model & Parameters Selection • Option III: SLI-wise Models (three output variables):  A) Without target variables as predictors  B) With target variables as predictors
  • 38. Pre-Study II – Model & Parameters Selection • Option IV: All-targets Models (six output variables)
  • 39. Learning the Performance Model: Results • Model of choice: application-wise model of degree 1 with target variables as predictors • Performance Model for App 1: Workload rates, Resource limits, SLIs (App 2, 3) → SLIs (App 1) • Performance Model for App 2: Workload rates, Resource limits, SLIs (App 1, 3) → SLIs (App 2)
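One way to read a degree-1 application-wise model with target variables as predictors is as a linear map per output SLI. The notation below is an assumption for illustration and does not come from the slides: $w$ collects the workload rates, $c$ the CPU limits, $s_{2,3}$ the SLIs of the other two apps, and $y$ is one SLI of app 1. The second expression is the usual lasso objective, with a hypothetical penalty weight $\lambda$, over $N$ observations with feature vectors $x_t$.

```latex
\hat{y} = \beta_0 + \beta_w^{\top} w + \beta_c^{\top} c + \beta_s^{\top} s_{2,3},
\qquad
\min_{\beta}\; \frac{1}{2N} \sum_{t=1}^{N} \bigl( y_t - \beta_0 - x_t^{\top} \beta \bigr)^2 + \lambda \lVert \beta \rVert_1
```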
  • 40. Learning the Performance Model: Results • Reasons:  Consistent resource limits per app for the later stages  Well-balanced (high R-squared and short fitting time)  Scales well as the number of apps increases
  • 44. SLO-compliant Resource Allocation: Motivation • Service Level Objectives (SLOs) are thresholds on Service Level Indicators, such as throughput or response time, that characterize appropriate behavior of the system • Example: the user should receive a response in under 800 ms for 99% of the requests, and the system should process at least 30 requests per second • If there are not enough resources (CPU, memory…), requests could end up being dropped or served in more than 800 ms → The system should have enough capacity to minimize SLO violations under a changing workload
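The example SLOs above can be checked against measured data with a few lines of code. This sketch assumes a nearest-rank percentile and a list of per-request response times observed over a known interval; the function names and the default thresholds (taken from the example) are illustrative only.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slos(response_times_ms, duration_s, rt_slo_ms=800, thr_slo_rps=30):
    """True if the 99th-percentile response time and the throughput satisfy the SLOs."""
    p99 = percentile(response_times_ms, 99)
    throughput = len(response_times_ms) / duration_s  # served requests per second
    return p99 < rt_slo_ms and throughput >= thr_slo_rps
```

For instance, 100 requests served in 2 seconds with a single slow outlier still satisfy both objectives, whereas two slow requests out of 100 push the 99th percentile over the threshold.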
  • 45. SLO-compliant Resource Allocation: Motivation • Looks like an optimization problem!
  • 46. SLO-compliant Resource Allocation: Formalism • SLOs for the ith app:  on throughput:  on response time: • Per app-SLI pair cost functions:  for response time:  for throughput: • Application-wise cost function: • Predicted SLIs
  • 47. SLO-compliant Resource Allocation: Formalism • Formulation of the constrained optimization problem:
  • 51. SLO-compliant Resource Allocation: Design • Constraint: NP-hard nonlinear integer programming formulation • Workaround: solve it as a continuous constrained optimization problem • Selected optimization method: trust region-based method for nonlinear constrained optimization • Alternatives and augmentations:  pure fine-grained brute force with step size 10 by 10 by 10 (BF-10)  pure coarse-grained brute force with step size 50 by 50 by 50 (BF-50)  pure trust region-based continuous optimization (CO)  continuous optimization with coarse-grained brute force (Hyb = CO + BF-50)
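The brute-force alternatives can be illustrated with a small grid search over per-app CPU limits. This is a sketch under assumptions: the cost function is a caller-supplied stand-in for the SLO-violation cost (the actual cost functions are not recoverable from the slides), the lower bound of the grid is arbitrary, and the 3000 mCPU budget mirrors the CPU limit mentioned in the evaluation parameters. How the hybrid method combines this search with continuous optimization is not shown here.

```python
from itertools import product


def brute_force(cost, n_apps=3, step=50, lowest=50, total_cpu=3000):
    """Try every CPU-limit combination on a grid and keep the cheapest feasible one."""
    grid = range(lowest, total_cpu + 1, step)
    best_limits, best_cost = None, float("inf")
    for limits in product(grid, repeat=n_apps):
        if sum(limits) > total_cpu:  # respect the shared CPU budget
            continue
        current = cost(limits)
        if current < best_cost:
            best_limits, best_cost = limits, current
    return best_limits, best_cost


def toy_cost(limits, demands=(400, 900, 650)):
    """Hypothetical cost: distance of each limit from a made-up per-app demand."""
    return sum(abs(limit - demand) for limit, demand in zip(limits, demands))
```

With `step=50` the grid has 60 values per app, so three apps require 216,000 evaluations; with `step=10` the count grows to 27 million, which gives a sense of why the fine-grained variant is costly and why the search scales poorly with the number of apps.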
  • 53. SLO-compliant Resource Allocation: Design • Parameters:  For all apps - ms and RPS  Limit on CPU: 3000 mCPUs  Number of tests: 10 for both soft and hard limits • Evaluation results:
  • 54. SLO-compliant Resource Allocation: Design • Evaluation results: • Conclusions on method selection for resource allocation:  The hybrid method is more accurate than pure continuous optimization  Fine-grained brute force is not an option due to its high execution time  Coarse-grained brute force has good accuracy and low execution time but scales badly  The hybrid method strikes a good balance between execution time and solution quality
  • 55. SLO-compliant Resource Allocation: Design • Hence, we will allocate the resources with the hybrid approach
  • 56. Evaluation: • Method • Results • Limitations
  • 57. Evaluation Method • The validation test consists of two parts:  Preliminary Validation Test (PVT) to acquire the SLI values used as inputs to the application-wise performance model (16 times with ab); no optimization  Evaluation Validation Test (EVT) to conduct the actual evaluation based on the values from the PVT (16 times with ab); optimization is done with the hybrid approach
  • 58. Evaluation Method • Test settings:  PVT:  EVT: • Result of PVT
  • 59. Evaluation Results • The approach proved appropriate for the tested installation and SLOs:  at most 2 SLO violations out of 16 trials for the 99th-percentile response time  at most 1 SLO violation out of 16 trials for the throughput
  • 60. Limitations of the Approach • The simplistic dataset does not allow us to prove whether the approach is feasible for more complex applications • Performance models are susceptible to events that are not reflected in the input variables (such as garbage collection in Java apps) • The cost of the resources is not taken into account • Focus on CPU
  • 61. Usage Scenario: • Kubernetes' Vertical Pod Autoscaler (VPA) • Augmenting VPA
  • 62. Autoscaling in Kubernetes • Known Kubernetes autoscaling options:  Horizontal Pod Autoscaler (HPA): production-ready; changes the number of pods  Cluster Autoscaler (CA): production-ready; changes the number of instances for various cloud providers, overriding native autoscalers  Vertical Pod Autoscaler (VPA): beta; changes the amount of resources (CPU, memory) allocated to a pod, but requires a pod restart  Addon Resizer (AR): beta; simplified VPA that modifies resource requests based on the number of nodes • All options take a reactive approach • HPA supports arbitrary scaling metrics • VPA and HPA are currently incompatible when scaling on memory/CPU
  • 63. Vertical Pod Autoscaler • Vertical Pod Autoscaler (VPA) sets resource requests automatically based on usage, allowing pods to be scheduled onto nodes so that an appropriate amount of resources is available for each pod [1] • Core components of VPA: Recommender and Updater • [1] https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
  • 64. VPA: Recommender (1) • The Recommender computes the recommended resource requests for pods based on current and historical resource usage. • First, it shares the pod's minimal amount of resources equally among the containers in the pod (by default 250 MB of memory, 25 mCPUs): $r_i = \frac{1}{n_j} \cdot R_j$, where, for the given resource type, $r_i$ is the minimal amount of the resource for the $i$th container of the $j$th pod, $n_j$ is the number of containers in the $j$th pod, and $R_j$ is the minimal amount of the resource for the $j$th pod.
  • 65. VPA: Recommender (2) • Second, it uses three chains of estimators to produce the target estimate, the lower bound estimate, and the upper bound estimate of the resources to allocate to the pod. Each chain contains a percentile estimator (90%, 50%, and 95%, respectively) followed by a margin estimator (15% overhead). For the lower and upper bounds, a confidence multiplier $k = \left(1 + \frac{m}{d}\right)^{e}$ is applied. For each estimate, the maximum of the estimate and the minimal amount of the resource from the previous step is selected. The input is resource usage data collected over $d$ days. So, for either of the two resource types (CPU, memory), we have:
      target: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,90\%}\right)$
      lower bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,50\%} \cdot \left(1 + \frac{0.001}{d}\right)^{-2}\right)$
      upper bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i;\ 1.15 \cdot r_{i,95\%} \cdot \left(1 + \frac{1}{d}\right)\right)$
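The estimator chains can be written out as a short calculation. The sketch below follows the formulas as presented on the slide rather than the VPA source code; the function names and the layout of the input (precomputed usage percentiles per container) are assumptions.

```python
def confidence_multiplier(m, e, days):
    """k = (1 + m / d)^e, where d is the number of days of usage history."""
    return (1 + m / days) ** e


def container_estimates(min_share, p90, p50, p95, days, margin=0.15):
    """Target, lower-bound and upper-bound estimates for one container."""
    target = max(min_share, (1 + margin) * p90)
    lower = max(min_share, (1 + margin) * p50 * confidence_multiplier(0.001, -2, days))
    upper = max(min_share, (1 + margin) * p95 * confidence_multiplier(1, 1, days))
    return target, lower, upper


def pod_estimates(containers, pod_minimum, days):
    """Sum per-container estimates; the pod minimum is shared equally among containers."""
    min_share = pod_minimum / len(containers)
    totals = [0.0, 0.0, 0.0]
    for c in containers:
        for k, value in enumerate(
            container_estimates(min_share, c["p90"], c["p50"], c["p95"], days)
        ):
            totals[k] += value
    return tuple(totals)
```

With only one day of history the upper bound is doubled, and the multiplier approaches 1 as more days of data accumulate, so the recommended range narrows over time.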
  • 66. VPA: Updater • The Updater runs in the Kubernetes cluster and decides which pods should be restarted based on the resource allocation recommendations calculated by the Recommender. In practice, the Updater evicts the pods to be updated, whereas the actual recreation of pods with new resource requests is left to the pods' controller (e.g. Deployment/ReplicaSet). • The notable aspect of the Updater is that pods are updated in order of priority. The update priority is proportional to the fraction by which resources should be increased or decreased. Hence, the update priority for the $j$th pod is computed as follows:
      $p_j = \frac{\left|\sum_{i=1}^{n_j} CPU_i^{Req} - \sum_{i=1}^{n_j} CPU_i^{Rec}\right|}{\sum_{i=1}^{n_j} CPU_i^{Req}} + \frac{\left|\sum_{i=1}^{n_j} mem_i^{Req} - \sum_{i=1}^{n_j} mem_i^{Rec}\right|}{\sum_{i=1}^{n_j} mem_i^{Req}}$
    • Currently, the only supported update strategy is based on pod restarts.
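As a worked illustration of the priority formula, the helper below sums, per resource, the relative gap between requested and recommended amounts across the containers of a pod. The dictionary keys are hypothetical, and taking the absolute value of the gap is our reading of "increased or decreased"; this is not the Updater's actual code.

```python
def update_priority(containers):
    """Relative CPU gap plus relative memory gap between requests and recommendations."""
    priority = 0.0
    for resource in ("cpu", "mem"):
        requested = sum(c[resource + "_req"] for c in containers)
        recommended = sum(c[resource + "_rec"] for c in containers)
        priority += abs(requested - recommended) / requested
    return priority
```

A pod whose recommendation is 50% above its CPU request and 50% below its memory request gets priority 1.0, so it would be updated before a pod whose requests are already close to the recommendation.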
  • 67. Augmenting VPA: the Proposal • Add relevant VPA settings, e.g. SLOs, number of trials to get data • Add VPA Performance Data Collector • Add VPA Recommender option to use the presented approach for SLO-compliant resource allocation • Augment VPA Updater and runtimes to avoid restart of pods on vertical scaling
  • 68. Conclusions & Future Work
  • 69. Conclusions • Major contribution:  an approach to the SLO-compliant resource allocation problem for co-located containerized applications with the following steps: 1. Collecting SLI values for various resource limits and workload rates 2. Removing, via clustering, anomalies that cannot be explained through the available features 3. Learning prediction models relating SLIs to workload parameters and resource limits 4. Deriving the resource limits for application deployment via continuous optimization and limited brute-force search for known SLOs  the approach is validated with at most 2 SLO violations among 16 trials
  • 70. Conclusions • Additional findings:  an approach to selecting the model features and their parameters in order to increase the accuracy of the SLI prediction model  lasso regression-based models of degree 2 seem to suffice for predicting SLIs
  • 75. Future Work • Augment the approach and repeat the study for a larger and more realistic set of apps • Evaluate the impact of runtime/technology-specific behaviors like garbage collection on SLIs and search for predictors of such behaviors • Evaluate artificial neural networks for predicting SLIs • SLO-compliant resource allocation for individual microservices of compound applications • Derive models for the allocation of other resources like RAM