This document presents an approach for maintaining service level objectives (SLOs) of cloud-native applications via self-adaptive resource sharing. The approach involves collecting performance data under varying resource limits and workloads, removing anomalies from the data, learning prediction models to map service level indicators to resource limits and workloads, and optimizing resource allocation to meet SLOs under changing conditions. The approach is evaluated using three applications deployed on Kubernetes and aims to augment Kubernetes' vertical pod autoscaler.
Maintaining SLOs of Cloud-native Applications via Self-Adaptive Resource Sharing
1. Vladimir Podolskiy*, Michael Mayo**, Abigail Koay**,
Michael Gerndt*, Panos Patros**
*Technical University of Munich (TUM), Germany
**University of Waikato, New Zealand
IEEE SASO 2019
Umeå, Sweden, June 18th 2019
Full Paper
Cloud-based Adaptation
2. Panos Patros & Vladimir Podolskiy | Maintaining SLOs of Cloud-native Applications via Self-Adaptive…
We are…
Vladimir Podolskiy
3rd-year PhD, TUM
Predictive autoscaling and anomaly detection
Panos Patros
Lecturer in Software Engineering
Head of ORCA lab
3. • ORCA Lab at Waikato, NZ
• Started in Jan 2019
• 1 Research Assistant
• 9 Research Students
• 7 Faculty
• 2 Interns
• 8 International Collaborators (Canada, Germany, USA)
• New graduate-level course (70% Project Component)
• COMPX529 Engineering Self-Adaptive Systems
Oceania Researchers in Cloud and Adaptive-systems
Ohu Rangahau Kapua Aunoa
4. • Background:
Containerization & Cloud Computing
Resource Sharing & Container Orchestration via Kubernetes
Machine Learning & Lasso Regression
• Motivation of the Study
• Research Problem
• Data Collection
• Proposed Approach:
Anomalies Removal
Prediction of the Service Level Indicators
SLO-compliant Resource Allocation
• Evaluation:
Method
Results
Limitations
• Usage Scenario: Augmenting Vertical Pods Autoscaler (VPA) of Kubernetes
• Conclusions & Future Work
Contents
6. • Cloud
• Abstracts Computing Resources
• Containers
• Low-overhead OS-level virtualization
• Multitenancy
• Containerization of apps
• Kubernetes
• Orchestration of app containers
• Resource Management
• Soft (e.g. CPU shares) and hard (e.g. CFS quota) limits
• Machine Learning
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Linear regression with an L1 penalty that shrinks coefficients towards zero
Background
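As a minimal sketch of the lasso regression mentioned above (not the paper's code; scikit-learn and synthetic data are assumptions), the L1 penalty drives the coefficients of irrelevant features to exactly zero while keeping the informative ones:

```python
# Hedged sketch: lasso regression with scikit-learn on synthetic data,
# illustrating how the L1 penalty shrinks coefficients of irrelevant
# features to (near) zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
# model.coef_ keeps large weights on features 0 and 1 and shrinks the rest.
```

This sparsity is what makes lasso attractive for performance modeling: it effectively selects which workload and resource features matter.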
7. • Satisfying requirements at varying loads
• Engineering is for the people
• Business, Environment and Society Sustainability
• Cloud Service Level Agreements (SLAs)
• Service Level Objectives (SLOs)
• Financial penalties
• Service Level Indicators (SLIs)
• Tradeoffs
• Performance
• Resource Consumption and Cost
• Isolation and Security
• State of the art: scaling out (adding containers)
Motivation
8. • Consider a saturated container cloud
• Little/no benefit from scaling out
• Or up (finite pie)
• Instead, change resource limits
• However, CPU utilization SLIs do not mean much to end-users
• Instead, use response time and throughput
• However, hard to autonomously map to limits
• Therefore, main contribution
• Collect dataset from multitenant deployments
• Detect and remove anomalies (GC, etc.)
• Machine Learn RT/Thru performance models
• Resize containers based on loads and target SLOs
Research Problem
10. • Machine provided by STRATUS (cybersecurity) project
• Availability is a key cybersecurity requirement!
• 24-CPU Intel Xeon
• 256GB RAM
• Private and local cloud
• Performance isolation is imperative
• Can't rely on public clouds
• 8 VMs (4 CPUs + 4GB RAM)
• 1 master, 8 workers
• Kubernetes using Oracle's Vagrant Script
• Load was driven from a separate machine
• Performance isolation is imperative!
Testbed
11. 1. NGINX (app1)
• Single container image
• Replicated 7 times (Load balancer Kubernetes service
exposed)
2. IBM WebSphere Liberty Profile (app2)
• Single container image: runs IBM JVM + Liberty Profile
• Replicated 7 times (Load balancer Kubernetes service
exposed)
3. Redis + PHP Guestbook
• PHP x7, Redis-Master x1, Redis-Slave x7
Deployed Applications
12. • Separate Machine provided by University of Waikato
• 8-CPU Intel Xeon
• 16GB RAM
• Stress-Testing Script (dataset creation):
1. Select random workloads
2. Select random CPU limits (soft and hard)
3. Redeploy apps
4. Fire requests using Apache ab x16
5. Collect RT and Thru SLIs
6. Repeat x500
Load Driving
13. Approach to Maintain SLOs of Cloud-native
Applications via Self-Adaptive Resource Sharing:
Anomalies Removal
Prediction of the Service Level Indicators
SLO-compliant Resource Allocation
Limitations of the Approach
14. MAPE-K Inspired Architecture of the Solution
15. Approach Overview
1. Collecting SLI values for various resource limits and workload rates
2. Anomaly identification and removal
3. Learning prediction models 𝑆𝐿𝐼 = 𝑓(𝑊𝑜𝑟𝑘𝑙𝑜𝑎𝑑, 𝑅𝑒𝑠𝐿𝑖𝑚)
4. Deriving the resource limits for applications via optimization
16. Approach Overview (recap)
17. • Any data-based method, including prediction, is only as good as the data given to it (garbage-in-garbage-out principle)
If the input does not contain the information needed to describe the output, then the model produced by any approach won't be accurate
Anomalies Removal: Motivation
18. • Solution:
Anomalies Removal: Motivation
Add Missing
Information to the Input
(collect anew?)
Remove
the Intractable Data
from the Output
OR
19. Anomalies Removal: Motivation
GOAL – to leave only the explainable data (~normal distribution): remove the intractable data from the output
21. 1. Expectation-Maximization (EM)
Clustering with 10-fold cross-
validation to get the clusters of
similar observations
2. Find the cluster that represents
the anomalies (“too high” and
“too low” SLI values + high
standard deviation)
3. Remove the data grouped into
this cluster from the dataset and
try to fit the model to see
whether R2 score improves.
~13% of the observations are removed as anomalies
Anomalies Removal: Approach
Alternative Isolation Forest
approach gives similar ~11%
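The three steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: scikit-learn's `GaussianMixture` stands in for EM clustering, the data is synthetic, and the anomalous cluster is picked by its higher response-time spread.

```python
# Hedged sketch: EM clustering (Gaussian mixture) to isolate an anomalous
# cluster of SLI observations, then checking whether dropping it improves
# the R^2 of a simple response-time model.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
workload = rng.uniform(10, 100, size=300)
rt = 5.0 + 0.5 * workload + rng.normal(scale=2.0, size=300)  # normal behaviour
rt[:30] += rng.uniform(100, 200, size=30)                    # spikes (e.g. GC pauses)

X = np.column_stack([workload, rt])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
# Heuristic: the anomalous cluster has the higher response-time spread.
anom = np.argmax([rt[labels == k].std() for k in range(2)])
keep = labels != anom

w = workload.reshape(-1, 1)
r2_all = LinearRegression().fit(w, rt).score(w, rt)
r2_clean = LinearRegression().fit(w[keep], rt[keep]).score(w[keep], rt[keep])
# r2_clean should exceed r2_all once the anomalous cluster is removed.
```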
22. Anomalies Removal: Result
23. Approach Overview (recap)
25. • Performance model allows us to answer the following question
What performance can be achieved for the given configuration?
… without testing all the possible options.
• In our context that question becomes:
What SLI values (throughput and 99th-percentile response time) can be achieved for the given workload rate and resource limits (CPU in millicores) of our co-located containerized applications?
Learning the Performance Model: Motivation
26. • Possible answers:
Learning the Performance Model: Motivation
Analytical (Expert Modeling) OR Black Box (Machine Learning)
27. Learning the Performance Model: Motivation
Black Box (Machine Learning) …BUT WHY?
28. Learning the Performance Model: Motivation
Black Box (Machine Learning): HYPE? FUNDING? TO GET THE PAPER ACCEPTED? LAZINESS? TO GET CITATIONS?
30. Learning the Performance Model: Motivation
Black Box (Machine Learning): TOO MANY APPS TO GENERALIZE WITH FIXED MODELS
32. • Challenge – many ML approaches (linear regression, lasso regression,
neural networks). How to select?
• Via R-squared (R2)! It is a statistical measure that represents the
proportion of the variance for a dependent variable that's explained by
variables in a regression model.
Pre-Study I – ML Approach Selection
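The selection procedure can be sketched as follows (an illustration, not the paper's setup: the candidate models, data, and hyperparameters here are placeholders), comparing candidates by cross-validated R²:

```python
# Hedged sketch: selecting among candidate regressors by cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 4))   # stand-in for workload rate + resource limits
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)  # stand-in SLI

candidates = {"linear": LinearRegression(), "lasso": Lasso(alpha=0.001)}
scores = {name: cross_val_score(m, X, y, cv=10, scoring="r2").mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)   # candidate explaining the most variance
```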
33. Pre-Study I – ML Approach Selection
34. • Unresolved Questions:
What should be predicted? All? SLIs for the given application?
Specific SLI for all apps?
What degree of the polynomial should be selected for the model?
Pre-Study II – Model & Parameters Selection
35. • Option I: Independent Models (single output variable):
A) Without target variables as predictors
B) With target variables as predictors
Pre-Study II – Model & Parameters Selection
36. • Option II: Application-wise Models (two output variables):
A) Without target variables as predictors
B) With target variables as predictors
Pre-Study II – Model & Parameters Selection
37. • Option III: SLI-wise Models (three output variables):
A) Without target variables as predictors
B) With target variables as predictors
Pre-Study II – Model & Parameters Selection
38. • Option IV: All-targets Models (six output variables):
Pre-Study II – Model & Parameters Selection
39. • Model of choice – Application-wise model of degree 1 with target
variables as predictors
Learning the Performance Model: Results
[Diagram: the performance model for App 1 takes the workload rates, the resource limits, and the SLIs of Apps 2 and 3 as inputs and predicts the SLIs of App 1; the model for App 2 analogously takes the SLIs of Apps 1 and 3 and predicts the SLIs of App 2.]
40. • Reasons:
Consistent resource limits per app for the later stages
Well-balanced (High R-squared and small fitting time)
Scales well with increase in the number of apps
Learning the Performance Model: Results
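The chosen model structure can be sketched as below. This is a hedged reconstruction, not the authors' code: `MultiTaskLasso`, the feature names, and the synthetic targets are assumptions; what it shows is that app 1's model takes workload rates, resource limits, and the co-located apps' SLIs as predictors and outputs app 1's own two SLIs (degree 1).

```python
# Hedged sketch of the application-wise model (degree 1, lasso) with
# target variables of the other apps used as predictors.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(3)
n = 400
workloads = rng.uniform(1, 100, size=(n, 3))   # request rates per app
limits = rng.uniform(100, 1000, size=(n, 3))   # CPU limits (mCPUs) per app
sli_other = rng.uniform(0, 1, size=(n, 4))     # RT/Thru of apps 2 and 3

X = np.column_stack([workloads, limits, sli_other])
# Synthetic stand-ins for app 1's measured 99th-percentile RT and throughput.
y = np.column_stack([
    0.8 * workloads[:, 0] - 0.3 * limits[:, 0] / 100,
    0.5 * limits[:, 0] / 100,
]) + rng.normal(scale=0.1, size=(n, 2))

model = MultiTaskLasso(alpha=0.01).fit(X, y)
rt_pred, thru_pred = model.predict(X[:1])[0]   # predicted SLIs for app 1
```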
41. Approach Overview (recap)
44. • Service Level Objectives (SLOs) are the thresholds put on Service
Level Indicators such as throughput or response time that characterize
appropriate behavior of the system
• Example: the user should receive a response in under 800 ms for 99% of the requests, and the system should process not less than 30 requests per second
• If there are not enough resources (CPU, memory…), the requests could
end up being dropped or served in more than 800 ms
SLO-compliant Resource Allocation: Motivation
→ The system should have enough capacity to minimize SLO violations under the changing workload
45. SLO-compliant Resource Allocation: Motivation
Seems like an OPTIMIZATION PROBLEM!
46. • SLOs for ith app:
on throughput:
on response time:
• Per app-SLI pair cost functions:
for response time:
for throughput:
• Application-wise cost function:
SLO-compliant Resource Allocation: Formalism
Predicted SLIs
47. • Formulation of constrained optimization problem:
SLO-compliant Resource Allocation: Formalism
51. • Constraint: NP-hard nonlinear integer programming formulation
• Workaround: solving as continuous constrained optimization problem
• Selected optimization method: trust region-based for nonlinear
constrained optimization
• Alternatives and augmentations:
pure fine-grained brute force with step size 10 by 10 by 10 (BF-10)
pure coarse-grained brute force with step size 50 by 50 by 50 (BF-50)
pure trust region-based continuous optimization (CO)
continuous optimization with coarse-grained brute force
(Hyb = CO + BF-50)
SLO-compliant Resource Allocation: Design
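The hybrid (CO + BF-50) idea can be sketched as below. This is a toy illustration, not the paper's implementation: the cost function is a placeholder for the learned SLO-violation cost, and the capacity figure is the 3000 mCPU budget from the evaluation.

```python
# Hedged sketch of the hybrid allocation step: a continuous trust-region
# solve over the three apps' CPU limits, followed by a coarse brute-force
# refinement on a 50-mCPU grid around the continuous optimum.
import itertools
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def cost(x):
    # Placeholder for the SLO-violation cost built from the performance
    # models: here, cost simply falls as each app's limit grows.
    return sum(1e5 / xi for xi in x)

capacity = LinearConstraint(np.ones(3), -np.inf, 3000)  # shared CPU budget
bounds = [(100, 3000)] * 3
res = minimize(cost, x0=[500, 500, 500], method="trust-constr",
               bounds=bounds, constraints=[capacity])

# Coarse brute force (step 50) on the grid cell around the continuous
# optimum, so the final limits are valid integer millicore values.
grid = [np.array([50 * np.floor(v / 50), 50 * np.ceil(v / 50)]) for v in res.x]
feasible = [p for p in itertools.product(*grid) if sum(p) <= 3000]
best = min(feasible, key=cost)
```

With this symmetric toy cost the optimum splits the budget evenly; the real cost functions come from the learned per-app performance models.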
53. • Parameters:
For all apps - ms and RPS
Limit on CPU: 3000 mCPUs
Number of tests: 10 for both soft and hard limits
• Evaluation results:
SLO-compliant Resource Allocation: Design
54. • Evaluation results:
• Conclusions on method selection for resource allocation:
Hybrid method is more accurate than pure continuous optimization
Fine-grained brute force is ruled out due to its high execution time
Coarse-grained brute force has good accuracy and low execution time
but scales badly
Hybrid method finds a good balance between execution time and the
quality of solution
SLO-compliant Resource Allocation: Design
55. SLO-compliant Resource Allocation: Design
Hence, we will allocate the resources with the hybrid approach
57. • Validation test consists of two parts:
Preliminary Validation Test (PVT) to acquire the SLI values used as input to the application-wise performance model (16 times with ab); no optimization
Evaluation Validation Test (EVT) to conduct the real evaluation based
on values from PVT (16 times with ab); optimization is done with hybrid
approach
Evaluation Method
58. • Test settings:
PVT:
EVT:
Evaluation Method
Result of PVT
59. • The approach proved to be appropriate for the installation and SLOs:
at most 2 SLO violations out of 16 trials for the 99th-percentile response time
at most 1 SLO violation out of 16 trials for the throughput
Evaluation Results
60. • A simplistic dataset that does not allow proving whether the approach is feasible for more complex applications
• Performance models susceptible to influences of events that are not
reflected by the input variables (such as garbage collection in Java apps)
• Cost of the resources is not taken into account
• Focus on CPU
Limitations of the Approach
61. Usage Scenario:
Kubernetes’ Vertical Pods Autoscaler (VPA)
Augmenting VPA
62. • Known Kubernetes autoscaling options:
Horizontal Pod Autoscaler (HPA) // production-ready; changes the number of pods
Cluster Autoscaler (CA) // production-ready; changes the number of instances for various cloud providers, overriding native autoscalers
Vertical Pod Autoscaler (VPA) // beta; changes the amount of resources (CPU, memory) allocated to a pod, but requires a pod restart
Addon Resizer (AR) // beta; a simplified VPA that modifies resource requests based on the number of nodes
• All options utilize a reactive approach
• HPA supports arbitrary scaling metrics
• VPA and HPA are currently incompatible when scaling on memory/CPU
Autoscaling in Kubernetes
63. Vertical Pod Autoscaler (VPA) sets resource requests automatically based on usage, thus allowing proper scheduling onto nodes so that an appropriate amount of resources is available for each pod1)
Core components of VPA – the Recommender and the Updater
Vertical Pod Autoscaler
1) https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
64. Recommender computes the recommended resource requests for pods based on current and
historical usage of resources.
First, it equally shares the minimal amount of resources among the containers in the given pod (by default – 250 MB of memory, 25 mCPUs):

$r_i = \frac{1}{n_j} \cdot R_j$

where, for the given resource type, $r_i$ is the minimal amount of the given resource for the $i$th container of the $j$th pod, $n_j$ is the number of containers in the $j$th pod, and $R_j$ is the minimal amount of the given resource for the $j$th pod.
VPA: Recommender (1)
65. Second, it utilizes three chains of estimators to produce the target estimate, the lower-bound estimate, and the upper-bound estimate of the resources to allocate to the pod. Each chain contains a percentile estimator (90%, 50%, and 95% respectively) followed by a margin estimator (15% overhead). For the lower and upper bounds, a confidence multiplier $k = (1 + m/d)^e$ is added. The maximum of this estimate and of the minimal amount of resource from before is selected. The input is resource usage data collected over $d$ days.

So, for either of the two resource types (CPU, memory) we have:

target: $R_j = \sum_{i=1}^{n_j} \max\left(r_i,\ 1.15 \cdot r_{i,90\%}\right)$

lower bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i,\ 1.15 \cdot r_{i,50\%} \cdot \left(1 + \frac{0.001}{d}\right)^{-2}\right)$

upper bound: $R_j = \sum_{i=1}^{n_j} \max\left(r_i,\ 1.15 \cdot r_{i,95\%} \cdot \left(1 + \frac{1}{d}\right)\right)$
VPA: Recommender (2)
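The three estimator chains above can be sketched numerically. This is an illustrative reading of the formulas, not VPA source code: the usage samples are synthetic, and the target chain is expressed as the general form with $m = 0$ (no confidence multiplier).

```python
# Hedged sketch of the three VPA Recommender estimates: per container,
# a percentile of observed usage, a 15% safety margin, and a confidence
# multiplier (1 + m/d)^e that widens the bounds when only d days of
# history are available.
import numpy as np

def pod_estimate(usage_per_container, r_min, percentile, m, e, d):
    """Sum over the pod's containers of max(minimal share, chain estimate)."""
    total = 0.0
    for usage in usage_per_container:
        est = 1.15 * np.percentile(usage, percentile) * (1 + m / d) ** e
        total += max(r_min, est)
    return total

d = 8                                                            # days of history
usage = [np.random.default_rng(4).uniform(200, 400, size=1000)]  # one container, mCPUs

target = pod_estimate(usage, r_min=25, percentile=90, m=0, e=1, d=d)
lower = pod_estimate(usage, r_min=25, percentile=50, m=0.001, e=-2, d=d)
upper = pod_estimate(usage, r_min=25, percentile=95, m=1, e=1, d=d)
```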
66. The Updater runs in the Kubernetes cluster and decides which pods should be restarted based on the resource allocation recommendations calculated by the Recommender. Practically speaking, the Updater evicts the pods to be updated, whereas the actual recreation of pods with new resource requests is delegated to the particular controller of the pods (e.g. Deployment/ReplicaSet).
The only noticeable thing about the Updater is that pods are updated in order of priority. Update priority is proportional to the fraction by which resources should be increased / decreased. Hence, the update priority for the $j$th pod is computed as follows:

$p_j = \frac{\sum_{i=1}^{n_j}\left(CPU_i^{Req} - CPU_i^{Rec}\right)}{\sum_{i=1}^{n_j} CPU_i^{Req}} + \frac{\sum_{i=1}^{n_j}\left(mem_i^{Req} - mem_i^{Rec}\right)}{\sum_{i=1}^{n_j} mem_i^{Req}}$

Currently the only supported update strategy is based on pod restarts.
VPA: Updater
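The priority formula can be sketched as follows. This is an illustration with hypothetical per-container values, not VPA source code; the absolute value reflects the stated intent that both increases and decreases raise priority.

```python
# Hedged sketch of the update-priority formula: the larger the relative
# gap between requested and recommended resources (CPU plus memory),
# the earlier the pod is evicted and recreated.
def update_priority(cpu_req, cpu_rec, mem_req, mem_rec):
    """Relative CPU gap plus relative memory gap, summed over containers."""
    cpu_gap = abs(sum(cpu_req) - sum(cpu_rec)) / sum(cpu_req)
    mem_gap = abs(sum(mem_req) - sum(mem_rec)) / sum(mem_req)
    return cpu_gap + mem_gap

# Pod A is close to its recommendation; pod B is far below it,
# so pod B gets updated first.
p_a = update_priority([500], [550], [256], [300])
p_b = update_priority([500], [1500], [256], [768])
```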
67. • Add relevant VPA settings, e.g. SLOs, number of trials to get data
• Add VPA Performance Data Collector
• Add VPA Recommender option to use the presented approach
for SLO-compliant resource allocation
• Augment VPA Updater and runtimes to avoid restart of pods
on vertical scaling
Augmenting VPA: the Proposal
68. Conclusions & Future Work
69. • Major contribution:
approach to the SLO-compliant resource allocation problem for co-
located containerized applications with the following steps:
1. Collecting SLI values for various resource limits and workload rates
2. Removing anomalies that cannot be explained through available
features via clustering
3. Learning prediction models relating SLIs to parameters of workload
and resource limits
4. Deriving the resource limits for applications deployment via
continuous optimization and limited brute force search for known
SLOs
the approach is validated with at most 2 SLO violations among 16 trials
Conclusions
70. • Additional findings:
an approach to select the model features and parameters thereof in
order to increase the accuracy of the SLI prediction model
lasso regression-based models of degree 2 seem to suffice for
predicting SLIs
Conclusions
75. • Augment the approach and repeat the study for larger and more realistic
set of apps
• Evaluation of the impact of runtime/technology-specific behaviors like
garbage collection on SLIs and search for predictors for such behaviors
• Evaluation of artificial neural networks for predicting SLIs
• SLO-compliant resource allocation for individual microservices of
compound applications
• Derivation of models for allocation of other resources like RAM
Future Work