I explain how to use Requests and the Horizontal Pod Autoscaler to autoscale an application, with a YAML example of our geolocation app, at https://tech.m6web.fr/
This talk was given at our Last Friday Talk, Oct. 18.
Questions & Answers:
Q1: Is it relevant to set high values for the Requests?
The value of the Requests is taken into account for triggering the HPA.
If the app consumes a lot of resources, then yes.
If it consumes little, then autoscaling will be triggered late, or not at all (it will crash first).
In all cases, the application must hold the load beyond the value of the Requests: it can consume more.
Q2: Is it relevant to have a very high HPA max?
Yes, if the app can consume these resources under normal circumstances.
On the other hand, an HPA max at 1000 times the maximum value of the application has little interest;
it's more of a safeguard in case you ever have a bug and consume too much.
Q3: Are custom metrics defined at the Requests level?
No. Requests are CPU and RAM, notions defined at the level of the app's containers.
Metrics, custom or not, are used to define the HPA target: they are therefore defined at the HPA level.
Q4: What is the price of setting a very high HPA max?
None: these resources are not reserved until the pods are launched.
So it doesn't cost anything; it's just protection.
Q5: What is the waiting time to launch an additional node?
It depends on the cloud provider.
On AWS, for the moment, it's between 3 and 5 minutes.
So it's not instantaneous, and it can be problematic during very sharp load peaks (we are looking at overprovisioning).
Q6: What is the waiting time to scale pods?
A few seconds: we start new containers, which are created very quickly.
We use Docker containers for the moment, but Kubernetes is not restricted to them.
Q7: Can we scale on a metric's history?
Not really. We scale according to a metric, on current values.
The purpose of Kubernetes is to have an infra that automatically scales according to the current load;
predicting load is not part of its objectives.
However, it is still something that can be approximated depending on the Prometheus query we make.
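For instance, when custom metrics are served to the HPA by an adapter, the PromQL behind the metric can already smooth over recent history. A hedged sketch, assuming a prometheus-adapter setup; the metric `http_requests_total` and the rule itself are illustrative:

```yaml
# Hedged prometheus-adapter rule sketch: the HPA still reads a single
# current value, but a wide rate() window averages the last 10 minutes.
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    as: "http_requests_per_second_10m"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[10m])) by (<<.GroupBy>>)'
```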
28. Requests
Containers can request resources:
● CPU
● RAM
Those resources are guaranteed.
Helps Kubernetes to optimize the scheduling of pods.
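As a minimal sketch (container name, image, and values are illustrative, not our production manifest), Requests are declared per container in the pod spec:

```yaml
# Illustrative container entry in a pod spec: the scheduler reserves
# these amounts on a node before placing the pod there.
containers:
- name: php              # hypothetical container name
  image: php:7.3-fpm     # illustrative image
  resources:
    requests:
      cpu: 300m          # 300 millicores
      memory: 128Mi
```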
29. Requests are the highest resources an app will normally consume
They must not be considered as minimum or maximum usable resources:
Requests are not Limits
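To make the distinction concrete, a hedged sketch with both fields side by side (values are illustrative). A container may burst above its Requests; Limits are a separate, optional hard ceiling:

```yaml
# Requests are what the scheduler reserves; Limits are a hard cap.
resources:
  requests:
    cpu: 300m      # the "100%" used in the HPA calculation
    memory: 128Mi
  limits:
    cpu: 1000m     # CPU above this is throttled
    memory: 256Mi  # memory above this gets the container OOM-killed
```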
35. Example: geolocation at 16h30
● Pod Requests: 300m (millicores)
● Horizontal Pod Autoscaler target: 80%
● HPA scales from 240m
Two pods: CPU 215m and CPU 205m
Mean CPU: 210m → 70% of Requests
No need to scale
36. Example: geolocation at 18h30
● Pod Requests: 300m (millicores)
● Horizontal Pod Autoscaler target: 80%
● HPA scales from 240m
Two pods: CPU 282m and CPU 264m
Mean CPU: 273m → 91% of Requests
Need to scale up
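Plugging these numbers into the HPA's scaling formula (spelled out later in the notes) shows why a third pod gets added:

```
desiredReplicas = ceil(currentReplicas × currentValue / targetValue)
                = ceil(2 × 273m / 240m)
                = ceil(2.28)
                = 3
```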
50. By container metrics
● Horizontal Pod Autoscaler compares container metrics across all pods
● It does the same calculation as above, but for each different container
● We can therefore scale up because of Nginx or PHP independently
● It always takes the highest value to define the number of replicas
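One way to express per-container targets in YAML is the ContainerResource metric type; note that it only became available (beta) in Kubernetes 1.20, likely after this talk, so treat this sketch as a present-day equivalent with illustrative names and numbers:

```yaml
# Hedged sketch: one CPU target per container; with several metrics
# listed, the HPA computes a replica count for each and keeps the highest.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: geolocation
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: geolocation        # assumed Deployment name
  minReplicas: 2
  maxReplicas: 14
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: php         # scale if PHP gets hot...
      target:
        type: Utilization
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: cpu
      container: nginx       # ...or if Nginx does
      target:
        type: Utilization
        averageUtilization: 80
```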
55. Questions and answers available in the comments
Editor's Notes
We often talk about clouds and containers.
The problem: make the devs' apps work without overprovisioning servers. The cloud makes it possible to ensure that our bill at the end of the month is consistent with our actual consumption.
This is one of the objectives of k8s:
Optimize resources
Adapt the infra to real usage
Keep applications healthy
Kubernetes achieves these objectives in several ways
Reminder about Kubernetes: I will present the objects used in this presentation
A `pod` is the smallest unit you handle with Kubernetes.
This is an instance of the application for Kubernetes.
If you give a request to a pod, it knows how to answer it: it’s an autonomous instance of an app.
A pod can be composed of several containers.
To meet the load or improve availability, several replicas of the pod can be created.
A `node` is a machine, often virtual, that is part of the cluster.
A `cluster` groups all the machines with which Kubernetes works.
A cluster is composed of master nodes where Kubernetes' internal functionalities run and worker nodes where your applications run.
To adapt to the load, the number of pods of an application is changed.
A `HorizontalPodAutoscaler` or HPA dynamically controls the number of replicas of a pod, often according to their CPU consumption.
If our application consumes a lot: we add pods, if it consumes a little, we remove pods
Here, we have increased the load. Kubernetes noticed this and added pods to compensate for the additional load.
A `service` exposes a pod on the network - whether it is the cluster's internal network or the Internet.
It is a single entry point per application.
We always have a single entry point, the service, even when we have 30 pods spread over 23 nodes.
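A minimal Service sketch (selector and ports are illustrative):

```yaml
# Hypothetical Service: one stable entry point that load-balances
# across however many geolocation pods currently exist.
apiVersion: v1
kind: Service
metadata:
  name: geolocation
spec:
  selector:
    app: geolocation   # matches the pods' labels (assumed)
  ports:
  - port: 80           # port clients talk to
    targetPort: 8080   # port the app listens on (illustrative)
```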
That was quick; let's revisit these notions with an example.
The example is our geolocation API, which has been running in production for several months.
The geo app is a pod composed of two containers: PHP and Nginx.
A single pod is enough to respond to an HTTP request.
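Such a pod could be declared roughly like this (image names and port are illustrative, not our actual manifest):

```yaml
# Hypothetical two-container pod: Nginx terminates HTTP and hands
# requests to PHP-FPM; together they form one autonomous instance.
apiVersion: v1
kind: Pod
metadata:
  name: geolocation
  labels:
    app: geolocation
spec:
  containers:
  - name: nginx
    image: nginx:1.17        # illustrative version
    ports:
    - containerPort: 8080
  - name: php
    image: php:7.3-fpm       # illustrative version
```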
We deployed this pod in a production cluster, so that it could be executed on a node.
On the production cluster, composed of several worker nodes and several master nodes.
We defined a minimum of 2 replicas of our application, to secure its execution if one of the pods crashes.
By deploying two replicas of our pod, Kubernetes scheduled them on nodes in its cluster.
Depending on the load, the HPA will change the number of replicas of our geolocation application
The geo application is accessed through a single entry point on the network: the service.
The evolution of the number of pods is therefore transparent for clients, whether there are 2 pods or 23.
And it's working pretty well!
Here is an example with our last football game:
We started the evening with our minimum of 2 pods and for the peak load of the evening, we climbed to 14 pods of the application.
It's super cool, we autoscale our application depending on the load!
And it's quite new for us sysadmins.
It looks beautiful, it sounds almost magical like that, but it's not magic at all.
It's YAML and it's in the developers' repositories: you have your hands on it.
There are 2 things to configure (a combined sketch follows these notes):
1) The HPA: here it is configured on CPU consumption, with a target at 80% of the containers' Requests.
2) The Requests: these are reserved resources; Kubernetes will not launch the pod if these resources are not available in the cluster.
To optimize its resources, Kubernetes needs to know the size of the apps: this is the purpose of Requests.
If k8s knows the size of the app, it runs it on the right server.
Requests are reserved resources. The app can consume more or less.
Requests are the normal consumption of an app: 100% usage of the app in good conditions. However, the app must be able to handle more than 100%.
So to be able to autoscale an app in k8s, you need these two elements.
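A combined sketch of those two elements (an illustrative manifest; the numbers echo the 300m/80% example, and autoscaling/v2 stands in for the v2beta API of the time):

```yaml
# 1) Requests on the container: 300m CPU is the app's "100%".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geolocation
spec:
  replicas: 2
  selector:
    matchLabels:
      app: geolocation
  template:
    metadata:
      labels:
        app: geolocation
    spec:
      containers:
      - name: php
        image: php:7.3-fpm   # illustrative
        resources:
          requests:
            cpu: 300m
---
# 2) An HPA targeting 80% of those Requests, i.e. 240m per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: geolocation
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: geolocation
  minReplicas: 2
  maxReplicas: 14
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```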
And it works very well! Here, we see the CPU usage in relation to the number of pods: the curves evolve in the same way.
More precisely?
Does the HPA take an average, or the median of the pods? The usage of the nodes?
It is an average: it compares the average consumption of the pods with its target, the 80% of Requests.
We're going back to the previous football game to see the evolution.
Here, at 4:30 p.m., so before the game and the peak load.
At 4:30 p.m., we had 2 pods, each consuming CPU resources. However, they consumed 70% of the Requests, which is less than the HPA target.
So there's no need to scale.
The same example at 6:30 p.m.: we consume more. We're past the target: there's a need to scale up.
Resources are reserved according to demand
When the consumption of our application approaches the target, we scale
But not all peak loads are the same: how does the HPA adapt?
We saw it during our football match: sometimes we scale by 1 pod, sometimes by 4 at a time.
By how much should the HPA scale?
The HPA follows a simple formula:
It takes the number of pods,
it looks at how much resource they consume compared to what we would like them to consume,
and the result is the number of pods that would have to run to handle this load.
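In symbols, this is the formula in the Kubernetes HPA documentation, with a sanity check against the 8:47 p.m. numbers below:

```
desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)

# 8:47 p.m.: 10 pods scaled to 14, so the mean CPU per pod must have been
# above 13/10 × 240m = 312m (otherwise ceil() would have returned ≤ 13).
```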
In our example of the soccer game at 6:30 p.m.,
we had 2 pods, which started consuming a lot of resources.
The result of the calculation is 3 pods: it added 1.
Same example at 8:47 p.m.:
we used more resources, the HPA did its calculation, the result was 14 pods; we had 10, so it added 4.
That's how we scale our pods:
Kubernetes allows us this autoscaling quite simply,
we can scale until the cluster's resources are exhausted,
and it is driven by YAML, in the devs' projects: they have their hands on it.
It's about CPU cores and the like, but how does a dev become autonomous with that?
All the previous graphs come from our prod grafana with real prod metrics, accessible by devs.
Example of a dashboard that shows CPU/RAM utilization of a pod
It's YAML: it's easy to change and it's taken into account right away.
And the previous values were changed after the football match, because they were not at all optimal
Two last points before closing the subject:
The cloud allows us to add more servers for a few hours.
We use the cluster-autoscaler for this.
The HPA is controllable by metrics.
By default we use CPU consumption, but the metrics can be custom, as long as they exist in Prometheus.
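A hedged sketch of what a custom-metric HPA can look like (assumes an adapter such as prometheus-adapter exposes the metric; the metric name and target are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: geolocation-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: geolocation
  minReplicas: 2
  maxReplicas: 14
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical Prometheus-backed metric
      target:
        type: AverageValue
        averageValue: "100"              # scale when pods average >100 rps
```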
You now know how application autoscaling works in Kubernetes, and you can now do it on your own.