TRAINING THE NEXT GENERATION OF EUROPEAN FOG COMPUTING EXPERTS
Container orchestration in
geo-distributed cloud computing platforms
Keynote at HotCloudPerf
April 20th 2021
Mulugeta Ayalew Tamiru, Guillaume Pierre, Johan Tordsson and Erik Elmroth
Elastisys AB & Université de Rennes 1
1
Geo-distributed cloud platforms
2
▪ Fault tolerance
▪ Proximity
▪ Resource aggregation
▪ Regulatory compliance
Goal: reliably deploy software across the full platform
▪ Containers everywhere
• To abstract away the heterogeneity of host hardware and hypervisors
▪ Deploy potentially large numbers of containers
• If necessary: burst to a public cloud
▪ Control container placements
• Manually
• Semi-automatically: “as close as possible to X”
• Automatically: load-balanced across all locations (see the placement sketch below)
3
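To make the three placement modes concrete, here is a minimal, purely illustrative Python sketch of a cluster-selection function; the cluster names, coordinates, and load metric are invented for the example and are not part of KubeFed or mck8s.

```python
# Hypothetical illustration (not a real KubeFed API): three placement modes
# for choosing which cluster should run a container.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    location: tuple          # (latitude, longitude), simplified
    running_containers: int

def distance(a, b):
    # Naive Euclidean distance; a real scheduler would use RTT or geo-distance.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def place(clusters, mode, pinned=None, near=None):
    if mode == "manual":                       # operator picks the cluster
        return next(c for c in clusters if c.name == pinned)
    if mode == "proximity":                    # "as close as possible to X"
        return min(clusters, key=lambda c: distance(c.location, near))
    if mode == "load-balanced":                # spread across all locations
        return min(clusters, key=lambda c: c.running_containers)
    raise ValueError(f"unknown placement mode: {mode}")

clusters = [Cluster("rennes", (48.1, -1.7), 12),
            Cluster("umea", (63.8, 20.3), 3),
            Cluster("public-cloud", (50.1, 8.7), 0)]
print(place(clusters, "proximity", near=(48.8, 2.3)).name)   # -> rennes
```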
Kubernetes Federation (KubeFed)
▪ Resource management and application deployment on multiple Kubernetes clusters (member clusters) from a single control plane (host cluster)
▪ BUT: KubeFed was not specifically designed for worldwide geo-distribution
4
Experimental setup
5
▪ 1 host cluster and 5 member clusters with Kubernetes 1.14
▪ Each cluster with a master and five worker nodes
▪ Host cluster nodes: 4 vCPUs, 16 GB RAM
▪ Member cluster nodes: 4 vCPUs, 4 GB RAM
▪ Simple nginx web server app
Problem -- Instability
6
[Figure: stability]
Impact of network configuration on stability
7
Average no. of timeout errors per minute (N) and stability (υ) of the uncontrolled system for the three evaluation scenarios.
[Plot annotations: network delay / packet loss rate increased · cluster failure · network delay / packet loss rate restored · cluster restored]
KubeFed configuration parameters
8
Parameter                                  Default
Cluster Available Delay                    20s
Cluster Unavailable Delay                  60s
Leader Elect Lease Duration                15s
Leader Elect Renew Deadline                10s
Leader Elect Retry Period                  5s
Cluster Health Check Timeout               3s
Cluster Health Check Period                10s
Cluster Health Check Failure Threshold     3
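These parameters are exposed through KubeFed's KubeFedConfig custom resource. As a rough sketch, the cluster health check settings above could be patched with the official Kubernetes Python client; the API group, namespace, resource names, and field names below are assumptions from memory of KubeFed v1beta1 and should be checked against the installed CRD.

```python
# Rough sketch: adjust the cluster health check settings via the KubeFedConfig
# custom resource. Group/version/field names are assumptions (KubeFed v1beta1);
# verify them against your installed CRD before use.
from kubernetes import client, config

config.load_kube_config()                      # credentials for the host cluster
api = client.CustomObjectsApi()

patch = {"spec": {"clusterHealthCheck": {
    "timeout": "3s",            # Cluster Health Check Timeout (CHCT)
    "period": "10s",            # Cluster Health Check Period
    "failureThreshold": 3,      # Cluster Health Check Failure Threshold
}}}

api.patch_namespaced_custom_object(
    group="core.kubefed.io", version="v1beta1",
    namespace="kube-federation-system",
    plural="kubefedconfigs", name="kubefed", body=patch)
```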
Stability vs. failure detection delay
9
Solution -- Controller to adjust the Cluster Health Check Timeout (CHCT) at run-time
10
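The sketch below only illustrates the idea of a proportional controller for CHCT; the control law, gain, and bounds are assumptions for illustration, not the exact controller designed in the MASCOTS paper.

```python
# Minimal sketch of a proportional controller that adapts the Cluster Health
# Check Timeout (CHCT) at run time. Gain and bounds are illustrative only.
K_P = 3.0                          # assumed gain: timeout = K_P * observed RTT
CHCT_MIN, CHCT_MAX = 3.0, 60.0     # clamp to sensible bounds (seconds)

def next_chct(observed_rtt_s: float) -> float:
    """Return the next health-check timeout given the latest RTT sample."""
    return min(CHCT_MAX, max(CHCT_MIN, K_P * observed_rtt_s))

def control_step(member_clusters, measure_rtt, apply_timeout):
    # Pseudo-driver: each health-check period, measure the RTT to every member
    # cluster and re-apply the timeout, e.g. via the KubeFedConfig patch above.
    for cluster in member_clusters:
        rtt = measure_rtt(cluster)             # seconds
        apply_timeout(cluster, next_chct(rtt))
```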
Results -- Stationary scenario
11
Results -- Network variability scenario
12
[Plot annotations: network delay / packet loss rate increased · network delay / packet loss rate restored]
Results -- Cluster failure scenario
13
[Plot annotations: cluster failure · cluster restored]
(Temporary) conclusion
▪ We observe significant instability in KubeFed-based geo-distributed fog platforms due to:
• poor network conditions
• default / static configuration parameters
▪ We designed a proportional controller to adjust CHCT at run-time
• Improves the system stability from 83–92% with no controller to 99.5–100% using the controller
Mulugeta Tamiru, Guillaume Pierre, Johan Tordsson, Erik Elmroth. Instability in Geo-Distributed Kubernetes Federation:
Causes and Mitigation. In Proceedings of IEEE MASCOTS, Nov 2020.
14
Now that we have fixed the instability problem, is KubeFed ready to manage large-scale geo-distributed platforms?
Not quite: in KubeFed, any deployment request is pushed to the requested cluster regardless of the resource availability in that cluster.
15
Let’s replay 1 hour of the Google cluster trace, distributing each job to one of 5 clusters according to a binomial distribution (a replay sketch follows below):
▪ 3 overloaded clusters
▪ 2 mostly idle clusters
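A toy version of this replay is sketched below; the job count and the binomial parameters are invented for illustration and only mimic the skew described above (a few clusters get most jobs, the rest stay mostly idle).

```python
# Toy replay: jobs from a trace are assigned to one of 5 clusters, with the
# target index drawn from a binomial distribution so that low-index clusters
# end up overloaded while the others stay mostly idle. Parameters are
# illustrative, not the exact experimental setup.
import random
from collections import Counter

N_CLUSTERS = 5
jobs = range(1000)                     # stand-in for 1 hour of trace jobs
assignment = Counter()

for job in jobs:
    # Binomial(n=4, p=0.35) yields values 0..4 skewed towards low indices,
    # so clusters 0-2 receive most jobs and clusters 3-4 stay mostly idle.
    target = sum(random.random() < 0.35 for _ in range(N_CLUSTERS - 1))
    assignment[f"cluster-{target}"] += 1

print(assignment.most_common())
```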
Problems to address
▪ Make sure applications are not deployed in overloaded clusters (see the placement sketch below)
• Even if this requires choosing another cluster automatically…
▪ Support application autoscaling in multi-cluster environments
• Vary the number of replicas within a single cluster…
• … or across multiple clusters
▪ Allow the system to burst out to a public cloud in case of resource overload
• And retract public-cloud resources as early as possible
▪ Seamlessly integrate into existing KubeFed platforms
16
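The first requirement boils down to a placement policy of the following shape; this is only a sketch under assumed names and thresholds, not the actual mck8s scheduler.

```python
# Sketch of the placement logic called for above: honour the requested cluster
# if it has capacity, otherwise fall back to another on-premises cluster, and
# only burst to the public cloud as a last resort. Names and numbers invented.
def pick_cluster(requested, clusters, cpu_request, burst_cluster):
    """clusters maps name -> free CPU (cores); requested is the user's choice."""
    if clusters.get(requested, 0) >= cpu_request:
        return requested                               # honour the request
    candidates = {n: free for n, free in clusters.items()
                  if free >= cpu_request and n != burst_cluster}
    if candidates:                                     # least-loaded alternative
        return max(candidates, key=candidates.get)
    return burst_cluster                               # last resort: cloud burst

clusters = {"paris": 0.5, "rennes": 6.0, "umea": 1.0, "gce-burst": 64.0}
print(pick_cluster("paris", clusters, cpu_request=2.0,
                   burst_cluster="gce-burst"))         # -> rennes
```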
17
Deploy mcd-app-1 across the two clusters that receive the most network traffic
Make sure end-user requests are distributed across both clusters
18
Autoscale the application deployment to maintain reasonable CPU usage
Dynamically provision more resources from the public cloud if necessary (see the autoscaling sketch below)
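As a rough illustration of this step, the sketch below applies the standard Kubernetes HPA scaling rule to the total replica count and spills the overflow to a public cloud cluster; the capacity split and threshold are assumptions, not mck8s's actual autoscaler.

```python
# Sketch of multi-cluster autoscaling with cloud bursting: grow or shrink the
# total replica count to keep average CPU near a target, place as many replicas
# as possible on-premises, and push the remainder to a public cloud cluster.
import math

def desired_replicas(current, avg_cpu, target_cpu=0.6):
    # Standard HPA rule: replicas * observed / target, rounded up.
    return max(1, math.ceil(current * avg_cpu / target_cpu))

def spread(total, onprem_capacity):
    """Split replicas between on-prem clusters and the public cloud."""
    onprem = min(total, onprem_capacity)
    return {"on-prem": onprem, "public-cloud": total - onprem}

current = {"on-prem": 8, "public-cloud": 0}
total = desired_replicas(sum(current.values()), avg_cpu=0.9)   # 8*0.9/0.6 -> 12
print(spread(total, onprem_capacity=10))   # -> {'on-prem': 10, 'public-cloud': 2}
```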
19
Conclusion
Geo-distributed Kubernetes federations are now:
▪ Stable
▪ Resource availability aware
▪ Network traffic and network latency aware
▪ Burstable between available clusters, and to the public cloud
mck8s is available: https://github.com/moule3053/mck8s
Mulugeta Tamiru, Guillaume Pierre, Johan Tordsson, Erik Elmroth. mck8s: an orchestration platform for geo-distributed
multi-cluster environments. In Proceedings of ICCCN, Jul 2021.
20
The FogGuru project has received funding from the European Union’s
Horizon 2020 research and innovation programme under the Marie
Skłodowska-Curie grant 765452.
TRAINING THE NEXT GENERATION
OF EUROPEAN FOG COMPUTING EXPERTS
www.fogguru.eu
21
