Testing kubernetes and_open_shift_at_scale_20170209

Testing Kubernetes and OpenShift
@ Scale
Mike Fiedler - OpenShift System Test

Agenda
● Kubernetes/OpenShift runtimes & scalability goals
● OpenShift system testing: what does it cover?
● Installing large clusters
● Scalability test tools (the Kubernetes performance
test repo and the OpenShift SVT repo)
● Sample results

K8s and OpenShift runtimes
● Primarily targeted at cloud platforms
○ Amazon EC2, Google Cloud Platform, Microsoft Azure
○ Enterprise-hosted cloud offerings/infra
○ On-prem cloud infra such as OpenStack
○ Bare metal and other virtualization environments, too
● Cluster sizes from all-in-one dev/sandbox to
multi-master, 1000+ nodes or federated clusters

What does a cluster look like?
AWS sample (architecture diagram):
● Control Plane
○ master1 + etcd1 (SSD)
○ master2 + etcd2 (SSD)
○ master3 + etcd3 (SSD)
● Infrastructure Group
○ infra1: HAProxy router1, docker-registry1
○ infra2: HAProxy router2, docker-registry2
● Nodes
○ node 1, node 2, ... node 1000
● Persistent Volume Storage
○ EBS (Persistent Volumes)
○ S3 (Registry)
● Load balancers
○ Application ELB (Routes)
○ External ELB (Console), facing the Internet
○ Int ELB (Nodes)

Kubernetes SIG-scale
● Scalability special interest group
○ https://github.com/kubernetes/community/tree/master/sig-scalability
● Container workload is what matters - listen to your applications
○ The numbers here are more “control plane” - think small pods/containers
● Stated future goals:
○ Assumption: cores/node = 64 (higher in the future)
○ Pods/core = 10 (depends on workload)
○ Pods/node = 500 - 640 (depends on workload, these would be small pods)
○ Nodes/cluster = 5000
○ Pods/cluster = 500,000 (note: less than nodes x pods/node)
○ pod startup time < 5 seconds
○ Schedule 100 pods/second
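
The note on pods/cluster can be checked with simple arithmetic. The following is a minimal sketch that uses only the goal figures listed above; the variable names are illustrative and not taken from any SIG-scale tooling.

```python
# Stated future goals from the SIG-scale list above.
CORES_PER_NODE = 64
PODS_PER_CORE = 10
NODES_PER_CLUSTER = 5000
PODS_PER_CLUSTER = 500_000
SCHEDULE_RATE = 100  # pods scheduled per second

# Per-node ceiling implied by cores/node x pods/core.
pods_per_node_ceiling = CORES_PER_NODE * PODS_PER_CORE

# If every node ran at that ceiling, the cluster would far exceed the pod goal.
pods_if_all_nodes_full = NODES_PER_CLUSTER * pods_per_node_ceiling

# Average density actually implied by the cluster-wide pod goal.
avg_pods_per_node = PODS_PER_CLUSTER / NODES_PER_CLUSTER

# Time to schedule the full pod goal at the target scheduling rate.
seconds_to_schedule_all = PODS_PER_CLUSTER / SCHEDULE_RATE

print(pods_per_node_ceiling)    # 640
print(pods_if_all_nodes_full)   # 3200000
print(avg_pods_per_node)        # 100.0
print(seconds_to_schedule_all)  # 5000.0
```

In other words, the cluster-wide goal corresponds to an average of 100 pods per node, well under the per-node goal, so not every node is expected to be full at the same time.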

Current OpenShift numbers
● Nodes/cluster = 1000
● Pods/core = 10 (default, tunable)
● Pods/node = 250
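
How the two per-node limits interact can be sketched as follows, assuming the lower of the pods/core and pods/node limits applies on a given node. The function name and the example core counts are hypothetical.

```python
# Current OpenShift numbers from the list above.
NODES_PER_CLUSTER = 1000
PODS_PER_CORE = 10    # default, tunable
PODS_PER_NODE = 250


def max_pods_for_node(cores, pods_per_core=PODS_PER_CORE,
                      pods_per_node=PODS_PER_NODE):
    """Return the effective pod limit, assuming the lower limit wins."""
    return min(cores * pods_per_core, pods_per_node)


# Example core counts are illustrative only.
for cores in (4, 16, 40):
    print(cores, max_pods_for_node(cores))

# Upper bound on pods if every node reached the per-node limit.
print(NODES_PER_CLUSTER * PODS_PER_NODE)  # 250000
```

Under this assumption, small nodes are bounded by pods/core and nodes with more than 25 cores are bounded by the pods/node limit.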

System Test team in Red Hat
● Kubernetes and OpenShift Scalability
○ Cluster horizontal scale
■ # of nodes
■ # of running pods across all nodes
■ application traffic
○ Node vertical scale
■ # of pods running on a single node
■ workload that a single node can support (applications, builds, storage)
○ Application scalability
■ Scale # of application replicas up/down

● Performance
○ Resource usage and response times for scenarios and workloads
■ Application workload and access performance
■ Builds (OpenShift)
■ Metrics and Log collection
○ OpenShift infrastructure performance
■ Resource usage of processes under load
■ Network (SDN) throughput
■ Routing
■ Storage (EBS, Ceph, Gluster, Cinder, etc)

● Reliability
○ Simulated user workloads
■ monthly, weekly, daily, hourly, minute activities
■ accelerated to run faster than real-time
○ Run for extended periods and measure CPU, memory, I/O,
network over time
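
One way to picture "accelerated to run faster than real-time" is to divide each activity's real-world interval by a speed-up factor. This is only a sketch of the idea, not the SVT reliability tooling: the factor of 60, the 30-day month, and all names are assumptions.

```python
# Real-world intervals for the activity classes named above, in seconds.
REAL_INTERVALS_SECONDS = {
    "minute": 60,
    "hourly": 3600,
    "daily": 86400,
    "weekly": 604800,
    "monthly": 2592000,  # assumes a 30-day month
}


def accelerated_intervals(speedup):
    """Scale every interval down by the speed-up factor."""
    return {name: secs / speedup
            for name, secs in REAL_INTERVALS_SECONDS.items()}


def occurrences(test_duration_seconds, speedup):
    """Count how often each activity fires during one test run."""
    scaled = accelerated_intervals(speedup)
    return {name: int(test_duration_seconds // interval)
            for name, interval in scaled.items()}


# A 24-hour run at a hypothetical 60x speed-up.
print(occurrences(24 * 3600, speedup=60))
```

At 60x, a single day of testing covers roughly two simulated months of user activity, which is what makes long-period activities testable at all.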

SVT Challenges/Fun
● Installation
○ 1000+ node installs are time-consuming (multiple hours)
○ On public cloud providers, time = $$$. Maximize time spent testing
○ A 500-node test cluster on AWS costs around USD $1,500 - 2,000/day
● Verifying that a cluster is viable
○ Don’t waste time on buggy clusters
● Loading up a cluster with application containers
● Putting a workload on the cluster
● Collecting performance data in large clusters
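
The cost pressure can be made concrete with the figures above. A minimal sketch: the daily cost range and node count come from the slide, while the 3-hour install duration is a hypothetical example of "multiple hours".

```python
NODES = 500
DAILY_COST_USD = (1500, 2000)  # low and high estimates from the slide


def cost_per_node_hour(daily_cost, nodes=NODES):
    """Average cost of one node for one hour."""
    return daily_cost / nodes / 24


def cost_of_hours(daily_cost, hours):
    """Cost of keeping the whole cluster up for the given number of hours."""
    return daily_cost * hours / 24


for daily in DAILY_COST_USD:
    print(daily,
          round(cost_per_node_hour(daily), 3),
          cost_of_hours(daily, hours=3))  # 3 hours is an assumed install time
```

Every hour spent installing or debugging a broken cluster is paid for at the full cluster rate, which is why install time and early viability checks matter.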

Install
● Upstream Kubernetes has a variety of install methods
○ Scripted, kubeadm, GKE, Ansible
● OpenShift install is Ansible-based
○ RPM install + pull of docker images. Containerized install available, too.
○ Network intensive - try to minimize downloads
● OpenShift SVT Gold image provisioner
○ Watches for new builds of OpenShift - multiple per week
○ Creates AMI and qcow2 images for every OpenShift puddle
○ RHEL OS setup, filesystem setup, tools
○ Pre-install OpenShift RPMs and pre-pull all docker images
○ Clone git repos and install performance tools

Install
● Ansible install and then verify cluster “core”
○ masters, etcd, load balancer, infrastructure + 3-5 app nodes
○ run e2e conformance tests (more on this later)
● Scale up with additional application nodes
● Playbooks:
○ github.com/openshift/openshift-ansible/playbooks/byo/config.yml
○ github.com/openshift/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml
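
The two-phase install can be sketched as a pair of ansible-playbook invocations. This sketch only builds and prints the commands; it assumes a local checkout of openshift-ansible as the working directory, and the inventory file name is hypothetical.

```python
import shlex

# Playbook paths from the slide, relative to an openshift-ansible checkout.
CONFIG_PLAYBOOK = "playbooks/byo/config.yml"
SCALEUP_PLAYBOOK = "playbooks/byo/openshift-node/scaleup.yml"


def playbook_command(inventory, playbook):
    """Build the argument list for one ansible-playbook run."""
    return ["ansible-playbook", "-i", inventory, playbook]


def install_plan(inventory):
    """Phase 1 installs the cluster core, phase 2 adds application nodes."""
    return [
        playbook_command(inventory, CONFIG_PLAYBOOK),
        playbook_command(inventory, SCALEUP_PLAYBOOK),
    ]


if __name__ == "__main__":
    # "hosts.inventory" is a placeholder inventory file name.
    for cmd in install_plan("hosts.inventory"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

The conformance check of the cluster core would run between the two commands, so that a broken core is caught before the large scale-up is paid for.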

AWS sample cluster (architecture diagram, repeated):
● Control Plane: master1 + etcd1, master2 + etcd2, master3 + etcd3 (each on SSD)
● Infra Group: infra1 (HAProxy router1, docker-registry1), infra2 (HAProxy router2, docker-registry2)
● Nodes: node 1, node 2, ... node 1000
● Persistent Volume Storage: EBS (Persistent Volumes), S3 (Registry)
● Load balancers: Application ELB (Routes), External ELB (Console) facing the Internet, Int ELB (Nodes)

Kubernetes e2e and perf test
● e2e (end-to-end) tests
○ https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md
○ A subset of the e2e tests is tagged as Conformance.
○ Conformance = minimum supported functionality for an operational cluster
○ OpenShift adds further Conformance tests, available by running
yum install atomic-openshift-tests on top of OpenShift
● Performance tests
○ https://github.com/kubernetes/perf-tests
○ Work in progress

OpenShift SVT repo
● https://github.com/openshift/svt
● Tools for OpenShift performance, scale, reliability
○ cluster load-up
○ traffic generation
○ concurrent builds, deployments, pod start/stop
○ reliability testing
○ network performance
○ logging and metrics tests
● Automated and executed from Jenkins

Cluster load-up
● cluster-loader - Python tool that quickly loads clusters according to a YAML test
specification. Takes advantage of OpenShift’s template capabilities
● Can be used with Kubernetes or OpenShift
● The SVT repository has sample YAML configurations for node vertical, cluster horizontal,
and “Quick Start” applications with and without persistent storage.
“I want an environment with thousands of deployments, pods (with persistent storage), build
configurations, routes, services, secrets and more…”
projects:
  - num: 1000
    basename: nginx-explorer
    tuning: default
    templates:
      - num: 10
        file: cluster-loader/nginx.yaml
      - num: 20
        file: cluster-loader/explorer-pod.yaml
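
To get a feel for how much a small specification expands, the sketch below counts template instantiations for the sample above. It is not part of cluster-loader: the spec is written as a Python dict to avoid a YAML dependency, and it assumes that a template's num means copies per project.

```python
# The sample specification above, expressed as a Python dict.
SPEC = {
    "projects": [
        {
            "num": 1000,
            "basename": "nginx-explorer",
            "tuning": "default",
            "templates": [
                {"num": 10, "file": "cluster-loader/nginx.yaml"},
                {"num": 20, "file": "cluster-loader/explorer-pod.yaml"},
            ],
        }
    ]
}


def template_instances(spec):
    """Total instantiations per template file across all projects."""
    totals = {}
    for project in spec["projects"]:
        for template in project["templates"]:
            # Assumption: each project gets template["num"] copies.
            count = project["num"] * template["num"]
            totals[template["file"]] = totals.get(template["file"], 0) + count
    return totals


totals = template_instances(SPEC)
print(totals)
print(sum(totals.values()))  # 30000
```

Under that assumption, nine lines of YAML describe 1000 projects and 30,000 template instantiations, each of which may create several API objects.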

Cluster traffic generation
● cluster-loader can also run in traffic generation mode
● Runs a JMeter pod to generate traffic against applications (installed
by cluster-loader or otherwise)
● Hit rate, throughput, response codes, response times, etc
● Discovers applications, exposed routes, etc
● Currently OpenShift only, but working on an upstream version.

Short Demo
Cluster-loader Demo

Performance Tools
● PBench: Performance and Benchmark Analysis
Framework
○ pbench-agent: collection agent and harness for running tests.
■ Collects data from sar, vmstat, iostat, pidstat, perf, etc
■ Extensible: additional data collectors can be added
■ Packages raw data from a test and ships it to pbench-server
○ pbench-server: processes raw data from all systems under test
○ web-server: provides visualization of data
https://github.com/distributed-system-analysis/pbench

Loading 250 pods/node, 20 pods at
a time with 3-minute pauses
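
The shape of this load-up can be worked out from the three numbers in the caption. A minimal sketch, assuming a pause between batches only (none after the last one) and ignoring the time the pods themselves take to start; the function name is illustrative.

```python
import math


def load_schedule(total_pods, batch_size, pause_minutes):
    """Return (number of batches, size of last batch, total pause minutes)."""
    batches = math.ceil(total_pods / batch_size)
    last_batch = total_pods - (batches - 1) * batch_size
    pause_total = (batches - 1) * pause_minutes
    return batches, last_batch, pause_total


print(load_schedule(total_pods=250, batch_size=20, pause_minutes=3))
# (13, 10, 36)
```

So reaching 250 pods on a node takes 13 batches, the last one with 10 pods, and at least 36 minutes of pause time.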

Master 1 - the controller leader for
most of the run
Master 2 - has to take over as controller
leader when Master 1 fails
Loading on an OSP 8 (OpenStack Platform 8) cluster:
● 500 nodes
● 20K projects
● 52K pods
Masters have 40 vCPUs and peak at
22 cores used.
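
The densities behind these results follow directly from the numbers above. A short sketch with illustrative variable names:

```python
# Figures from the OSP 8 run above.
NODES = 500
PROJECTS = 20_000
PODS = 52_000
MASTER_VCPUS = 40
MASTER_PEAK_CORES = 22

pods_per_node = PODS / NODES
pods_per_project = PODS / PROJECTS
master_peak_utilization = MASTER_PEAK_CORES / MASTER_VCPUS

print(pods_per_node)                       # 104.0
print(round(pods_per_project, 2))          # 2.6
print(round(master_peak_utilization, 2))   # 0.55
```

The run averaged 104 pods per node, and the masters peaked at a little over half of their available vCPUs.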

Create/delete hundreds of pods: Amazon EBS IOPS credit exhaustion - the AWS “I/O
cliff”
gp2 EBS volumes on EC2 can run “fast” until their IOPS credits are exhausted
After that, they are throttled to 3 IOPS/GB until credits build back up
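
How soon a volume hits the cliff can be estimated from its size. The sketch below assumes the gp2 burst model as AWS documented it around the time of this talk: a 5.4 million credit bucket, bursts up to 3,000 IOPS, and a baseline of 3 IOPS/GiB with a 100 IOPS floor. Those constants and all names are assumptions to be checked against current AWS documentation.

```python
BURST_IOPS = 3000             # assumed gp2 burst ceiling
BASELINE_IOPS_PER_GIB = 3     # throttled rate quoted on the slide
MIN_BASELINE_IOPS = 100       # assumed gp2 baseline floor
CREDIT_BUCKET = 5_400_000     # assumed size of a full I/O credit bucket


def baseline_iops(size_gib):
    """Sustained IOPS the volume earns from its size."""
    return max(MIN_BASELINE_IOPS, size_gib * BASELINE_IOPS_PER_GIB)


def burst_seconds(size_gib, demand_iops=BURST_IOPS):
    """Seconds until a full bucket is drained, or None if it never drains."""
    drain_rate = min(demand_iops, BURST_IOPS) - baseline_iops(size_gib)
    if drain_rate <= 0:
        return None
    return CREDIT_BUCKET / drain_rate


for size in (20, 100, 1000):
    print(size, baseline_iops(size), burst_seconds(size))
```

Under these assumptions a 100 GiB volume driven at full burst empties its bucket in about 33 minutes, which a sustained create/delete test can easily exceed. Larger volumes, or provisioned IOPS volumes, avoid the cliff.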

Resources
Kubernetes sig-testing: https://kubernetes.slack.com/messages/sig-testing/
Kubernetes sig-scale: https://kubernetes.slack.com/messages/sig-scale/
OpenShift IRC: #openshift-dev
OpenShift SVT repo: https://github.com/openshift/svt
