Making Kubernetes Production Ready
Harry Zhang
Abhinav Das
VoIP or Dial-in (see chat)
Questions? Send via the GTW ‘questions’ chat
But first, some quick housekeeping
• We have about 40 minutes of content but more time for questions
• We will post slides today and email a video by Monday
• Send any questions via the GTW ‘questions’ chat
• If audio fails, let us know on chat! We will re-dial in quickly…
• Apologies for the train that goes by at about the :24 minute mark
July 21, 2017
Who are we?
Harry Zhang
Abhinav Das
A Short Poll
About Applatix
• Platform to build and run containerized apps in the cloud
▪ Built on Kubernetes
• Simplify the journey to the cloud with:
▪ Infrastructure automation
▪ End to end DevOps workflows
▪ Monitoring, audit and governance
Outline
• What is “Production Ready” to us
• Kubernetes Design at a Glance
• How we hardened Kubernetes Master
• How we hardened Kubernetes Minion
What is “Production Ready”?
Our Workload
(Diagram: Workflow 1 and Workflow 2, each composed of Tasks)
Our Workload
High Pod churn
• Large number of Pods created and deleted per unit time
Current Applatix Production Workload
Everybody talks to the API server!
• 20+ Controllers
• All Kubelets
• All Kube-Proxy
• Scheduler
• Add-ons
• Other custom microservices
Default configurations
do not work for us!
Kubernetes At A Glance
Problem 1: Master is Crashing
When the master crashes …
Knobs to manage the API server
• Throttle API requests: --max-requests-inflight
  ▪ Rule of thumb: we use 1 in-flight request per 2 Pods
• Control memory consumption: --target-ram-mb
  ▪ Configures the watch cache and deserialization cache
  ▪ Rule of thumb: we use 2.5 MB per Pod (see the sketch below)
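As a concrete sketch, here is how these two flags might be set for a hypothetical cluster of about 2,000 Pods, sized with the rules of thumb above (the cluster size and the shell invocation are illustrative assumptions, not values from the talk):

```bash
# Hypothetical sizing for a ~2,000-Pod cluster using the rules of thumb above:
#   1 in-flight request per 2 Pods -> 2000 / 2   = 1000
#   2.5 MB per Pod                 -> 2000 * 2.5 = 5000 MB
kube-apiserver \
  --max-requests-inflight=1000 \
  --target-ram-mb=5000
  # ...plus the rest of your existing API server flags
```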
Knobs for the controller manager
• Control level of parallelism:
  ▪ --concurrent-deployment-syncs
  ▪ --concurrent-endpoint-syncs
  ▪ --concurrent-gc-syncs
  ▪ --concurrent-namespace-syncs
  ▪ --concurrent-replicaset-syncs
  ▪ --concurrent-resource-quota-syncs
  ▪ --concurrent-service-syncs
  ▪ --concurrent-serviceaccount-token-syncs
  ▪ --concurrent-rc-syncs
• Rule of thumb:
  ▪ Set a large value for the controllers you use frequently and that require fast response
  ▪ For example, our production cluster can have a couple hundred Deployments, so we assign 20 workers each to Deployment, ReplicaSet, and ReplicationController syncs (see the sketch below)
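For illustration, a minimal sketch of those settings on the controller-manager command line; the 20-worker value comes from the slide, everything else is assumed:

```bash
# 20 workers each for the controllers we exercise most heavily
# (Deployments, ReplicaSets, ReplicationControllers); others stay at defaults.
kube-controller-manager \
  --concurrent-deployment-syncs=20 \
  --concurrent-replicaset-syncs=20 \
  --concurrent-rc-syncs=20
  # ...plus the rest of your existing controller-manager flags
```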
Knobs for the controller manager
• Control memory consumption:
  ▪ --replication-controller-lookup-cache-size
  ▪ --replicaset-lookup-cache-size
  ▪ --daemonset-lookup-cache-size
• Rule of thumb:
  ▪ Available only in versions prior to 1.6
  ▪ We use ~4G/4G/1G respectively for the three flags on our production cluster, and scale them down based on master resources for other cluster types
Knobs to control API calls
• Throttle API query rate: --kube-api-qps, --kube-api-burst
• Rule of thumb:
  ▪ Keep the API server's maximum in-flight request limit in mind when setting these
  ▪ We set 3 QPS per 10 Pods for the scheduler, with burst set to double that (see the sketch below)
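Applying that rule of thumb to the same hypothetical 2,000-Pod cluster (the Pod count is an assumption for illustration):

```bash
# 3 QPS per 10 Pods -> 2000 * 3 / 10 = 600 QPS; burst is double the QPS.
kube-scheduler \
  --kube-api-qps=600 \
  --kube-api-burst=1200
  # ...plus the rest of your existing scheduler flags
```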
Admission Control
• Another observation: if we let Pod creation go unconstrained, the Kubernetes master became unstable
  ▪ We have an admission controller that manages the creation of Pods
  ▪ This ensures we only create Pods that the cluster has the resources to run (a rough stock analogue is sketched below)
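The admission controller described here is custom to Applatix and is not shown in the talk. As a rough analogue using stock Kubernetes, the ResourceQuota admission plugin rejects Pod creation once a namespace's resource budget would be exceeded; the namespace and limits below are illustrative:

```bash
# Reject new Pods at admission time once the namespace's budget is exhausted.
kubectl create -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-budget
  namespace: workflows   # illustrative namespace
spec:
  hard:
    pods: "500"          # cap on concurrently existing Pods
    requests.cpu: "200"  # total CPU that Pods may request
    requests.memory: 400Gi
EOF
```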
Further Reduce Master Workload
Problem 2: Minion Becomes “Unhealthy”
Many things can go wrong
What we do
Kernel CFS Bug (Kubernetes Issue #874)
[ 3960.004144] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[ 3960.008059] IP: [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059] PGD 6e7bd7067 PUD 72813c067 PMD 0
[ 3960.008059] Oops: 0000 [#1] SMP
[ 3960.008059] Modules linked in: xt_statistic(E) xt_nat(E) ......
[ 3960.008059] CPU: 4 PID: 10158 Comm: mysql_tzinfo_to Tainted: G E 4.4.41-k8s #1
[ 3960.008059] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 3960.008059] task: ffff8807578fae00 ti: ffff88075f028000 task.ti: ffff88075f028000
[ 3960.008059] RIP: 0010:[<ffffffff810b332f>] [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059] RSP: 0018:ffff88075f02be38 EFLAGS: 00010046
[ 3960.008059] RAX: 0000000000000000 RBX: ffff8807250ff400 RCX: 0000000000000000
[ 3960.008059] RDX: ffff88078fc95e30 RSI: 0000000000000000 RDI: ffff8807250ff400
[ 3960.008059] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88076bc13700
[ 3960.008059] R10: 0000000000001cf7 R11: ffffea001c98a100 R12: 0000000000015dc0
[ 3960.008059] R13: 0000000000000000 R14: ffff88078fc95dc0 R15: 0000000000000004
[ 3960.008059] FS: 00007fa34b7f6740(0000) GS:ffff88078fc80000(0000) knlGS:0000000000000000
[ 3960.008059] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3960.008059] CR2: 0000000000000080 CR3: 000000067762d000 CR4: 00000000001406e0
[ 3960.008059] Stack:
[ 3960.008059] ffff8807578fae00 0000000000001000 0000000200000000 0000000000015dc0
[ 3960.008059] ffff88078fc95e30 00007fa34b7fc000 000000005ef04228 ffff88078fc95dc0
[ 3960.008059] ffff8807578fae00 0000000000015dc0 0000000000000000 ffff8807578fb2a0
[ 3960.008059] Call Trace:
[ 3960.008059] [<ffffffff8159cd1f>] ? __schedule+0xdf/0x960
[ 3960.008059] [<ffffffff8159d5d1>] ? schedule+0x31/0x80
[ 3960.008059] [<ffffffff810031cb>] ? exit_to_usermode_loop+0x6b/0xc0
[ 3960.008059] [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[ 3960.008059] [<ffffffff815a1518>] ? int_ret_from_sys_call+0x25/0x8f
[ 3960.008059] Code: c6 44 24 17 00 eb ......
[ 3960.008059] RIP [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059] RSP <ffff88075f02be38>
[ 3960.008059] CR2: 0000000000000080
[ 3960.008059] ---[ end trace e1b9f0775b83e8e3 ]---
[ 3960.008059] Kernel panic - not syncing: Fatal exception
What we do
Summary
• Kubernetes resource consumption is directly related to the number of Pods and the rate of Pod churn
• Find a balance among performance, stability, and cost
• Kubernetes is stable and production ready
Thank You!
Q&A
