Container orchestration and microservices world
Karol Chrapek
a story about container orchestration
Novomatic Technologies Poland
● R&D center for Novomatic
● was established in 1997 (20 years)
● more than 300 specialists
● focusing on high-tech gaming technologies and entertainment market
● more info here: novomatic-tech.com
Why do we need containers in NTP?
● Unified deployment method.
● Accelerate software development, deployment and shipping processes.
● Simplify cooperation with different teams / companies inside the Novomatic group.
● Reduce the need to maintain dev infrastructure in each project.
● Solve problems with some legacy libraries and hardware.
Container evolution in NTP
● The “Think tank team” (TTT) experimented with containers:
○ speed up CI phase
○ simplify deployment and upgrade processes
○ run them everywhere (local test ;))
● TTT created “Container’s Eleven” and gave a few internal presentations.
● More teams decided to use containers for test purposes.
● A few small projects started using Docker in production.
● We needed a container platform solution at scale.
● TTT deployed the first Kubernetes dev cluster in NTP.
● The “DevOps team” (DOT) took responsibility for the K8s stacks.
● DOT created new clusters inside NTP.
@Hefzul Bari
Developers’ needs?
● Easy to run and share with other teams.
● Reduce the number of issues forwarded to the infrastructure team.
● One orchestration method/tool for local and production environments.
● A platform ready for public clouds.
● Support of legacy apps and their dependencies.
● Learn something new.
Business needs?
● Reducing deployment and scalability windows.
● Run on both classes of hardware: commodity and enterprise.
● The same deployment model for different environments and teams.
● Reducing the performance degradation window during a failure.
● All new products should increase environment stability.
● Most of our clients require an on-premise solution.
Why did we choose Kubernetes?
● We tested different tools and chose the one that fits our model “best”.
● Currently k8s is the container orchestration “standard”.
● All main cloud providers support Kubernetes (GKE, AKS, EKS).
● Some clients run their own on-premise Kubernetes infra and some teams prefer cloud providers, but the software deployment method stays the same.
● Approved by development teams and clients.
● Open source software.
Development environments
● previous: one k8s cluster provisioned via custom bash scripts
● now: three two k8s clusters provisioned via Kubespray
○ 8-10 nodes
○ all nodes are virtual machines on Cisco stack
● some developers use Minikube
● sometimes additional test envs are exposed by our clients
PaaS - requirements
Operations:
● multi-datacenter
● high availability
● easy provisioning
● on demand scalability
● security
Developers:
● config management
● secret management
● service discovery
● blue-green deployment
● tracing
Both:
● telemetry
● logging
● self-healing
● rolling update
@Damien Pollet - flickr
Lessons learned
#1 Kubernetes is a distributed platform
#1.1 Kubernetes architecture
#2 Kubernetes as a PaaS core
#2.1 Kubernetes as a PaaS core
Platform [1]:
- Distribution (55)
- Hosted (34)
- Installer (18)
Others:
- Application definition & Image Build [2]
- Service Proxy [3]
- Service Mesh [4]
- Network [5]
- Security [6]
- Observability [7]
- Storage [8]
#3 Kubernetes - cutting edge vs prod grade
API components (Kubernetes 1.14) and their API versions:
● CronJob - v1beta1
● Ingress - v1beta1
● PodSecurityPolicy - v1beta1
● CSI Driver - v1beta1
#4 Etcd - replication and consistency
Problems:
● Etcd size sometimes starts growing and grows … [#8009]
● A network glitch can seriously reduce etcd cluster availability [#7321]
● Test clientv3 balancer under network partitions, other failures [#8711]
@jevans
#5 Kubernetes API
● CoreDNS crashes when the API server is down [#2629]
● CVE-2018-1002105 [#71411]
● When the API server is down, operators and some sidecar/init containers may crash (always run it in HA)
● The Kubernetes scheduler and controller crash when they are connected to localhost [#22846 and #77764]
@jevans
#6 Small deployments and edge computing
● edge computing at Chick-fil-A
● Service overhead
● Deployment and monitoring are not so easy.
● Challenge: cross-cluster connections.
#7 Enforcing default limits for containers
>Me [2:20 PM]
but I see you’ve had a firm hand lately when splitting resources between teams :)
I think I’ll fix those limits and configurations this week
….
because I feel a bit silly that we’re struggling with such simple problems:
>Colleague XYZ [3:23 PM]
my grandma always said that stealing is what’s silly
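Enforcing such defaults can be done per team namespace with a LimitRange. A minimal sketch, assuming a hypothetical namespace team-a and purely illustrative values:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits        # hypothetical name
  namespace: team-a           # hypothetical namespace
spec:
  limits:
  - type: Container
    default:                  # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:           # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi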
#8 Run stateful apps
https://twitter.com/kelseyhightower/status/963413508300812295
#9 Operators help to manage STS, but:
● they are complex,
● they mostly cover 60-80% of all maintenance tasks,
● choose managed services in the cloud, or classic orchestration for on-premise solutions,
● sometimes an STS app version bump requires manual operations.
#10 Persistent volumes and k8s on-premise
● NFS - replication is tricky
● Rook operator [Ceph or EdgeFS] - complex
● Local volumes are still in beta https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/
● Expanding Persistent Volume Claims is still in beta
● Flexvolume and CSI drivers
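For on-premise clusters, a local PersistentVolume pinned to a node is one option. A minimal sketch, assuming a hypothetical node name, disk path and sizes:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage             # hypothetical name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-1                # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1         # illustrative path on the node
  nodeAffinity:                   # local PVs must be pinned to a node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-1              # hypothetical node name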
#11 App flapping -> connection reset via ingress
Symptoms: Active connections are reset after 5 minutes.
Root cause:
1. Pod rescheduled (container OOM), new pod == new IP.
2. Service adds a new endpoint -> nginx configuration reload.
3. Nginx conf reload -> wait 5 minutes (worker-shutdown-timeout) and kill the old worker.
Related issue: #2461
nginx.com
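The worker-shutdown-timeout can be tuned via the ingress-nginx controller ConfigMap. A minimal sketch, assuming the controller reads a ConfigMap named ingress-nginx-controller in the ingress-nginx namespace (names depend on your installation; the value is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name/namespace depend on how ingress-nginx was installed
  namespace: ingress-nginx
data:
  worker-shutdown-timeout: "10s"   # shorten the grace period for old nginx workers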
#12 Multi-tenancy and RBAC
● Single-tenant with multiple clusters, or one multi-tenant cluster.
● Permissions are universal per resource type.
● No field-level access control.
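A minimal sketch of a namespaced Role plus RoleBinding; note that rules are granted per resource type and verb, with no per-field granularity (all names here are hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-editor          # hypothetical name
  namespace: team-a                # hypothetical namespace
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]       # whole resource type, no field-level control
  verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-editor-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a-developers          # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-editor
  apiGroup: rbac.authorization.k8s.io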
#13 Namespace - resource isolation ;)
https://xkcd.com/2044/
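Namespace-level resource isolation is usually paired with a ResourceQuota. A minimal sketch, assuming a hypothetical namespace and illustrative limits:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota               # hypothetical name
  namespace: team-a                # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"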
#14 Network Policy
By default, the network is “flat” inside Kubernetes ;)
Common network policies:
https://github.com/ahmetb/kubernetes-network-policy-recipes
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  namespace: secondary
  name: deny-from-other-namespaces
spec:
  podSelector:
    matchLabels:
  ingress:
  - from:
    - podSelector: {}
#15 Infrastructure resources and stability
Tooling
#1 Application deployment
Happy helming:
● The syntax is hard, especially when you start.
● Storing secrets requires an extra plugin. [helm-secrets]
● Umbrella charts are always tricky. [#4490]
● Helm upgrade fails when new objects are added [#4871]
● Tiller and RBAC [Tiller was removed in Helm 3, discussion here: #1918]
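A minimal sketch of an umbrella chart: the Chart.yaml lists subchart dependencies and the parent values.yaml overrides subchart values under the subchart’s name (chart names, versions and keys are illustrative assumptions, Helm 3 format):

# Chart.yaml
apiVersion: v2
name: my-umbrella                # hypothetical umbrella chart
version: 0.1.0
dependencies:
- name: backend                  # hypothetical subchart
  version: 1.2.3
  repository: https://charts.example.com
- name: frontend                 # hypothetical subchart
  version: 2.0.0
  repository: https://charts.example.com

# values.yaml - values for a subchart are nested under its name
backend:
  replicaCount: 3
frontend:
  image:
    tag: "2.0.0"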
#2 Telemetry
If you like new and fancy solutions, try prometheus-operator:
● https://github.com/coreos/prometheus-operator
● https://github.com/helm/charts/tree/master/stable/prometheus-operator
Potential problems:
● How to add custom alerts, dashboards and monitoring rules?
● Should we use multiple smaller instances or one big one?
● Where should it be deployed?
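Custom alerts can be added through the prometheus-operator PrometheusRule CRD. A minimal sketch; the label that makes the operator pick the rule up must match the ruleSelector of your Prometheus CR, and the names, labels and expression here are illustrative assumptions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts              # hypothetical name
  namespace: monitoring            # hypothetical namespace
  labels:
    release: prometheus-operator   # must match your Prometheus ruleSelector
spec:
  groups:
  - name: custom.rules
    rules:
    - alert: PodRestartingTooOften
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is restarting too often"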
#3 Logging
Nothing new: the EFK stack does the job, but:
● In a multi-tenant setup we should implement Elasticsearch document-level security:
https://opendistro.github.io/for-elasticsearch/
● Kubernetes logs are still plaintext, not structured.
● Log unification
#4 Need more ;)
● Service Mesh
● Tracing
● Cross cluster communication
● Infrastructure testing
● Sidecars and init containers
● ...
People and mindset
Nobody said it is easy ;)
