Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Kubernetes @ Squarespace: Kubernetes in the Datacenter

451 views

Published on

This talk was presented at SRE NYC Meetup on August 16, 2017 at Squarespace HQ.
https://www.youtube.com/watch?v=UJ1QAKprVr4

As the engineering teams at Squarespace grow, we have been building more and more microservices. However, this has added operational strain as we try to shoehorn a growing, complex dynamic environment into our static data center infrastructure. We needed to rethink how we handle deployments, dependency management, resource allocation, monitoring, and alerting. Docker containerization and Kubernetes orchestration helps us tackle many of these problems, but the journey has been challenging. In this talk, we’ll discuss the challenges of running Kubernetes in a datacenter and how we switched to a more SLA-focused alert structure than per instance health with Prometheus and AlertManager.

Published in: Engineering
  • Be the first to comment

  • Be the first to like this

Kubernetes @ Squarespace: Kubernetes in the Datacenter

  1. 1. Kubernetes @ Squarespace Microservices on Kubernetes in a Datacenter Kevin Lynch klynch@squarespace.com
  2. 2. Agenda 01 The problem with static infrastructure 02 Kubernetes Fundamentals 03 Adapting Microservices to Kubernetes 04 Kubernetes in a datacenter?
  3. 3. Microservices Journey: A Story of Growth 2013: small (< 50 engineers) build product & grow customer base whatever works 2014: medium (< 100 engineers) we have a lot of customers now! whatever works doesn't work anymore 2016: large (100+ engineers) architect for scalability and reliability organizational structures ?: XL (200+ engineers)
  4. 4. Challenges with a Monolith ● Reliability ● Performance ● Engineering agility/speed, cross-team coupling ● Engineering time spent fire fighting rather than building new functionality What were the increasingly difficult challenges with a monolith?
  5. 5. Challenges with a Monolith ● Minimize failure domains ● Developers are more confident in their changes ● Squarespace can move faster Solution: Microservices!
  6. 6. Operational Challenges ● Engineering org grows… ● More features... ● More services… ● More infrastructure to spin up… ● Ops becomes a blocker... Stuck in a loop
  7. 7. Traditional Provisioning Process ● Pick ESX with available resources ● Pick IP ● Register host to Cobbler ● Register DNS entry ● Create new VM on ESX ● PXE boot VM and install OS and base configuration ● Install system dependencies (LDAP, NTP, CollectD, Sensu…) ● Install app dependencies (Java, FluentD/Filebeat, Consul, Mongo- S…) ● Install the app ● App registers with discovery system and begins receiving traffic
  8. 8. Containerization & Kubernetes Orchestration ● Difficult to find resources ● Slow to provision and scale ● Discovery is a must ● Metrics system must support short lived metrics ● Alerts are usually per instance Static infrastructure and microservices do not mix!
  9. 9. Kubernetes Provisioning Process ● kubectl apply -f app.yaml
  10. 10. Kubernetes Fundamentals ● ApiVersion & Kind ○ type of object ● Metadata ○ Names, annotations, labels ● Spec & Status ○ What you want to happen... ○ … versus reality apiVersion: v1 kind: Pod metadata: name: nginx namespace: default annotations: squarespace.net/build: nginx-42 labels: app: frontend ... spec: containers: - name: nginx image: nginx:latest ... status: hostIP: 10.122.1.201 podIP: 10.123.185.9 phase: Running qosClass: BestEffort startTime: 2017-07-31T02:08:25Z ...
  11. 11. Kubernetes Fundamentals ● Labels ○ KV pairs used for identification ○ Indexed for efficient querying ● Annotations ○ Non identifying information ○ Can be unstructured apiVersion: v1 kind: Pod metadata: name: nginx namespace: default annotations: squarespace.net/build: nginx-42 labels: app: frontend ... spec: containers: - name: nginx image: nginx:1.8.1 ... status: hostIP: 10.122.1.201 podIP: 10.123.185.9 phase: Running qosClass: BestEffort startTime: 2017-07-31T02:08:25Z ...
  12. 12. Common Objects: Pods ● Basic deployable workload ● Group of 1+ containers ● Define resource requirements ● Defines storage volumes ○ Ephemeral storage ○ Shared storage (NFS, CephFS) ○ Block storage (RBD) ○ Secrets ○ ConfigMaps ○ more... spec: containers: - name: location image: .../location:master-269 ports: ... resources: limits: cpu: 2 memory: 4Gi requests: cpu: 2 memory: 4Gi volumeMounts: - name: config mountPath: /service/config - name: log-dir mountPath: /data/logs volumes: - name: config configMap: name: location-config - name: log-dir emptyDir: {}
  13. 13. Common Objects: Deployments ● Declarative ● Defines a type of pod to run ● Defines desired # ● Supports basic operations ○ Can be rolled back quickly! ○ Can be scaled up/down ● Meant to be stateless apps! kind: Deployment spec: replicas: 3 selector: matchLabels: service: location strategy: rollingUpdate: maxSurge: 100% maxUnavailable: 0 type: RollingUpdate template: ... pod info here ...
  14. 14. Common Objects: Services ● Make pods addressable ● Assigned an IP ● Addressable DNS entries! apiVersion: v1 kind: Service metadata: name: location namespace: core-services spec: type: ClusterIP clusterIP: 10.123.79.211 selector: service: location ports: - name: traffic port: 8080 - name: admin port: 8081
  15. 15. Common Objects: Namespaces ● Namespaces ○ Isolates groups of objects ■ Developer ■ Team ■ System or Service ○ Good for permission boundaries ○ Good for network boundaries ● Most objects are namespaced apiVersion: v1 kind: Namespace metadata: name: core-services annotations: squarespace.net/contact: | team@squarespace.com creationTimestamp: 2017-06-14T.. spec: finalizers: - kubernetes status: phase: Active
  16. 16. Microservice Pod Definition resources: requests: cpu: 2 memory: 4Gi limits: cpu: 2 memory: 4Gi Microservice Pod Java Microservice fluentd consul
  17. 17. Future Work: Updating Common Dependencies ● Custom Initializers ○ Inject container dependencies into deployments (consul, fluentd) ○ Configure Prometheus instances for each namespace ● Trigger rescheduling of pods when dependencies need updating apiVersion: extensions/v1beta1 kind: Deployment metadata: name: location namespace: core-services annotations: initializer.squarespace.net/consul: "true"
  18. 18. Future Work: Enforce Squarespace Standards ● Custom Admission Controller requires all services, deployments, etc. meet certain standards ○ Resource requests/limits ○ Owner annotations ○ Service labels
  19. 19. Quality of Service Classes resources: requests: cpu: 2 memory: 4Gi limits: cpu: 2 memory: 4Gi ● BestEffort ○ No resource constraints ○ First to be killed under pressure ● Guaranteed ○ Requests == Limits ○ Last to kill under pressure ○ Easier to reason about resources ● Burstable ○ Take advantage of unused resources! ○ Can be tricky with some languages
  20. 20. Microservice Pod Definition resources: requests: cpu: 2 memory: 4Gi limits: cpu: 2 memory: 4Gi ● Kubernetes assumes no other processes are consuming significant resources ● Completely Fair Scheduler (CFS) ○ Schedules a task based on CPU Shares ○ Throttles a task once it hits CPU Quota ● OOM Killed when memory limit exceeded
  21. 21. Microservice Pod Definition resources: requests: cpu: 2 memory: 4Gi limits: cpu: 2 memory: 4Gi ● Shares = CPU Request * 1024 ● Total Kubernetes Shares = # Cores * 1024 ● Quota = CPU Limit * 100ms ● Period = 100ms
  22. 22. Java in a Container ● JVM is able to detect # of cores via sysconf(_SC_NPROCESSORS_ONLN) ● Many libraries rely on Runtime.getRuntime.availableProcessors() ○ Jetty ○ ForkJoinPool ○ GC Threads ○ That mystery dependency...
  23. 23. Java in a Container ● Provide a base container that calculates the container’s resources! ● Detect # of “cores” assigned ○ /sys/fs/cgroup/cpu/cpu.cfs_quota_us divided by /sys/fs/cgroup/cpu/cpu.cfs_period_us ● Automatically tune the JVM: ○ -XX:ParallelGCThreads=${core_limit} ○ -XX:ConcGCThreads=${core_limit} ○ -Djava.util.concurrent.ForkJoinPool.common.parallelism=${core_limit}
  24. 24. Java in a Container ● Use Linux preloading to override availableProcessors() #include <stdlib.h> #include <unistd.h> int JVM_ActiveProcessorCount(void) { char* val = getenv("CONTAINER_CORE_LIMIT"); return val != NULL ? atoi(val) : sysconf(_SC_NPROCESSORS_ONLN); } https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
  25. 25. Service Lifecycle ● How do we observe the health of services? ● How do we handle rollbacks?
  26. 26. Monitoring
  27. 27. ● Graphite does not scale well with ephemeral instances ● Easy to have combinatoric explosion of metrics Traditional Monitoring & Alerting ● Application and system alerts are tightly coupled ● Difficult to create alerts on SLAs ● Difficult to route alerts
  28. 28. Traditional Monitoring & Alerting
  29. 29. Falco: Centralized Service Management ● Kubernetes Dashboard is too complex and powerful ● Centralized deployment status and history ● Manual rollbacks of deploys ● Quick access to scaling controls
  30. 30. Kubernetes Dashboard http://kubernetes-dashboard.kube-system.svc.eqx.dal.prod.kubernetes/
  31. 31. Falco: Centralized Service Management
  32. 32. ● Efficient for ephemeral instances ● Stores tagged data ● Easy to have many smaller instances (per team or complex system) ● Prometheus Operator runs everything in Kubernetes! Kubernetes Monitoring & Alerting ● Alerts are defined with the application code! ● Easy to define SLA alerts ● Routing is still difficult
  33. 33. Kubernetes Monitoring & Alerting
  34. 34. Prometheus Operator
  35. 35. Kubernetes Monitoring & Alerting ● Sensu checks for all core components ● Sent to PagerDuty
  36. 36. Kubernetes Monitoring & Alerting http://prometheus.kube-system.svc.eqx.dal.prod.kubernetes:9090/alerts
  37. 37. Kubernetes in a datacenter?
  38. 38. Kubernetes Architecture
  39. 39. Kubernetes Networking
  40. 40. Spine and Leaf Layer 3 Clos Topology ● All work is performed at the leaf/ToR switch ● Each leaf switch is separate Layer 3 domain ● Each leaf is a separate BGP domain (ASN) ● No Spanning Tree Protocol issues seen in L2 networks (convergence time, loops) Leaf Leaf Leaf Leaf Spine Spine
  41. 41. Spine and Leaf Layer 3 Clos Topology ● Simple to understand ● Easy to scale ● Predictable and consistent latency (hops = 2) ● Allows for Anycast IPs Leaf Leaf Leaf Leaf Spine Spine
  42. 42. Calico Networking ● No network overlay required! ○ No nasty MTU issues ○ No performance impact ● Communicates directly with existing L3 network ● BGP Peering with Top of Rack switch
  43. 43. Calico Networking ● Engineers can think of Pod IPs as normal hosts (they’re not) ○ Ping works ○ Consul works normally ○ Browser communication works ○ Shell sorta works (kubectl exec -it pod sh)
  44. 44. Calico Networking ● Each worker announces it’s pod IP ranges ○ Aggregated to /26 ● Each master announces an External Anycast IP ○ Used for component communication ● Each ingress tier announces the Service IP range ip addr add 10.123.0.0/17 dev lo etcdctl set /calico/bgp/v1/global/custom_filters/v4/services 'if ( net = 10.123.0.0/17 ) then { accept; }'
  45. 45. Calico Networking: Firewalls ● Calico supports NetworkPolicy firewall rules ○ We aren’t using this yet! ● Add DefaultDeny to block traffic into namespace ● Add Ingress rules for whitelisted communication ○ Works across namespaces ○ Works with raw IP ranges
  46. 46. QUESTIONS? Thank you! squarespace.com/careers Kevin Lynch klynch@squarespace.com
  47. 47. Future Work: Security! ● PodSecurityPolicy ● Mutual TLS in our environment: ○ Kubernetes WG Components WG ○ SIG Auth ○ SPIFFE ○ ISTIO
  48. 48. How do I connect to the cluster? ● Look at Getting Started guide on the wiki ● Generate a kubeconfig file ○ curl --user $(whoami) https://kubeconfig-generator.squarespace.net ● Uses KeyCloak OIDC to authenticate users ● Automatically refreshes credentials! ● Audit Logs are sent to logs.squarespace.net
  49. 49. Squarespace Clusters Dallas Production 8 nodes 512 cores 2 TB RAM NJ Production 4 nodes 256 cores 1 TB RAM Dallas Staging 6 nodes 384 cores 1.5 TB RAM NJ Staging Coming soon! Dallas Corp 8 nodes 416 cores 1.5 TB RAM NJ Corp Coming soon! Testbed 6 Mac Minis :-p
  50. 50. Kube-Proxy: Internal Networking ● Runs on every host ● Routes service IPs to pods ● Watches for changes ● Updates IPTables rules
  51. 51. Communication With External Services ● Environment specific services should not be encoded in application ● Single deployment for all environments and datacenters ● Federation API expects same deployment ● Not all applications are using consul
  52. 52. Communication With External Services
  53. 53. Communication With External Services apiVersion: v1 kind: Service metadata: name: kafka namespace: elk spec: type: ClusterIP clusterIP: None sessionAffinity: None ports: - port: 9092 protocol: TCP targetPort: 9092 apiVersion: v1 kind: Endpoints metadata: name: kafka namespace: elk subsets: - addresses: - ip: 10.120.201.33 - ip: 10.120.201.34 - ip: 10.120.201.35 ... ports: - port: 9092 protocol: TCP

×