Kubernetes @ Squarespace: Kubernetes in the Datacenter

Kubernetes @ Squarespace
Microservices on Kubernetes in a Datacenter
Kevin Lynch
klynch@squarespace.com

Agenda
01 The problem with static infrastructure
02 Kubernetes Fundamentals
03 Adapting Microservices to Kubernetes
04 Kubernetes in a datacenter?

Microservices Journey: A Story of Growth
2013: small (< 50 engineers)
build product & grow customer base
whatever works
2014: medium (< 100 engineers)
we have a lot of customers now!
whatever works doesn't work anymore
2016: large (100+ engineers)
architect for scalability and reliability
organizational structures
?: XL (200+ engineers)

Challenges with a Monolith
● Reliability
● Performance
● Engineering agility/speed, cross-team coupling
● Engineering time spent fire fighting rather than building new
functionality
What were the increasingly difficult challenges with a
monolith?

Challenges with a Monolith
● Minimize failure domains
● Developers are more confident in their changes
● Squarespace can move faster
Solution: Microservices!

Operational Challenges
● Engineering org grows…
● More features...
● More services…
● More infrastructure to spin up…
● Ops becomes a blocker...
Stuck in a loop

Traditional Provisioning Process
● Pick ESX with available resources
● Pick IP
● Register host to Cobbler
● Register DNS entry
● Create new VM on ESX
● PXE boot VM and install OS and base configuration
● Install system dependencies (LDAP, NTP, CollectD, Sensu…)
● Install app dependencies (Java, FluentD/Filebeat, Consul, Mongo-
S…)
● Install the app
● App registers with discovery system and begins receiving traffic

Containerization & Kubernetes Orchestration
● Difficult to find resources
● Slow to provision and scale
● Discovery is a must
● Metrics system must support short lived metrics
● Alerts are usually per instance
Static infrastructure and microservices do not mix!

Kubernetes Provisioning Process
● kubectl apply -f app.yaml

Kubernetes Fundamentals
● ApiVersion & Kind
○ type of object
● Metadata
○ Names, annotations, labels
● Spec & Status
○ What you want to happen...
○ … versus reality
apiVersion: v1
kind: Pod
metadata:
name: nginx
namespace: default
annotations:
squarespace.net/build: nginx-42
labels:
app: frontend
...
spec:
containers:
- name: nginx
image: nginx:latest
...
status:
hostIP: 10.122.1.201
podIP: 10.123.185.9
phase: Running
qosClass: BestEffort
startTime: 2017-07-31T02:08:25Z
...

Kubernetes Fundamentals
● Labels
○ KV pairs used for identification
○ Indexed for efficient querying
● Annotations
○ Non identifying information
○ Can be unstructured
apiVersion: v1
kind: Pod
metadata:
name: nginx
namespace: default
annotations:
squarespace.net/build: nginx-42
labels:
app: frontend
...
spec:
containers:
- name: nginx
image: nginx:1.8.1
...
status:
hostIP: 10.122.1.201
podIP: 10.123.185.9
phase: Running
qosClass: BestEffort
startTime: 2017-07-31T02:08:25Z
...

Common Objects: Pods
● Basic deployable workload
● Group of 1+ containers
● Define resource requirements
● Defines storage volumes
○ Ephemeral storage
○ Shared storage (NFS, CephFS)
○ Block storage (RBD)
○ Secrets
○ ConfigMaps
○ more...
spec:
containers:
- name: location
image: .../location:master-269
ports: ...
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 2
memory: 4Gi
volumeMounts:
- name: config
mountPath: /service/config
- name: log-dir
mountPath: /data/logs
volumes:
- name: config
configMap:
name: location-config
- name: log-dir
emptyDir: {}

Common Objects: Deployments
● Declarative
● Defines a type of pod to run
● Defines desired #
● Supports basic operations
○ Can be rolled back quickly!
○ Can be scaled up/down
● Meant to be stateless apps!
kind: Deployment
spec:
replicas: 3
selector:
matchLabels:
service: location
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 0
type: RollingUpdate
template:
... pod info here ...

Common Objects: Services
● Make pods addressable
● Assigned an IP
● Addressable DNS entries!
apiVersion: v1
kind: Service
metadata:
name: location
namespace: core-services
spec:
type: ClusterIP
clusterIP: 10.123.79.211
selector:
service: location
ports:
- name: traffic
port: 8080
- name: admin
port: 8081

Common Objects: Namespaces
● Namespaces
○ Isolates groups of objects
■ Developer
■ Team
■ System or Service
○ Good for permission boundaries
○ Good for network boundaries
● Most objects are namespaced
apiVersion: v1
kind: Namespace
metadata:
name: core-services
annotations:
squarespace.net/contact: |
team@squarespace.com
creationTimestamp: 2017-06-14T..
spec:
finalizers:
- kubernetes
status:
phase: Active

Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
Microservice Pod
Java Microservice
fluentd consul

Future Work: Updating Common Dependencies
● Custom Initializers
○ Inject container dependencies into deployments (consul, fluentd)
○ Configure Prometheus instances for each namespace
● Trigger rescheduling of pods when dependencies need updating
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: location
namespace: core-services
annotations:
initializer.squarespace.net/consul: "true"

Future Work: Enforce Squarespace Standards
● Custom Admission Controller requires all services, deployments, etc.
meet certain standards
○ Resource requests/limits
○ Owner annotations
○ Service labels

Quality of Service Classes
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● BestEffort
○ No resource constraints
○ First to be killed under pressure
● Guaranteed
○ Requests == Limits
○ Last to kill under pressure
○ Easier to reason about resources
● Burstable
○ Take advantage of unused resources!
○ Can be tricky with some languages

resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Kubernetes assumes no other processes are
consuming significant resources
● Completely Fair Scheduler (CFS)
○ Schedules a task based on CPU Shares
○ Throttles a task once it hits CPU Quota
● OOM Killed when memory limit exceeded

resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Shares = CPU Request * 1024
● Total Kubernetes Shares = # Cores * 1024
● Quota = CPU Limit * 100ms
● Period = 100ms

Java in a Container
● JVM is able to detect # of cores via sysconf(_SC_NPROCESSORS_ONLN)
● Many libraries rely on Runtime.getRuntime.availableProcessors()
○ Jetty
○ ForkJoinPool
○ GC Threads
○ That mystery dependency...

Java in a Container
● Provide a base container that calculates the container’s resources!
● Detect # of “cores” assigned
○ /sys/fs/cgroup/cpu/cpu.cfs_quota_us divided by
/sys/fs/cgroup/cpu/cpu.cfs_period_us
● Automatically tune the JVM:
○ -XX:ParallelGCThreads=${core_limit}
○ -XX:ConcGCThreads=${core_limit}
○ -Djava.util.concurrent.ForkJoinPool.common.parallelism=${core_limit}

Java in a Container
● Use Linux preloading to override availableProcessors()
#include <stdlib.h>
#include <unistd.h>
int JVM_ActiveProcessorCount(void) {
char* val = getenv("CONTAINER_CORE_LIMIT");
return val != NULL ? atoi(val) : sysconf(_SC_NPROCESSORS_ONLN);
}
https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling

Service Lifecycle
● How do we observe the health of services?
● How do we handle rollbacks?

● Graphite does not scale well with ephemeral instances
● Easy to have combinatoric explosion of metrics
Traditional Monitoring & Alerting
● Application and system alerts are tightly coupled
● Difficult to create alerts on SLAs
● Difficult to route alerts

Traditional Monitoring & Alerting

Falco: Centralized Service Management
● Kubernetes Dashboard is too complex and powerful
● Centralized deployment status and history
● Manual rollbacks of deploys
● Quick access to scaling controls

Kubernetes Dashboard
http://kubernetes-dashboard.kube-system.svc.eqx.dal.prod.kubernetes/

Falco: Centralized Service Management

● Efficient for ephemeral instances
● Stores tagged data
● Easy to have many smaller instances (per team or complex system)
● Prometheus Operator runs everything in Kubernetes!
Kubernetes Monitoring & Alerting
● Alerts are defined with the application code!
● Easy to define SLA alerts
● Routing is still difficult

● Sensu checks for all core components
● Sent to PagerDuty

http://prometheus.kube-system.svc.eqx.dal.prod.kubernetes:9090/alerts

Spine and Leaf Layer 3 Clos Topology
● All work is performed at the leaf/ToR switch
● Each leaf switch is separate Layer 3 domain
● Each leaf is a separate BGP domain (ASN)
● No Spanning Tree Protocol issues seen in L2 networks (convergence
time, loops)
Leaf Leaf Leaf Leaf
Spine Spine

Spine and Leaf Layer 3 Clos Topology
● Simple to understand
● Easy to scale
● Predictable and consistent latency (hops = 2)
● Allows for Anycast IPs
Leaf Leaf Leaf Leaf
Spine Spine

Calico Networking
● No network overlay required!
○ No nasty MTU issues
○ No performance impact
● Communicates directly with existing L3 network
● BGP Peering with Top of Rack switch

Calico Networking
● Engineers can think of Pod IPs as normal hosts
(they’re not)
○ Ping works
○ Consul works normally
○ Browser communication works
○ Shell sorta works (kubectl exec -it pod sh)

Calico Networking
● Each worker announces it’s pod IP ranges
○ Aggregated to /26
● Each master announces an External Anycast IP
○ Used for component communication
● Each ingress tier announces the Service IP range
ip addr add 10.123.0.0/17 dev lo
etcdctl set
/calico/bgp/v1/global/custom_filters/v4/services
'if ( net = 10.123.0.0/17 ) then { accept; }'

Calico Networking: Firewalls
● Calico supports NetworkPolicy firewall rules
○ We aren’t using this yet!
● Add DefaultDeny to block traffic into namespace
● Add Ingress rules for whitelisted communication
○ Works across namespaces
○ Works with raw IP ranges

QUESTIONS?
Thank you!
squarespace.com/careers
Kevin Lynch
klynch@squarespace.com

Future Work: Security!
● PodSecurityPolicy
● Mutual TLS in our environment:
○ Kubernetes WG Components WG
○ SIG Auth
○ SPIFFE
○ ISTIO

How do I connect to the cluster?
● Look at Getting Started guide on the wiki
● Generate a kubeconfig file
○ curl --user $(whoami) https://kubeconfig-generator.squarespace.net
● Uses KeyCloak OIDC to authenticate users
● Automatically refreshes credentials!
● Audit Logs are sent to logs.squarespace.net

Squarespace Clusters
Dallas Production
8 nodes
512 cores
2 TB RAM
NJ Production
4 nodes
256 cores
1 TB RAM
Dallas Staging
6 nodes
384 cores
1.5 TB RAM
NJ Staging
Coming soon!
Dallas Corp
8 nodes
416 cores
1.5 TB RAM
NJ Corp
Coming soon!
Testbed
6 Mac Minis :-p

Kube-Proxy: Internal Networking
● Runs on every host
● Routes service IPs to pods
● Watches for changes
● Updates IPTables rules

Communication With External Services
● Environment specific services should not be encoded in application
● Single deployment for all environments and datacenters
● Federation API expects same deployment
● Not all applications are using consul

apiVersion: v1
kind: Service
metadata:
name: kafka
namespace: elk
spec:
type: ClusterIP
clusterIP: None
sessionAffinity: None
ports:
- port: 9092
protocol: TCP
targetPort: 9092
apiVersion: v1
kind: Endpoints
metadata:
name: kafka
namespace: elk
subsets:
- addresses:
- ip: 10.120.201.33
- ip: 10.120.201.34
- ip: 10.120.201.35
...
ports:
- port: 9092
protocol: TCP

Kubernetes @ Squarespace: Kubernetes in the Datacenter

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Kubernetes @ Squarespace: Kubernetes in the Datacenter

Similar to Kubernetes @ Squarespace: Kubernetes in the Datacenter (20)

Recently uploaded

Recently uploaded (20)

Kubernetes @ Squarespace: Kubernetes in the Datacenter

Editor's Notes