In a tool-heavy infrastructure world, this talk explains how to rethink DevOps as contract-first, instead of focusing on tools, teams, or roles. We will cover 5 different infrastructure areas: how each was previously treated as a tool, what bottlenecks it faced, how it was remodeled as a contract, and how the whole area scaled up. Any organization that wants to scale up its infrastructure will relate to the problems and solutions outlined in these examples.
https://www.devopsdays.org/events/2018-berlin/program/subhas-dandapani/
2. GoEuro
- Travel ticket search and booking platform
- 15 countries
- 100k+ destinations
- 800+ partners and providers
- 20m+ monthly visitors
3. Challenges - circa 2017
- 50 to 150+ engineers
- 10 to 300+ services
- How many changes can we put in the user’s hands in a week?
- Centralization bottlenecks
- “Throw over the wall” from dev to QA to ops
- Exponential growth
- Huge infrastructure QoS requirements
- Making best use of available resources
- Doing changes fast
- Scaling without breaking
- Log/metrics explosion
Infrastructure Architecture
All forms of integration: REST, SOAP, gRPC, Event-driven, DB-driven, File-driven, Metrics-driven
Scala, Java, NodeJS, Golang, Python, ...
Geo, Routing, Scheduling, Ticketing, Big Data, ...
Delivery / DevOps
7. CI as a Tool
- Before: Jenkins-as-a-Tool
- Huge, complex CI jobs
- Partially configured by application teams and DevOps, no clear boundaries
- Several dedicated release managers who needed to maintain a "mind map" of releases
- Engineers having to ping DevOps teams for releases
- Many CI plugins were installed for different teams
- Every Jenkins or Jenkins-plugin upgrade broke random jobs
- Agent configuration, auto-scaling, and job execution were also problematic
- Tried Jenkinsfile, but it was still Jenkins as a tool!
- Lots of copy-pasted config, shared functions, inability to change things as a whole from outside, code injection?, etc.
8. Jenkinsfile
- Inability to instrument jobs as a whole and add global shared behavior
- Cannot parse the code and modify its AST
- Lots of copy-pasted config, shared functions
- Inability to do continuous changes on those functions
- Inability to prevent tie-in with internal plugins
- It’s still Jenkins-as-a-tool
9. CI as a Contract
- Should be semantically understandable and instrumentable
- Adopt a job definition contract
- Dots (isolated jobs), pipelines (lists), or graphs: just pick one and adopt a YAML contract
11. CI as a Contract
- Specify pipeline contract with container image, scripts, checkpoints, etc.
- We take care of the implementation
- Team adds everything else on top of this file
- Build notifications, Caching, Agent allocation, Autoscaling, Analytics, Organizational context, Auditing, etc.
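A minimal sketch of what such a pipeline contract could look like once parsed, assuming a YAML/JSON document with only two top-level keys. The key names (`image`, `steps`, `script`, `checkpoint`) and the `parse_pipeline` helper are hypothetical illustrations, not the format shown in the talk. The point is that a small, closed schema can be understood and instrumented as a whole, which a free-form Jenkinsfile cannot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    script: List[str]
    checkpoint: bool = False  # hypothetical flag: pause for approval after this step


@dataclass
class Pipeline:
    image: str
    steps: List[Step] = field(default_factory=list)


def parse_pipeline(doc: dict) -> Pipeline:
    """Turn an already-parsed YAML/JSON document into a Pipeline.

    Unknown top-level keys are rejected so that the contract stays small and
    tool-specific settings cannot leak into it.
    """
    unknown = set(doc) - {"image", "steps"}
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if not doc.get("image"):
        raise ValueError("pipeline needs a container image")

    steps = []
    for raw in doc.get("steps", []):
        if not raw.get("script"):
            raise ValueError(f"step {raw.get('name')!r} has no script")
        steps.append(
            Step(raw["name"], list(raw["script"]), bool(raw.get("checkpoint", False)))
        )
    if not steps:
        raise ValueError("pipeline needs at least one step")
    return Pipeline(doc["image"], steps)
```

Because the implementation only ever sees a `Pipeline` value, concerns such as notifications, caching, or agent allocation can be layered around every step without touching any repository.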
13. Configuration Management
- Minimal stack that we needed for every service
- Artifact (JAR/WAR/Docker image/etc)
- Service (shell script/initscript/systemd unit/docker container/etc)
- + Supporting services (watchdogs, ancillary utilities, etc)
- Configuration for different environments
- Multiple Instances of the stack
- Connected to traffic (networking, firewall, load balancer)
- Stack as a unit
- Let’s worry about post-deployment activities next
14. Configuration Management
- Fulfilling the minimal package
- Distribute Chef cookbooks + librarian-chef + chef apply
- Or Ansible playbooks + Ansible Galaxy + ansible apply
- Or Puppet/Salt/...
- Distribute Terraform with strict credentials + terraform apply
- Distribute <X> container orchestrator + apply
- Distribute Kubernetes resources + kubectl apply
- Core difference between these tools?
15. Kubernetes as a Contract
Instance = Podspec
Running unit = Container spec
Multiple Instances = Deployment spec
Configuration = ConfigMap spec
Traffic = Service, Ingress spec
All modeled as JSON or YAML, but has a standard contract/spec
kind: Deployment
metadata:
  name: {{ .Values.name }}
  namespace: {{ .Values.name }}
spec:
  replicas: 5
  … … …
      containers:
      - name: my-app
        image: my-container:latest
        resources:
          requests:
            cpu: …
            memory: …
          limits:
            cpu: …
            memory: …
        livenessProbe:
          httpGet:
            path: /_system/health
        env: ...
16. Application Stack
Kubernetes API Server
Open, Secure HTTPS Protocol
Kubernetes responds to fulfil what has been applied
Artifact = Docker image
Instance = Pod
Running unit = Container
Multiple Instances = Deployment
Configuration = ConfigMap
Traffic = Service, Ingress
All modeled as JSON or YAML, but has a standard Resource spec
kubectl apply
18. Kubernetes as a Contract
- Health checks must exist
- CPU/memory must exist
- Images must not be external
- Entrypoint is for us, Script is for app
- No alpha/beta stuff
- Whitelisted resources, separate stateful & stateless clusters
- Minimum 2 replicas
- Similar for all resources
kind: Deployment
metadata:
  name: {{ .Values.name }}
  namespace: {{ .Values.name }}
spec:
  replicas: 5
  … … …
      containers:
      - name: my-app
        image: my-container:latest
        resources:
          requests:
            cpu: …
            memory: …
          limits:
            cpu: …
            memory: …
        livenessProbe:
          httpGet:
            path: /_system/health
        env: ...
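The rules on this slide lend themselves to an automated check. The sketch below is one way such a check could look, run against a Deployment manifest that has already been parsed into a dict. The `check_deployment` function and the `registry.example.org/` prefix are hypothetical, and only four of the listed rules are covered (minimum replicas, health check, CPU/memory, no external images).

```python
ORG_REGISTRY = "registry.example.org/"  # hypothetical internal registry prefix
MIN_REPLICAS = 2


def check_deployment(manifest: dict) -> list:
    """Return a list of contract violations for a parsed Deployment manifest."""
    errors = []
    spec = manifest.get("spec", {})

    # Kubernetes defaults to a single replica when the field is omitted.
    if spec.get("replicas", 1) < MIN_REPLICAS:
        errors.append(f"replicas must be >= {MIN_REPLICAS}")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        errors.append("no containers defined")

    for c in containers:
        name = c.get("name", "<unnamed>")
        if "livenessProbe" not in c:
            errors.append(f"{name}: health check (livenessProbe) must exist")
        resources = c.get("resources", {})
        for section in ("requests", "limits"):
            for res in ("cpu", "memory"):
                if res not in resources.get(section, {}):
                    errors.append(f"{name}: resources.{section}.{res} must exist")
        if not c.get("image", "").startswith(ORG_REGISTRY):
            errors.append(f"{name}: image must come from {ORG_REGISTRY}")
    return errors
```

An empty list means the manifest satisfies these rules. Returning all violations at once, rather than failing on the first, gives application teams a complete picture in a single run.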
19. API Proxy
Validation, Linting, Org-wide standards, etc.
Cloud Resources as a Contract
Model your own contract
e.g. Cloud Bucket as a YAML
kubectl apply
Kubernetes API / Custom controllers
20. Application Stack
- Using Kubernetes since 1.2, and on 1.10 right now
- Avoiding the Kubernetes API maze
- “src” and “ops” in every repository, completely self-contained
- Kubernetes upgrades go exactly as planned, as we know what workloads are running and how to orchestrate/change them
- Multiple features and standards rolled out to everyone who uses the Kubernetes clusters
- Stateful and stateless clusters separate from day 1
- Not using Kubernetes/Helm as a tool, but as a contract
22. Logging as a Tool
- Logstash as a tool
- Hundreds of custom Logstash transformations, pipelines, ports
- Snowflake configs for different fields for each service
- Slow, ticketed bootstrap process for new services
- Explosion of indices
- Multiple different ways to push logs from applications
23. Logging as a Contract
- Print logs to STDOUT, and they will show up in Kibana
- If it’s JSON, you get structured fields
- If it’s plain, you get a plain message
- Standard enrichment rules to avoid stepping on each other’s toes:
- <field>_i = integer, <field>_s = string, <field>_geo = geo with lat/long, <field>_txt = text, etc.
- No application-specific code in Logstash anymore
- Everything else is taken care of by the team
- Scaling, Rotation, Retention, etc.
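The logging contract above can be sketched in a few lines. This is an illustration under assumptions, not the team's Logstash configuration: `parse_line` and `field_type` are hypothetical helper names, and only the four suffixes listed on the slide are handled.

```python
import json

# Suffix-to-type rules taken from the slide.
SUFFIX_TYPES = {"_i": "integer", "_s": "string", "_geo": "geo", "_txt": "text"}


def parse_line(line: str) -> dict:
    """JSON objects become structured fields; anything else is a plain message."""
    try:
        doc = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {"message": line}
    if not isinstance(doc, dict):
        # Valid JSON that is not an object (e.g. a bare number) is still plain text.
        return {"message": line}
    return doc


def field_type(name: str):
    """Return the type implied by a field's suffix, or None if it has none."""
    for suffix, kind in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return kind
    return None
```

Since the type is carried in the field name, two services cannot register the same field with conflicting types, which is how the convention avoids per-application rules in the log pipeline.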
24. Routing as a Tool
- Started with one router handling all traffic, hardcoded service discovery
- As the number of services grew, routing rules became more custom and inconsistent
- Nested rewrites and redirects, and randomly captured URL paths
- Most services had to carry custom nginx forwarders
- Proxies, then proxies-inside-proxies, etc.
- Unmanageable routing graph
- Complicated procedure to set up a new application
- Monitoring/logs were a different problem altogether
25. Routing as a Contract
- Adopted Ingress as a Contract
- Every service was assigned a fixed route based on its namespace
- Consistent and predictable routing
- Team adds value on top of the contract
- Automatic logging and monitoring
- Global health and SLA checks
- Extensive instrumentation and tracing
- Scalability
- Load balancing
- Edge gateways
- Cross-zone failovers
- Dashboards and network policies
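As a rough illustration of a fixed route derived from the namespace, the sketch below builds an Ingress manifest as a plain dict. The host pattern, the `svc.example.org` domain, and the `ingress_for` helper are assumptions, and the manifest uses the current `networking.k8s.io/v1` schema rather than whatever API version was in use at the time of the talk. It also assumes the service is named after its namespace, as in the Deployment example earlier.

```python
BASE_DOMAIN = "svc.example.org"  # hypothetical base domain


def ingress_for(namespace: str, port: int = 80) -> dict:
    """Build an Ingress whose host is derived only from the namespace."""
    host = f"{namespace}.{BASE_DOMAIN}"
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": namespace, "namespace": namespace},
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": namespace,
                                        "port": {"number": port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }
```

Because the route is a pure function of the namespace, nobody has to negotiate rewrite rules for a new service, and logging, health checks, and tracing can be attached at the one place every request passes through.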
26. Monitoring as a Contract
- The only tool-driven area in the organization
- Prometheus as a Contract is great
- But we have some way to go (clustering, sharding, etc)
27. Security as a Contract
- All containers auto-injected with secrets
- Non-org images blacklisted in the API proxy
- GDPR as a contract
- ...
- In all cases, the tool is secondary; the contract is discussed and agreed upon first
30. Lessons
- The DevOps team is another core engineering team providing services that applications integrate with
- We provide contracts, and service implementations that fulfil those contracts
- Have time to innovate and add value on top instead of handling tickets
- Wherever we added heavy tests around the contract with mock applications, infrastructure quality went up
- Wherever possible, we dogfood the contract ourselves
- Discuss hard before agreeing on a contract, and then go deliver
- Keep it simple/small, semantic, instrumentable, and usable from dev machines to prod
- Adopt a mature industry contract and reduce its API surface wherever possible, to avoid redesigning from scratch
- Not universally applicable; the implementation of the tool still matters
- Continuously upgrade infrastructure and provide good UX for engineers