Criteo meetup - S.R.E Tech Talk

Pierre Mavro – Staff SRE Lead - @deimosfr
Meetup S.R.E Tech Talk – Criteo Paris 05/28/2019
Kubernetes at NoSQL (Team)

🎂 +4y Criteo – Staff SRE Lead
BirdSight.co – Co-Founder
📖 Author of MariaDB High Performance
Distributed software & infra lover
deimosfr
pmavro
Pierre Mavro

NoSQL Team: Mission
The mission of the team is to provide NoSQL databases and support solutions that answer several kinds of requests:
• Volatile key/value cache
• Key/value cache with persistence and replication
• Highly available, very scalable, persistent storage
• Scalable full-text search storage
We take care of non-relational storage systems, including:
• Memcached
• Couchbase
• Cassandra
• Elasticsearch

NoSQL Team: worldwide numbers
[Chart: number of servers per technology (Cassandra, Elasticsearch, Memcached, Couchbase); values shown: 201, 83, 2408, 1971]
Other numbers
• Team members: 4
• Number of datacenters: 7
• Total servers: +4500
• Max Couchbase QPS: 5.8M
• Max Memcached QPS: 87M
• Avg Couchbase + Memcached total RAM used: +600TB

• Where we started from
• Where we are now
• What we learned
Why we moved to Kubernetes for stateful apps

Context: managing stateful apps at NoSQL
• We’re managing +200 logical clusters
• Pets vs. cattle: treating stateful apps as cattle is hard
• Hardware failures affect up to 5% of all nodes (40k nodes)
  • Within the NoSQL perimeter: 11%
• Our needs and challenges:
  • High availability
  • High scalability
  • Low latency
  • Self-service

Context: configuration management
The whole stack is managed by Chef:
Pros
• Automation everywhere
• Single entry point for everything
• Teams are responsible for their cookbook layers
Cons
• Software too tightly coupled to a single machine
• Hardware lifecycle = software lifecycle
[Diagram: Application, OS, and Hardware layers, all under Chef management]

Scaling issues
• Commissioning and decommissioning
• SLA protection during maintenance
• Application cluster lifecycle is hard to manage when software is too tightly coupled to hardware
• Maintenance requires manual intervention
• Too much time spent on recurring tasks

Solution: use a scheduler for application management
The application is not managed by Chef anymore
Pros
• Automation still everywhere
• Hardware lifecycle != software lifecycle
• Looser coupling between hardware and software
• Better SLO and SLA
Cons
• Adds a new element to the stack
[Diagram: Application, Kubernetes, OS, and Hardware layers; Chef management no longer covers the Application]

Integration with our existing stack
• Chef: deploys Kubernetes
• Kubernetes clusters split per technology (Couchbase, Memcached…)
• Consul: we’re running +35k agents -> consul-k8s (thanks, HashiCorp)
• Prometheus: operator to integrate with our Prometheus federation
• Fluentd: operator to forward logs to Elasticsearch
• No network overlay for production workloads
• Local storage for I/O performance
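To illustrate the "local storage for I/O performance" point, here is a minimal sketch of what a local PersistentVolume and its StorageClass could look like, built as Python dictionaries and printed as JSON. The talk does not show the actual manifests, so the names (`local-storage`, the node name, the disk path, the capacity) are hypothetical placeholders; only the Kubernetes field names are standard.

```python
import json


def local_volume_manifests(node: str, path: str, capacity: str) -> dict:
    """Build a StorageClass and a local PersistentVolume pinned to one node.

    All names and values passed in are illustrative assumptions, not
    settings taken from the talk.
    """
    storage_class = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "local-storage"},  # hypothetical name
        # Local volumes are not dynamically provisioned.
        "provisioner": "kubernetes.io/no-provisioner",
        # Delay binding until a pod is scheduled, so the scheduler can
        # pick a node that actually owns the disk.
        "volumeBindingMode": "WaitForFirstConsumer",
    }
    volume = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"local-{node}"},
        "spec": {
            "capacity": {"storage": capacity},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": "local-storage",
            "local": {"path": path},
            # A local volume must declare which node holds the data.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node],
                        }]
                    }]
                }
            },
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [storage_class, volume]}


if __name__ == "__main__":
    # Hypothetical node name, mount path, and size.
    print(json.dumps(local_volume_manifests("node-01", "/mnt/disks/ssd0", "500Gi"), indent=2))
```

The printed JSON can be saved to a file and inspected or applied with kubectl. The trade-off of this approach is that a pod using such a volume is tied to one node, which is why node downtime tolerance matters for stateful workloads.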

StatefulSet?
Questions to answer
• Should I use a Deployment or a StatefulSet?
• Requests/Limits
• Collocation?
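On the Requests/Limits question, one thing worth knowing is the QoS class Kubernetes derives from those values, since it influences which pods get evicted first under node pressure. The sketch below is a simplified model of that rule, not the Kubernetes implementation: it compares quantities as plain strings (so "1" and "1000m" would be treated as different), and the function name `qos_class` is hypothetical.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Return the QoS class a pod would get, under simplified rules.

    Guaranteed: every container has CPU and memory limits, and its
                requests (defaulting to the limits) equal them.
    BestEffort: no container sets any request or limit.
    Burstable:  everything else.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pinned = {"requests": {"cpu": "4", "memory": "16Gi"},
              "limits": {"cpu": "4", "memory": "16Gi"}}
    print(qos_class([pinned]))
```

For latency-sensitive databases, setting requests equal to limits is a common way to land in the Guaranteed class; whether that is what the team chose is not stated in the slides.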

StatefulSet useful parameters
What should you use?
• Need named instances?
• Need graceful rolling restarts?
• livenessProbe + readinessProbe
• Hooks: postStart + preStop
• terminationGracePeriodSeconds not too low
• Canary? (partition)
• hostPath or local volume?
• Tolerations: unreachable + not-ready
• How much node downtime can my application tolerate?
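The parameters listed above can be seen together in one manifest. The following is a minimal sketch that builds such a StatefulSet as a Python dictionary; every concrete value (image, port 11211, probe timings, the 120-second grace period, the 600-second toleration, the hook commands) is an assumption chosen for illustration, not a setting from the talk. The helper `pods_updated_by_rollout` is also hypothetical and simply models the partition rule: only pods whose ordinal is greater than or equal to the partition are updated.

```python
import json
from typing import List


def statefulset_manifest(name: str, image: str, replicas: int, partition: int) -> dict:
    """Build a StatefulSet manifest using the parameters from the slide."""
    container = {
        "name": name,
        "image": image,
        # Restart the container if it stops answering (assumed TCP check).
        "livenessProbe": {"tcpSocket": {"port": 11211}, "periodSeconds": 10},
        # Only send traffic once the instance is ready.
        "readinessProbe": {"tcpSocket": {"port": 11211}, "periodSeconds": 5},
        "lifecycle": {
            # Placeholder commands; real hooks would join/leave the cluster.
            "postStart": {"exec": {"command": ["/bin/sh", "-c", "echo started"]}},
            "preStop": {"exec": {"command": ["/bin/sh", "-c", "sleep 10"]}},
        },
    }
    pod_spec = {
        # Leave enough time for preStop and a clean shutdown.
        "terminationGracePeriodSeconds": 120,
        # How long a pod stays bound to a node that is down or not ready.
        "tolerations": [
            {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
             "effect": "NoExecute", "tolerationSeconds": 600},
            {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
             "effect": "NoExecute", "tolerationSeconds": 600},
        ],
        "containers": [container],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            # Canary: only ordinals >= partition receive the new revision.
            "updateStrategy": {"type": "RollingUpdate",
                               "rollingUpdate": {"partition": partition}},
            "template": {"metadata": {"labels": {"app": name}}, "spec": pod_spec},
        },
    }


def pods_updated_by_rollout(name: str, replicas: int, partition: int) -> List[str]:
    """Names of the pods a rolling update would touch for a given partition."""
    return [f"{name}-{ordinal}" for ordinal in range(partition, replicas)]


if __name__ == "__main__":
    print(json.dumps(statefulset_manifest("memcached", "memcached:1.5", 3, 2), indent=2))
    print(pods_updated_by_rollout("memcached", 3, 2))
```

Setting the partition to `replicas - 1` updates a single pod as a canary; lowering it step by step then rolls the change out to the remaining instances.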

Workflow to avoid failure
• Know how your application works and behaves under the expected workload
• Determine whether a StatefulSet or a Deployment better suits your needs
• Create an Operator to manage all manual operations (if needed)
• Run it as a StatefulSet to understand the complexity in Kubernetes, and adapt if needed
• Don’t overcomplicate things
