Pierre Mavro – Staff SRE Lead - @deimosfr
Meetup S.R.E Tech Talk – Criteo Paris 05/28/2019
Kubernetes at NoSQL (Team)
🎂 +4y Criteo – Staff SRE Lead
BirdSight.co – Co-Founder
📖 Author of MariaDB High Performance
Distributed software & infra lover
deimosfr
pmavro
Pierre Mavro
The mission of the team is to provide NoSQL databases and support solutions that
answer several kinds of requests:
• Volatile Key/Value cache
• Key/value cache with persistence and replication
• Highly available, very scalable and persisted storage
• Scalable full text search storage
We take care of non-relational storage systems, including:
• Memcached
• Couchbase
• Cassandra
• Elasticsearch
NoSQL Team: Mission
Other numbers
• Team members: 4
• Number of Datacenters: 7
• Total servers: +4500
• Max Couchbase QPS: 5.8M
• Max Memcached QPS: 87M
• Avg Couchbase + Memcached total RAM used: +600TB
Number of servers per techno (chart): Cassandra, ElasticSearch, Memcached, Couchbase; values shown: 201, 83, 2408, 1971
NoSQL Team: world wide numbers
• Where we started from
• Where we are now
• What we learned
Why we moved to Kubernetes for stateful apps
1. Where we started from
• We’re managing +200 logical clusters
• Pets vs. cattle: treating stateful apps as cattle is hard
• Hardware failures affect up to 5% of the total nodes (40k nodes)
• Within the NoSQL perimeter: 11%
• Our needs and challenges:
• High availability
• High scaling
• Low latency
• Self service
Context: managing stateful apps at NoSQL
Context: configuration management
All the stack is managed by Chef:
Pros
Automation everywhere
Single entrypoint for everything
Teams are responsible for their cookbooks layer
Cons
Software too tightly coupled to 1 machine
Hardware lifecycle = software lifecycle
Application
OS
Hardware
Chef management
• Commission & decommission
• SLA protection during maintenance
• Application cluster lifecycle is hard to manage when software is too tightly coupled to hardware
• Maintenance requires manual interventions
• Too much time spent on recurring tasks
Scaling issues
2. Where we are now
Solution: use a scheduler for application management
The application is not managed by Chef anymore
Pros
Automation still everywhere
Hardware lifecycle != software lifecycle
Reduced hard link between hardware and software
Better SLO and SLA
Cons
Added a new element in the stack
Application
Kubernetes
OS
Hardware
Chef management
• Chef: deploys Kubernetes
• Split Kubernetes clusters per techno (couchbase, memcached…)
• Consul: we’re running +35k agents -> consul-k8s (thx Hashicorp)
• Prometheus: operator to integrate our Prometheus federation
• Fluentd operator to forward logs to Elasticsearch
• No network overlay for production workload
• Local storage for I/O performance
Integration to our existing stack
3. Lessons learned
Questions to answer
• Should I use a Deployment or a Statefulset?
• Requests/Limits
• Collocation?
Statefulset?
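The Requests/Limits question matters because Kubernetes derives a pod's QoS class (Guaranteed, Burstable or BestEffort) from them, and that class influences how the pod is treated when a node runs short of resources. A minimal Python sketch of that rule follows. It is an illustration, not something shown in the talk: the `qos_class` helper and the dict layout are hypothetical, and quantities are compared as plain strings, so "1" and "1000m" would be treated as different.

```python
def qos_class(containers):
    """Return the QoS class of a pod from its containers' requests and limits.

    `containers` is a list of dicts with optional "requests" and "limits"
    mappings (hypothetical layout, simplified from the pod spec).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "2", "memory": "4Gi"},
            "limits": {"cpu": "2", "memory": "4Gi"}}]
    print(qos_class(pod))
```

For a latency-sensitive stateful workload, setting requests equal to limits for both CPU and memory on every container is what yields the Guaranteed class.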
What should you use?
• Need named instances?
• Need graceful rolling restart?
• LivenessProbe + ReadinessProbe
• Hooks: PostStart + PreStop
• TerminationGracePeriodSeconds not too low
• Canary? (Partition)
• HostPath or Local Volume?
• Toleration: unreachable + not-ready
• What kind of node downtime can my application tolerate?
Statefulset useful parameters
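As a rough illustration of where the parameters above live in a StatefulSet manifest, here is a minimal Python sketch that builds one as a plain dict. The field names are those of the Kubernetes `apps/v1` StatefulSet API; everything else is an assumption made for the example and does not come from the talk: the `build_statefulset` helper, the image, the script paths, the `local-storage` storage class name, the sizes and the timeouts.

```python
import json


def build_statefulset(name, image, port, replicas=3, partition=0,
                      grace_seconds=120, node_downtime_seconds=300):
    """Build a StatefulSet manifest as a dict (illustrative values only)."""
    probe = {"tcpSocket": {"port": port}, "periodSeconds": 10}
    # Keep pods bound to a node that is briefly unreachable or not ready
    # instead of evicting them immediately.
    tolerations = [
        {"key": key, "operator": "Exists", "effect": "NoExecute",
         "tolerationSeconds": node_downtime_seconds}
        for key in ("node.kubernetes.io/unreachable",
                    "node.kubernetes.io/not-ready")
    ]
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "livenessProbe": dict(probe),
        "readinessProbe": dict(probe),
        "lifecycle": {
            # Hypothetical scripts: join the cluster, drain before stopping.
            "postStart": {"exec": {"command": ["/scripts/join.sh"]}},
            "preStop": {"exec": {"command": ["/scripts/drain.sh"]}},
        },
        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # assumes a headless Service with this name
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            # Canary: only pods with ordinal >= partition get the new revision.
            "updateStrategy": {"type": "RollingUpdate",
                               "rollingUpdate": {"partition": partition}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "terminationGracePeriodSeconds": grace_seconds,
                    "tolerations": tolerations,
                    "containers": [container],
                },
            },
            # Local volume through a hypothetical "local-storage" class.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {"accessModes": ["ReadWriteOnce"],
                         "storageClassName": "local-storage",
                         "resources": {"requests": {"storage": "100Gi"}}},
            }],
        },
    }


if __name__ == "__main__":
    manifest = build_statefulset("example-db", "example/db:1.0", 9000,
                                 replicas=3, partition=2)
    print(json.dumps(manifest, indent=2))
```

With `replicas=3` and `partition=2`, a rolling update only touches the pod with ordinal 2, which is one way to run a canary before lowering the partition to 0.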
• Know how your application works and behaves under the expected workload
• Determine whether a Statefulset or a Deployment better suits your needs
• Create an Operator to manage all manual operations (if needed)
• Run it as a Statefulset to understand the complexity in Kubernetes and adapt if needed
• Don’t overcomplicate things
Workflow to avoid failure
Thank you!
Q&A session
