Who’s this guy?
Samuel Bégin
Cloud systems developer
My background
Full-stack
developer
Introduced
to Docker
and Swarm
First
production
K8S cluster
Hired at
Coveo
Cloud infrastructure team
Maxime Poirier-Journeault
Cloud system specialist
Jérome Heil
Cloud system developer
Alexandre Rousseau
Cloud system developer
Dave Robitaille
Cloud system specialist
Martin Ouellet
Cloud infra team lead
Where we’re at
● Kubernetes is the main compute
engine
● 95% of the workload runs on Linux
● Everything is deployed autonomously
12 K8S clusters
8 Production clusters
12500 Pods
370 Nodes
7 Regions worldwide
How to keep track
of all those pieces?
Thanos and
Prometheus
Dev
Prometheus
• Defied the gods by stealing fire from them
and gave it to humanity in the form of
technology and knowledge
• Generally seen as the author of the human
arts and sciences
Prometheus alone only goes so far
- Good for the first few million time series
- Not so redundant
- Costly to have a long retention time
Thanos
(Thanatos)
• God of non-violent deaths
• Merciless and indiscriminate, hated by
mortals and gods alike
• Also a reference to the “snap of
disintegration” in Avengers
High Availability
- Two Prometheus instances
- Exact same configuration
- Anti-affinity and pod topology spread constraints
- Pod disruption budget
Sharding
- Split scraping configuration
- Exact same configuration
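One common way to split scrape targets across shards is to hash each target address and let every shard keep only the targets that hash to its index. The sketch below illustrates that idea only; the slides do not say how the split is done here, and `shard_for`, `targets_for_shard`, and the target addresses are hypothetical names.

```python
import hashlib


def shard_for(target: str, shard_count: int) -> int:
    """Map a target address to a shard index in [0, shard_count)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % shard_count


def targets_for_shard(targets, shard_index, shard_count):
    """Return the subset of targets that one shard should scrape."""
    return [t for t in targets if shard_for(t, shard_count) == shard_index]


# Hypothetical targets, for illustration only.
targets = [f"10.0.0.{i}:9100" for i in range(1, 9)]
shards = [targets_for_shard(targets, i, 2) for i in range(2)]
```

Because the hash depends only on the target address, every shard can run the exact same configuration apart from its own shard index, and each target is scraped by exactly one shard.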
Grafana and the global Thanos Querier
The lifetime of a time series
Your software exposes a metric
There are a lot of pre-made exporters,
but we also make our own
• Default exporters from Kubernetes
• Node exporter in the base AMI
• Custom exporters for Coveo software
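As a rough idea of what a custom exporter has to produce, the sketch below renders one metric family in the Prometheus text exposition format using only the standard library. It is a simplified illustration, not the exporters used at Coveo: `render_metric` and the metric name `myapp_requests_total` are hypothetical, and label values are not escaped.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
body = render_metric(
    "myapp_requests_total",
    "Total requests handled.",
    "counter",
    [({"status": "200"}, 42), ({"status": "500"}, 3)],
)
```

In practice an exporter would serve this text over HTTP, and an official client library would normally handle the formatting.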
The metric is scraped
And enters the time series database
• Uses ServiceMonitors
• Most metrics are served on endpoint /metrics
• Accessible through the Sidecar
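To make the scrape step concrete, here is a minimal sketch of turning the text served on `/metrics` into time series samples. It is a deliberate simplification of what Prometheus does: `parse_scrape` is a hypothetical helper, it ignores optional timestamps and escaped quotes, and it stamps every sample with the scrape time.

```python
import re

# Simplified patterns: no escaped quotes, no optional timestamp column.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_scrape(body: str, scrape_time: float):
    """Return {(metric_name, sorted_label_pairs): [(timestamp, value), ...]}."""
    series = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        key = (name, tuple(sorted(labels.items())))
        series.setdefault(key, []).append((scrape_time, float(raw_value)))
    return series


# Hypothetical scrape body, for illustration only.
text = '# TYPE up gauge\nup 1\nhttp_requests_total{method="get",code="200"} 1027\n'
scraped = parse_scrape(text, scrape_time=1700000000.0)
```

Each distinct combination of metric name and labels becomes its own time series, which is why label cardinality drives the number of series.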
Groups of 3 hours are
uploaded to S3
Prometheus makes a chunk out of 3
hours of data. One chunk is made for
each time series.
The Thanos sidecar uploads the chunks to
AWS S3.
Downsampling
The Thanos compactor downsamples our
time series. This helps with storage and
query speed over large time ranges.
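The idea behind downsampling can be sketched as grouping raw samples into fixed windows and keeping a few aggregates per window. This is an illustration only, not the Thanos implementation: `downsample` is a hypothetical helper, the 300-second window mirrors the 5-minute resolution Thanos offers, and the sample data is made up.

```python
from collections import defaultdict


def downsample(samples, resolution_s=300):
    """Aggregate (timestamp_s, value) samples into fixed-size windows.

    Returns {window_start: {"count", "sum", "min", "max"}}.
    """
    windows = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        windows[ts - ts % resolution_s].append(value)
    return {
        start: {"count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
        for start, v in sorted(windows.items())
    }


# Made-up raw series: one sample every 15 seconds for 10 minutes.
raw = [(t, float(t % 7)) for t in range(0, 600, 15)]
coarse = downsample(raw)
```

A query over months of data then reads one aggregate per window instead of every raw sample, which is where the gain in query speed comes from.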
Max retention
We configure various retention times for
time series in Thanos.
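Thanos lets retention be set separately for raw, 5-minute, and 1-hour resolution data. The sketch below shows the resulting decision for a single block; the retention values are hypothetical (the slides do not give the real ones), and `is_expired` is an illustrative helper rather than Thanos code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention per resolution; the real values are not in the slides.
RETENTION = {
    "raw": timedelta(days=30),
    "5m": timedelta(days=180),
    "1h": timedelta(days=365),
}


def is_expired(block_max_time: datetime, resolution: str, now: datetime) -> bool:
    """Return True when a block is older than the retention for its resolution."""
    return now - block_max_time > RETENTION[resolution]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
block_end = now - timedelta(days=90)
raw_expired = is_expired(block_end, "raw", now)
five_min_expired = is_expired(block_end, "5m", now)
```

With values like these, a 90-day-old block would be dropped at raw resolution while its downsampled 5-minute version would still be queryable.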
Grafana and the
global Thanos Frontend
How do we
make use of
all this
information?
Grafana
Alertmanager
Autoscalers
What’s next?
Thank you!
CNCF Meetup Part 1_ Thanos and Prometheus KT.pdf