5. ○ All our services run on AWS in two distinct accounts
○ Kubernetes, with 6 clusters
○ Monitoring is handled by Prometheus (with its operator) and Grafana
Context
The tech stack at Qonto
6. Problems
What leads to Thanos
○ How to get a single view across all these clusters?
○ How to keep this view over time?
7. First try
What I did in 2015
○ Prometheus remote write
○ Reformat metrics into Warp10 format
○ Learn and deploy the Warp10 components
○ Learn and deploy Kafka 0.8.2.4
○ Learn and deploy HDFS
○ Rewrite all dashboards to use WarpScript
○ Train all devs and ops
○ Maintain it
9. What we really want
And dare to ask
○ Seamless integration with Prometheus and Grafana
○ Managed storage outside Kubernetes
○ Easily modular and lightweight architecture
10. Why not Prometheus federation
Easy peasy, end of story?
[Diagram: HA Prometheus federation, with one federated Prometheus per Kubernetes cluster]
11. Why not federation
Still the same problems
○ Which Prometheus to query?
○ At which interval should the federated Prometheus be scraped?
○ Storage is still handled by Kubernetes (more generally, a finite hard drive)
12. Thanos.io
○ Open source, written in Go by Improbable
○ Based on the Prometheus 2.0 engine and gRPC
○ Split into distinct components
○ Highly available Prometheus setup with long term storage capabilities
15. Thanos sidecar
What is a sidecar
A sidecar is a utility container in the Pod whose purpose is to support the main container.
In our case, the Thanos sidecar runs alongside each Prometheus and:
● exposes a gRPC endpoint
● implements the Thanos Store API
● accesses the Prometheus chunks
18. Thanos sidecar
How to differentiate sidecars
Leverages the Prometheus external labels feature:
○ context: qonto and cbs
○ environment: production, staging and infra
kafka_brokers{context="qonto", environment="staging", service="kafka-broker-exporter"} 3
kafka_brokers{context="qonto", environment="production", service="kafka-broker-exporter"} 5
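In plain Prometheus configuration, external labels like the ones above are declared under `global.external_labels`; a minimal sketch (label values taken from the slide, for one staging cluster):

```yaml
# prometheus.yml (one per cluster; each cluster gets its own values)
global:
  external_labels:
    context: qonto
    environment: staging
```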
19. Thanos query
Link sidecars together
○ Centralizes queries: propagates PromQL to the sidecars and merges the results
○ Acts as a Prometheus data source in our Grafana
○ Sticks to the same UI as Prometheus
○ Also implements the Thanos Store API
20. Thanos query
Service discovery
○ Gossip protocol originally; removed in v0.5.0
○ File service discovery
○ Static list of Thanos components that implement the Store API
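File service discovery reuses the Prometheus file SD format; a minimal sketch of a target file passed to Thanos Query via `--store.sd-files` (addresses are hypothetical, 10901 is the default Store API gRPC port):

```yaml
# stores.yml: static target groups, reloaded by Thanos Query on change
- targets:
    - thanos-sidecar.production.qonto.co:10901
    - thanos-store.monitoring.svc:10901
```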
25. qonto.eu
Did it work the first time?
NOPE
component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error:
code = DeadlineExceeded desc = context deadline exceeded" address=thanos-sidecar.production.qonto.co
30. Fail #2
Solutions
○ Reduce the number of concurrent queries (--query.max-concurrent)
○ Modify all our Grafana dashboards to include filters on context and
environment labels
○ Be careful when requesting metrics.
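To illustrate the last two points, a dashboard query scoped with the external labels (metric name taken from the earlier slide; label values are one example combination) lets Thanos Query skip sidecars whose labels do not match:

```promql
# Scoped query: only stores whose external labels match are consulted
sum(kafka_brokers{context="qonto", environment="production"})
```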
34. Storage
Where and how to store the metrics
○ Thanos can use different storage backends: GCP, S3, Azure Storage
○ We chose S3 for obvious reasons
○ We need a way to upload metrics
○ And a way to retrieve them
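The object-store configuration Thanos expects is a small YAML document, typically mounted from a Kubernetes secret and passed via `--objstore.config-file`; a minimal S3 sketch (bucket name and endpoint are hypothetical):

```yaml
type: S3
config:
  bucket: thanos-metrics              # hypothetical bucket name
  endpoint: s3.eu-west-3.amazonaws.com
  # credentials are usually taken from the instance profile / environment
```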
35. Sidecar
Uploading the metrics
○ The sidecar uploads Prometheus chunks each time a new one is created
○ For each chunk, the sidecar enhances the associated meta.json with the values of the external labels
○ Can delay uploads to the storage backend if it is not available (network partition resilience)
Kubernetes secrets data
Prometheus operator config
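The external-label enrichment ends up in the block's meta.json under a `thanos` section; an abridged sketch (label values from the earlier slide, other block fields omitted):

```json
{
  "version": 1,
  "thanos": {
    "labels": {
      "context": "qonto",
      "environment": "production"
    },
    "downsample": { "resolution": 0 },
    "source": "sidecar"
  }
}
```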
37. Store Gateway
Retrieve the metrics
○ Uses the same object-store configuration as the sidecar
○ Implements the same Store API interface as the other Thanos components
○ Maintains a local index cache of the meta.json files (not the actual chunks) for performance purposes
Kubernetes deployment
40. Fail #3
AWS bucket permission
○ Chunks uploaded from the CBS account to the S3 bucket work
○ But the Store Gateway cannot read the chunks from the CBS account
○ Sidecars cannot change the object owner when uploading
43. Downsampling
Keep only relevant data
○ Raw metric data is stored, retrieved and displayed
○ But do we need the same precision for old data?
○ We need a way to reduce the precision over time: this is called downsampling
44. Compactor
Downsample the metrics
○ Defines retention time intervals for raw, five-minute and one-hour data
○ Can run either as a CronJob or a one-shot Job
○ Does not have a max age limit: you are responsible for removing chunks you no longer want
Kubernetes deployment
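The retention intervals map to compactor flags, one per resolution; a hedged sketch of an invocation (flag names from the Thanos CLI, durations and paths are hypothetical):

```shell
# Retention per resolution: raw, 5-minute and 1-hour downsampled data
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=90d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=2y
```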
46. Benefits of the project
○ Two-year metrics retention
○ Infinite storage without worrying about capacity
○ Selling a pure tech project / refactoring to stakeholders is easier
■ Retrospective
■ Enhancement: applications publish their own Prometheus metrics
○ First immersion in the Open Source community
47. Next Steps
What is possible with Thanos
○ Alerting with the Thanos Ruler
○ Thanos Store sharding based on metrics time range
○ Prometheus remote write to Thanos Receiver