5. ○ All our services run on AWS in two distinct accounts
○ Kubernetes, with 6 clusters
○ Monitoring is handled by Prometheus (with its operator) and Grafana
Context
The tech stack at Qonto
6. Problems
What leads to Thanos
○ How to get a single view across all these clusters?
○ How to keep this view over time?
7. First try
What I did in 2015
○ Prometheus remote write
○ Reformat metrics into Warp10 format
○ Learn and deploy the Warp10 components
○ Learn and deploy Kafka 0.8.2.4
○ Learn and deploy HDFS
○ Rewrite all dashboards to use WarpScript
○ Train all devs and ops
○ Maintain it
9. What we really want
And dare to ask
○ Seamless integration with Prometheus and Grafana
○ Managed storage outside Kubernetes
○ Easily modular and lightweight architecture
10. Why not Prometheus federation
Easy peasy, end of story?
[Diagram: HA Prometheus federation, with one federated Prometheus per Kubernetes cluster]
11. Why not federation
Still the same problems
○ Which Prometheus to query?
○ At which interval should the federated Prometheus be scraped?
○ Storage is still handled by Kubernetes (more generally, a finite hard drive)
12. Thanos.io
○ Open source, written in Go by Improbable
○ Based on the Prometheus 2.0 engine and gRPC
○ Split into distinct components
○ Highly available Prometheus setup with long term storage capabilities
15. Thanos sidecar
What is a sidecar
A sidecar is a utility container in the Pod whose purpose is to support the main container.
In our case, the Thanos sidecar runs alongside each Prometheus and:
● exposes a gRPC endpoint
● implements the Thanos Store API
● accesses the Prometheus chunks
18. Thanos sidecar
How to differentiate sidecars
Leverages the Prometheus external labels feature:
○ context: qonto and cbs
○ environment: production, staging and infra
kafka_brokers{context="qonto", environment="staging", service="kafka-broker-exporter"} 3
kafka_brokers{context="qonto", environment="production", service="kafka-broker-exporter"} 5
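In plain Prometheus configuration, external labels like the ones above are declared under `global.external_labels`; a minimal sketch (label values taken from the slide, for one staging cluster):

```yaml
# prometheus.yml (one per cluster; each cluster gets its own values)
global:
  external_labels:
    context: qonto
    environment: staging
```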
19. Thanos query
Link sidecars together
○ Centralizes queries: propagates PromQL to the sidecars and merges the results
○ Acts as a Prometheus data source in our Grafana
○ Sticks to the same UI as Prometheus
○ Also implements the Thanos Store API
20. Thanos query
Service discovery
○ Gossip protocol originally; removed in v0.5.0
○ File service discovery
○ Static list of Thanos components that implement the Store API
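File service discovery reuses the Prometheus file SD format; a minimal sketch of a target file passed to Thanos Query via `--store.sd-files` (addresses are hypothetical, 10901 is the default Store API gRPC port):

```yaml
# stores.yml: static target groups, reloaded by Thanos Query on change
- targets:
    - thanos-sidecar.production.qonto.co:10901
    - thanos-store.monitoring.svc:10901
```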
25. qonto.eu
Did it work the first time?
NOPE
component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error:
code = DeadlineExceeded desc = context deadline exceeded" address=thanos-sidecar.production.qonto.co
30. Fail #2
Solutions
○ Reduce the number of concurrent queries (--query.max-concurrent)
○ Modify all our Grafana dashboards to include filters on context and
environment labels
○ Be careful when requesting metrics.
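To illustrate the last two points, a dashboard query scoped with the external labels (metric name taken from the earlier slide; label values are one example combination) lets Thanos Query skip sidecars whose labels do not match:

```promql
# Scoped query: only stores whose external labels match are consulted
sum(kafka_brokers{context="qonto", environment="production"})
```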
34. Storage
Where and how to store the metrics
○ Thanos can use different storage backends: GCP, S3, Azure Storage
○ We chose S3 for obvious reasons
○ We need a way to upload metrics
○ And a way to retrieve them
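The object-store configuration Thanos expects is a small YAML document, typically mounted from a Kubernetes secret and passed via `--objstore.config-file`; a minimal S3 sketch (bucket name and endpoint are hypothetical):

```yaml
type: S3
config:
  bucket: thanos-metrics              # hypothetical bucket name
  endpoint: s3.eu-west-3.amazonaws.com
  # credentials are usually taken from the instance profile / environment
```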
35. Sidecar
Uploading the metrics
○ The sidecar uploads Prometheus chunks each time a new one is created
○ For each chunk, the sidecar enhances the associated meta.json with the values of the external labels
○ Can delay uploads to the storage backend if it is not available (network partition resilience)
Kubernetes secrets data
Prometheus operator config
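The external-label enrichment ends up in the block's meta.json under a `thanos` section; an abridged sketch (label values from the earlier slide, other block fields omitted):

```json
{
  "version": 1,
  "thanos": {
    "labels": {
      "context": "qonto",
      "environment": "production"
    },
    "downsample": { "resolution": 0 },
    "source": "sidecar"
  }
}
```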
37. Store Gateway
Retrieve the metrics
○ Uses the same object-store configuration as the sidecar
○ Implements the same Store API interface as the other Thanos components
○ Maintains a local index cache of the meta.json files (not the actual chunks) for performance purposes
Kubernetes deployment
40. Fail #3
AWS bucket permission
○ Chunks uploaded from the CBS account to the S3 bucket work
○ But the Store Gateway cannot read the chunks from the CBS account
○ Sidecars cannot change the object owner when uploading
43. Downsampling
Keep only relevant data
○ Raw metric data is stored, retrieved and displayed
○ But do we need the same precision for old data?
○ We need a way to reduce the precision over time: this is called downsampling
44. Compactor
Downsample the metrics
○ Defines retention time intervals for raw, five-minute and one-hour data
○ Can run either as a CronJob or a one-shot Job
○ Does not have a max age limit: you are responsible for removing chunks you no longer want
Kubernetes deployment
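The retention intervals map to compactor flags, one per resolution; a hedged sketch of an invocation (flag names from the Thanos CLI, durations and paths are hypothetical):

```shell
# Retention per resolution: raw, 5-minute and 1-hour downsampled data
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=90d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=2y
```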
46. Benefits of the project
○ Two-year metrics retention
○ Infinite storage without worrying about capacity
○ Selling a pure tech project / refactoring to stakeholders is easier
■ Retrospective
■ Enhancement: applications publish their own Prometheus metrics
○ First immersion in the Open Source community
47. Next Steps
What is possible with Thanos
○ Alerting with the Thanos Ruler
○ Thanos Store sharding based on metrics time range
○ Prometheus remote write to Thanos Receiver