4. Prometheus
- TSDB
- Open Source
- Hosted by the CNCF (second project accepted, after Kubernetes)
- Adapted to VM/containers monitoring
- Autodiscovery
- Pull model
- Multidimensional data
- Includes alerting
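To make "pull model" and "multidimensional data" concrete, here is a minimal sketch using the official Python client, `prometheus_client`. The metric name `app_requests` and the label names `source` and `status` are hypothetical, chosen only for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A dedicated registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# One metric, several dimensions (labels). Metric and label names are
# assumptions made for this example.
REQUESTS = Counter(
    "app_requests",
    "Requests handled, by source and status",
    ["source", "status"],
    registry=registry,
)

REQUESTS.labels(source="google_ads", status="ok").inc()
REQUESTS.labels(source="google_ads", status="ok").inc()
REQUESTS.labels(source="bing_ads", status="error").inc()

# Pull model: the application only exposes its current state as text;
# Prometheus fetches (scrapes) that text over HTTP on its own schedule.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

In a long-running service, `start_http_server` from the same library would typically serve this text over HTTP so that Prometheus can scrape it.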
5. Grafana
- Open-source metric analytics / visualisation
- multiple data sources: CloudWatch, Prometheus, InfluxDB, Elasticsearch, ...
- multiple dashboards already available
- works together with Prometheus exporters
10. 1. Debugging / Gain insight
"Where does the problem come from / What is going on?"
● Segment by sources (Google Ads, Fb Ads, Bing Ads, Taboola, etc.)
○ Did they slow down? Error rate gone up? Are they unavailable?
● Segment by category
○ Did we introduce a bug in that code?
● Segment by node
○ Do I have a problem on that node?
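The segmentation above maps naturally onto labels. A minimal sketch with `prometheus_client`, assuming a hypothetical histogram `external_call_duration_seconds` and hypothetical label values; one time series is kept per (source, category, node) combination, so each of the questions above becomes a filter on one label.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()

# Hypothetical metric: duration of calls to external sources,
# segmented by the three dimensions listed on the slide.
CALL_LATENCY = Histogram(
    "external_call_duration_seconds",
    "Duration of calls to external sources",
    ["source", "category", "node"],
    registry=registry,
)


def record_call(source, category, node, seconds):
    """Record one call duration under its three labels."""
    CALL_LATENCY.labels(source=source, category=category, node=node).observe(seconds)


# Sample observations (made-up values).
record_call("google_ads", "reporting", "node-1", 0.2)
record_call("google_ads", "reporting", "node-2", 1.5)
record_call("taboola", "reporting", "node-1", 0.3)

# Reading one series back: is google_ads slow on node-2 specifically?
slow_sum = registry.get_sample_value(
    "external_call_duration_seconds_sum",
    {"source": "google_ads", "category": "reporting", "node": "node-2"},
)
print(slow_sum)
```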
13. 1. Debugging / Gain insight
Combination with external data / corroboration
- deployments
- CPU/RAM/load on the node
- “Can we corroborate with an increase in slow queries in MongoDB?”
15. 2. Alerting
- Grafana alerts:
- alerts based on configured data sources
- Prometheus Alertmanager:
- can alert based on PromQL queries
- Infrastructure as Code
Instrument now, decide later
18. 3. Trends / Scale
● Trends over time, drive scale (technical) / business decisions
○ Capacity planning
○ "Will I (when will I) have a problem in the future?"
● SLA / QoS
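The question "when will I have a problem?" can be approximated by fitting a straight line through recent samples and extrapolating. The sketch below does this in plain Python with made-up disk-usage values; the function name and the data are assumptions, and PromQL's `predict_linear` offers a comparable linear extrapolation on the Prometheus side.

```python
def time_to_reach(samples, limit):
    """Least-squares line through (time, value) samples; returns the time at
    which the fitted line reaches `limit`, or None if there is no upward trend."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None  # all samples share one timestamp: no line to fit
    slope = cov / var
    if slope <= 0:
        return None  # flat or decreasing: the limit is never reached
    intercept = mean_v - slope * mean_t
    return (limit - intercept) / slope


# Hypothetical disk usage in GB, one sample per day.
usage = [(0, 10), (1, 20), (2, 30), (3, 40)]
day_full = time_to_reach(usage, limit=100)
print(day_full)
```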
20. Push (vs pull)
- Async, short-lived processes
- The Prometheus way => send metrics to a Pushgateway
- One Pushgateway per process!
- More infrastructure to set up
- Our way, the prometheus-distributed-client => send metrics to a database
- Available from everywhere
- Consistent in case of concurrent calls
- Use either
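For the Pushgateway option, a minimal sketch with `prometheus_client` might look like the following. The gateway address `localhost:9091`, the job name, and the metric names are assumptions; the push itself is wrapped in a function and not called here, since it needs a running Pushgateway. The prometheus-distributed-client API is not shown, as this sketch only covers the standard client.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def build_job_registry(duration_seconds):
    """Collect the metrics of one short-lived job run in a fresh registry."""
    registry = CollectorRegistry()
    last_success = Gauge(
        "batch_job_last_success_unixtime",
        "Last time the batch job finished successfully",
        registry=registry,
    )
    last_success.set_to_current_time()
    duration = Gauge(
        "batch_job_duration_seconds",
        "Duration of the last batch job run",
        registry=registry,
    )
    duration.set(duration_seconds)
    return registry


def push_job_metrics(registry, gateway="localhost:9091", job="nightly_batch"):
    """Push the registry to a Pushgateway (hypothetical address and job name)."""
    push_to_gateway(gateway, job=job, registry=registry)


# The process would exit right after pushing, so nothing is left to scrape.
registry = build_job_registry(12.5)
```

A short-lived script would call `push_job_metrics(registry)` as its last step; Prometheus then scrapes the Pushgateway instead of the process.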
21. Conclusion
- Try to always instrument your code
- Limit the cardinality of the metrics you use
- Make nice graphs!
- Use our lib: https://github.com/dolead/prometheus-distributed-client
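Why cardinality matters: each distinct combination of label values is stored as its own time series, so the worst case is the product of the number of values per label. A small illustration in plain Python, with hypothetical label sets:

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a handful of known values each (hypothetical).
bounded = {
    "source": ["google_ads", "fb_ads", "bing_ads", "taboola"],
    "status": ["ok", "error"],
    "node": ["node-1", "node-2", "node-3"],
}

# Adding one unbounded label (here a user id) multiplies everything.
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])

print(series_count(bounded))
print(series_count(unbounded))
```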