Slides created with Google Slides: https://docs.google.com/presentation/d/1pZvS5BEFfXceS3xXIePKkeAx-aZpxhloNInaIHD5eTw/edit?usp=sharing
How do you monitor what you don't know? One of the technical challenges at Clever Cloud, besides scalability, is automatically monitoring all of our customers' technical stacks without knowing anything about them. Our first goal when we rebuilt our monitoring platform was to support our Immutable Infrastructure pattern, which generates large numbers of ephemeral hosts every minute. The traditional approach focuses on VMs and hosts, not on applications.
We had to change paradigm and adopt an approach based on automatic discovery of the metrics to monitor, and allow third-party code to publish its own metrics. This talk describes the path that led us to build Clever Cloud Metrics, based on Warp10 (itself based on Kafka/Hadoop/Storm), to improve our users' working conditions and the stability of our applications.
@clementd & @waxzce
Poke, good place to get sample
https://poke.digital
https://docs.google.com/presentation/d/1RfpX-KdfAa5ZxsnuRYi34JbidbVVUcz7bYy5x-k6BBE/edit?usp=sharing
https://www.pscp.tv/waxzce/1OwGWEEvapkxQ?t=4m51s
Clever Cloud:
IT automation: we run production, and we automate it
Public Cloud, Enterprise, On-Prem
Interesting things happen when you’re not looking. For errors, you have logs, sure, but for subtler things, you need a finer view (CPU use, network usage, latency, …)
Usually metrics are an ops thing: system-level metrics like CPU, RAM, network. Application-level metrics are important too: JVM GC status, number of active sessions, number of validated carts…
Zabbix, Centreon, and similar tools are made for ops. Metrics are gathered host by host (i.e. machine by machine), and access control is complicated.
Normally only ops have access to this platform, so it is hard for devs to get access
Application-level metrics are super important, and help to produce better applications.
We tend to use metrics to gather information about outages or issues, but also to see how people use the applications, or to help with perf optimization efforts
There are metrics systems dedicated to application-level metrics, but both views are important.
Servers still happen, so application-level and system-level metrics are equally important: each gives useful context for the other.
You need a unified metrics gathering pipeline, for both system-level and application-level metrics
Systems like Zabbix and Centreon are not suited for new architectures
Traditional monitoring / metrics in an immutable infrastructure does not make sense
VMs are short-lived and disposable. In Zabbix, metrics are lost when a VM is shut down.
In an immutable infrastructure, one service is handled by many servers over its lifetime. You need to aggregate metrics by service, not look at them server by server.
Instead of a few servers, there are now hundreds or thousands of ephemeral virtual machines.
Traditional metrics pipelines have a hard time with that.
Clever Cloud: 20GB / hour
~100 series per VM, ~1000 new VMs every day.
Need to aggregate series with metadata, not feasible to consume data series by series.
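To make "aggregate series with metadata" concrete, here is a minimal sketch in Python. The series names, the labels (`app_id`, `host`) and the helper functions are illustrative assumptions, not the actual Clever Cloud Metrics data model: the point is only that series are selected and grouped by their labels rather than read one by one.

```python
from collections import defaultdict

# Hypothetical data: (metric name, labels, [(timestamp_s, value), ...]).
SERIES = [
    ("mem.used", {"app_id": "app_a", "host": "vm-1"}, [(0, 100.0), (60, 110.0)]),
    ("mem.used", {"app_id": "app_a", "host": "vm-2"}, [(0, 200.0), (60, 190.0)]),
    ("mem.used", {"app_id": "app_b", "host": "vm-3"}, [(0, 50.0), (60, 55.0)]),
]


def select(series, name, **wanted):
    """Keep series with the given name whose labels match every wanted key/value."""
    return [
        s for s in series
        if s[0] == name and all(s[1].get(k) == v for k, v in wanted.items())
    ]


def sum_by_label(series, label):
    """Sum values per timestamp, grouped by one label (e.g. app_id), ignoring hosts."""
    out = defaultdict(lambda: defaultdict(float))
    for _name, labels, points in series:
        for ts, value in points:
            out[labels[label]][ts] += value
    return {group: dict(sorted(by_ts.items())) for group, by_ts in out.items()}
```

Calling `sum_by_label(select(SERIES, "mem.used"), "app_id")` gives one aggregated series per application, whichever short-lived VMs happened to produce the points.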
Immutable infrastructure, Thousands of applications, lots of tech stacks.
Collection happens on the VMs, both for system & application level metrics
We can’t just send metrics naively to the platform.
Buffer: gather metrics and send batches periodically
Retry: re-send metrics if needed
Jitter: make sure all agents don’t send metrics at the same time
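The three ideas above (buffer, retry, jitter) can be sketched as a small Python class. This is an assumption-laden illustration, not the agent's real code: `BufferedSender`, its parameters and the `send` callback are all hypothetical names.

```python
import random
import time


class BufferedSender:
    """Hypothetical agent-side sender: buffers points, retries failed sends, jitters timing."""

    def __init__(self, send, batch_size=100, interval_s=10.0, jitter_s=5.0,
                 max_retries=3, retry_delay_s=1.0, sleep=time.sleep, rng=random.random):
        self.send = send                  # callable taking a list of points; raises OSError on failure
        self.batch_size = batch_size
        self.interval_s = interval_s
        self.jitter_s = jitter_s
        self.max_retries = max_retries
        self.retry_delay_s = retry_delay_s
        self.sleep = sleep
        self.rng = rng
        self.buffer = []

    def add(self, point):
        """Buffer: keep points locally and only send once a full batch is ready."""
        self.buffer.append(point)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def next_flush_delay(self):
        """Jitter: add a random offset so agents do not all flush at the same instant."""
        return self.interval_s + self.rng() * self.jitter_s

    def flush(self):
        """Retry: re-send the same batch a few times before giving up; keep it on failure."""
        if not self.buffer:
            return True
        batch = list(self.buffer)
        for attempt in range(self.max_retries + 1):
            try:
                self.send(batch)
            except OSError:
                if attempt < self.max_retries:
                    self.sleep(self.retry_delay_s + self.rng() * self.jitter_s)
                continue
            del self.buffer[:len(batch)]
            return True
        return False
```

A periodic loop would call `flush()` and then wait `next_flush_delay()` seconds; points that could not be sent stay in the buffer for the next attempt.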
Written in Go, easy to deploy, lots of input plugins
Understands the StatsD protocol. Does local aggregation
Polls a Prometheus endpoint, does gathering
StatsD is push-based: metrics are sent directly. Less work on the app side
Prometheus is pull-based: the app is queried by the agent. Not so convenient, but with Telegraf it’s not so bad.
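To illustrate the push/pull difference from the application's side, here is a minimal Python sketch. The metric names are made up, and the agent address (`127.0.0.1:8125`, the usual StatsD port) is an assumption about the local setup. With StatsD the app formats a line and fires it at the agent; with Prometheus the app only renders its current values as text and waits to be scraped.

```python
import socket


def statsd_line(name, value, kind="c"):
    """Format one StatsD datagram: <name>:<value>|<type> (c=counter, g=gauge, ms=timer)."""
    return f"{name}:{value}|{kind}"


def push_statsd(line, host="127.0.0.1", port=8125):
    """Push: send a UDP datagram to the agent and move on (no reply is expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def prometheus_exposition(metrics):
    """Pull: render current values in the Prometheus text format for the agent to scrape.

    `metrics` maps (name, ((label, value), ...)) to a number.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        rendered = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{rendered}}} {value}" if rendered else f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For example, `statsd_line("carts.validated", 1)` yields `carts.validated:1|c`, while the pull side would serve the output of `prometheus_exposition(...)` from an HTTP endpoint that the agent polls.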
Lots of data: 1 point per minute for each of ~100 series, plus 1 point every 10 seconds for a few series such as CPU & RAM
We have specific needs: immutable infra, lots of applications, lots of tech stacks
several thousand instances running at the same time
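A rough back-of-the-envelope calculation shows what these rates add up to. The per-series rates come from the notes above; the number of "fast" series (2, for CPU and RAM) and the fleet size (3000 instances) are assumptions picked only for illustration.

```python
def points_per_hour(slow_series=100, slow_period_s=60, fast_series=2, fast_period_s=10):
    """Points emitted by one VM in an hour, given per-series sampling periods."""
    return slow_series * 3600 // slow_period_s + fast_series * 3600 // fast_period_s


def fleet_points_per_second(instances=3000, **kwargs):
    """Average ingestion rate for the whole fleet (assumed instance count)."""
    return instances * points_per_hour(**kwargs) / 3600


if __name__ == "__main__":
    print(points_per_hour())          # 6720 points per VM per hour
    print(fleet_points_per_second())  # 5600.0 points per second for 3000 VMs
```

Under these assumptions each VM produces 6720 points per hour, and the platform must sustain a few thousand writes per second continuously.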
We need both real time access for the dashboard,
and analysis over longer periods.
DBs made to store Time Series
Time series: a series of points over time, e.g. used RAM over time, or % idle CPU time. Each point measures the same thing at a different time.
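As a data structure, a time series can be sketched as a name, a set of labels, and a time-ordered list of points. The class below is a hypothetical Python illustration, not the storage layout of any particular TSDB.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    name: str                                   # what is measured, e.g. "mem.used"
    labels: dict                                # metadata, e.g. {"host": "vm-1"}
    points: list = field(default_factory=list)  # (timestamp_s, value), sorted by time

    def add(self, timestamp_s, value):
        """Record one measurement; points stay ordered even if they arrive late."""
        self.points.append((timestamp_s, value))
        self.points.sort(key=lambda p: p[0])

    def between(self, start_s, end_s):
        """Points with start_s <= timestamp < end_s."""
        return [p for p in self.points if start_s <= p[0] < end_s]
```

Most queries on such data are range scans over time (`between`), which is one reason TSDBs are organised differently from general-purpose databases.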
TSDBs have different needs than regular DBs
What is cardinality, why is it an important thing?
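In TSDB terms, cardinality is commonly understood as the number of distinct series, i.e. distinct combinations of metric name and label values. The short Python sketch below (with made-up metric and label names) shows why ephemeral VMs drive it up: every new host label value creates new series, even for the same application.

```python
def series_key(name, labels):
    """Identity of a series: its name plus its full, order-independent label set."""
    return (name, tuple(sorted(labels.items())))


def cardinality(observations):
    """Number of distinct series among (name, labels) observations."""
    return len({series_key(name, labels) for name, labels in observations})


# Hypothetical example: 2 metrics reported by 3 VMs of the same application.
OBSERVED = [
    (metric, {"app_id": "app_a", "host": host})
    for metric in ("cpu.idle", "mem.used")
    for host in ("vm-1", "vm-2", "vm-3")
]
```

Here `cardinality(OBSERVED)` is 6, and each replacement VM adds more. With the figures quoted earlier (~100 series per VM, ~1000 new VMs every day), that is on the order of 100,000 new series per day.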
Allow access delegation, everybody talks directly to the DB.
Secure by default, simpler network config
Different way to use the stored data.
Just list points
Aggregate and filter things
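The ways of using stored data listed above can be sketched in plain Python. This is only an illustration of the ideas (listing recent points, aggregating into time buckets, filtering the result); the function names are hypothetical and this is not the query language of the platform.

```python
def last_points(points, count):
    """Just list points: the `count` most recent (timestamp_s, value) pairs."""
    return sorted(points)[-count:] if count > 0 else []


def bucket_mean(points, bucket_s):
    """Aggregate: mean value per fixed-width time bucket, keyed by bucket start."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_s, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def below(aggregated, threshold):
    """Filter: keep only buckets whose aggregated value is under the threshold."""
    return {start: value for start, value in aggregated.items() if value < threshold}
```

For instance, with % idle CPU points, `below(bucket_mean(points, 60), 50.0)` would keep only the minutes where the CPU was idle less than half of the time.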
https://quantum.services.clever-cloud.com/#/warpscript/IlFCUGY0WkREYk5DNXE3TEwyVkUzVFRSZEdWZVk2V0Nqbm5JLnUxZ2hBRm9fMmNrSjNkZXMyQXlxcmsyWGd5T3RRZ3RKeFQ5d3dLblFaTi53WEdVVnlsNnZLbUN3Zzc4clIyWHYudE5fWHRXZ3JlUXZiQjhsVnVGQTVjS1NSd05tUkYzZUpReVFNOUYwdmR6NGdGRFJZVTNFNkRDOFZxUjhkMEhZZkwuTldxRnJxMk1OLmdRUGxoSXJBYUp6X29MZTdvZUpydFNXOFJ3IgondG9rZW4nIFNUT1JFCgokdG9rZW4KJ2Zhc3RfY3B1LnVzYWdlX2lkbGUnCnsgJ2hvc3QnICd%2BKDI2NjhiNDQ1LThiMmQtNDk1MS05OGRlLTgyYjcxNGVkOGU3YXw5ZTk0ZDU5NC0xOTU5LTQ2MjUtOTAwNC1iYmMzY2FlOGU2MTApJyAnYXBwX2lkJyAnPWFwcF9iODg2MTdhOC02MDRmLTQ2N2YtYTk2Zi02MDZjYWJhYjNjODYnICdjcHUnICdjcHUtdG90YWwnIH0KTk9XIC0xNDAKRkVUQ0g%3D/eyJ1cmwiOiJodHRwczovL2MxLXdhcnAxMC1jbGV2ZXJjbG91ZC1jdXN0b21lcnMuc2VydmljZXMuY2xldmVyLWNsb3VkLmNvbS9hcGkvdjAiLCJmZXRjaEVuZHBvaW50IjoiL2ZldGNoIiwiaGVhZGVyTmFtZSI6IlgtV2FycDEwIn0%3D