
CERN IT Monitoring

How CERN IT performs data centre and physics grid monitoring

Published in: Government & Nonprofit


  1. Monitoring with no limits (Nikolay Tsvetkov)
  2. Nikolay Tsvetkov, Senior Software Engineer and Service Manager, at CERN since 2013 (n.tsvetkov@cern.ch)
  3. Physics Lab: a leading physics laboratory since 1954; 23 member states; 2500 employees; home of the Large Hadron Collider and the CMS, ATLAS, LHCb and ALICE experiments
  4. Large Hadron Collider: ~27 km long; ~100 m under the ground; 1.9 K (-271.3 °C) operating temperature; 11,245 revolutions per second; > 1 billion collisions per second
  5. LHC Detectors (CMS / ATLAS / ALICE / LHCb): heavier than the Eiffel Tower; the CMS solenoid is the most powerful ever built: a 4 tesla magnetic field (> 100,000 times the Earth's), 6 x 13 m in size
  6. Detector Data Taking: > 1 billion collisions per second, filtered down to ~200 "interesting" events/s; data flow from all 4 detectors ~25 GB/s
  7. CERN Data Centre (DC): 15,000 servers; 260,000 processor cores; 130,000 disks and 30,000 magnetic tapes; 340 petabytes of data permanently archived; 115 petabytes of data written to magnetic tape in 2018 alone
  8. WLCG: a community of 12,000 physicists; ~300,000 jobs running concurrently; 170 sites; 900,000 processing cores; 700 PB of storage available worldwide; 15% of the resources are at CERN; 20-40 Gbit/s links connect CERN to the Tier-1s
  9. CERN IT Monitoring: Monitoring as a Service for the CERN Data Centre (DC), IT services and the WLCG collaboration; collect, transport, store and process metrics and logs for applications and infrastructure
  10. In 2016, MONIT was born to provide an effective, scalable and sustainable monitoring infrastructure for CERN IT
  11. Challenges, data rate & volume: from ~40k machines; > 3 TB/day (compressed); input rate ~100 kHz
  12. Challenges, variety: heterogeneous clients (IT Data Centre, WLCG transfers, experiments)
  13. Challenges, reliability: spikes in rate and volume; external service dependencies
  14. Challenges, non-technical: migrate from legacy dashboards and tools; stay up to date with upstream tools & trends; build a community, internal and external
  15. Goals, easy data integration: flexible on schema requirements; JSON/HTTP gateways to integrate custom metrics, logs and alarms; specific gateways for collectd, Prometheus, ActiveMQ, JDBC, ...
  16. Goals, data pipeline: schema independent; data aggregation/enrichment functionality; steering to the required storage backend; fully based on open-source technologies
  17. Architecture (diagram): HTTP / JMS / JDBC / AVRO inputs → transport → processing → storage in HDFS, InfluxDB and Elasticsearch (ES), with Kafka Connect (KC)
  18. Connectors: source/sink the data pipeline; validation and simple data filtering; metadata enrichment; HTTP / JMS / JDBC / AVRO inputs. Example event: { "producer": "myproducer", "type": "mytype", ... "mymetricfield": "value" }
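The validation and enrichment step can be sketched in a few lines of Python (a hypothetical illustration: the required fields follow the example event above, while the `gateway_host` field and the default timestamp are invented for the sketch, not the real MONIT schema):

```python
import json
import time

REQUIRED_FIELDS = ("producer", "type")  # fields from the example event above

def validate(event: dict) -> bool:
    """Simple data filtering: drop events missing the mandatory fields."""
    return all(field in event for field in REQUIRED_FIELDS)

def enrich(event: dict, host: str = "monit-gw-01") -> dict:
    """Metadata enrichment: stamp ingestion time and gateway host.
    Field names here are illustrative, not the real MONIT schema."""
    out = dict(event)
    out.setdefault("timestamp", int(time.time() * 1000))
    out["gateway_host"] = host
    return out

raw = '{"producer": "myproducer", "type": "mytype", "mymetricfield": "value"}'
event = json.loads(raw)
if validate(event):
    event = enrich(event)
```

Invalid events would simply be dropped before reaching the transport layer.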
  19. Apache Flume: protocol-based agents (sources and sinks) for JDBC, JMS, HDFS, Elasticsearch, HTTP and Kafka; Interceptors / Morphlines for event transformation; 14 agent "types" in MONIT, > 200 instances; scales horizontally
  20. DC metrics producer: running on > 40k machines in the Data Centre; the collectd daemon collects metrics/alarms locally (plugin-based, out-of-the-box OS monitoring, plus a framework for implementing custom plugins); local Flume agents provide data buffering
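The local buffering role of those agents can be illustrated with a minimal bounded buffer (a toy sketch, not the actual Flume implementation; the capacity, batch size and drop-oldest policy are assumptions made for the example):

```python
from collections import deque

class LocalBuffer:
    """Minimal sketch of a local agent buffer: hold metrics while the
    transport is unavailable and drop the oldest when the bound is hit."""

    def __init__(self, capacity: int = 1000):
        self.queue = deque(maxlen=capacity)  # oldest entries dropped first

    def put(self, metric: dict) -> None:
        self.queue.append(metric)

    def flush(self, batch_size: int = 100) -> list:
        """Hand a batch over to the transport layer (here: just return it)."""
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch

buf = LocalBuffer(capacity=3)
for i in range(5):                        # 5 metrics into a capacity-3 buffer
    buf.put({"metric": "cpu", "value": i})
batch = buf.flush()                       # only the 3 newest survive
```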
  21. Transport layer: the backbone of our pipeline; decouples producers and consumers; enables stream processing; resilient (72 hours of data retention); reliable (3 replicas)
  22. Kafka cluster (transport layer): on-premises (v1.0.2), based on OpenStack VMs; 20 brokers; ~15k partitions in total; CEPH volumes (2 TB each) as spool (be careful with storage latencies!); rack-awareness: 1 replica per "availability zone"
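The "1 replica per availability zone" idea can be illustrated with a toy placement function (illustrative only: Kafka's real rack-aware assignment algorithm is more involved, and the broker ids and zone names here are invented):

```python
def place_replicas(partition: int, zones: dict) -> list:
    """Toy rack-aware placement: pick one broker from each availability
    zone, rotating by partition id so load spreads across brokers.
    (Illustrative only; Kafka's real assignment algorithm differs.)"""
    placement = []
    for i, (zone, brokers) in enumerate(sorted(zones.items())):
        placement.append(brokers[(partition + i) % len(brokers)])
    return placement

zones = {"zone-a": [0, 1], "zone-b": [2, 3], "zone-c": [4, 5]}
replicas = place_replicas(partition=0, zones=zones)  # one broker per zone
```

With 3 replicas spread over 3 zones, losing an entire zone still leaves two copies of every partition.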
  23. Processing platform: transformation (parsing, field extraction and filtering); enrichment (combining data from different sources); correlation/aggregation over time or other dimensions; anomaly detection
  24. Processing platform users: a Mesos cluster (Marathon & Chronos) for orchestration; CERN IT Hadoop/HDFS; GitLab CI
  25. Processing platform: Logstash integrated for on-the-fly log transformation; Spark Structured Streaming in the lead role (joins data streams easily, handles late events); running ~20 Spark production jobs (24/7)
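The late-event handling that Spark Structured Streaming provides through watermarks can be illustrated with a pure-Python toy (not Spark's API; the window and watermark sizes are invented for the example):

```python
def window_counts(events, window_s=60, watermark_s=120):
    """Count events per time window while tolerating late arrivals:
    events older than (latest event time seen - watermark) are dropped,
    anything newer is still folded into its (possibly old) window."""
    counts, max_seen = {}, 0
    for ts in events:                        # ts = event time in seconds
        max_seen = max(max_seen, ts)
        if ts < max_seen - watermark_s:
            continue                         # too late: beyond the watermark
        window = ts - ts % window_s
        counts[window] = counts.get(window, 0) + 1
    return counts

# out-of-order arrival: 10 is late but within the watermark, 65 is too late
counts = window_counts([5, 70, 130, 10, 400, 65])
```

The watermark bounds how much state the job must keep, which is what makes 24/7 streaming jobs sustainable.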
  26. Storage: providing the right storage for each use case; integrating storages as data sources for visualization; direct query access through APIs; long-term data archive
  27. InfluxDB (storage): time-series DB for storing metrics/alarms; > 30 instances (due to the lack of a cluster mode in the free version); performance is tied to data cardinality; retention policies of up to 15 years (thanks to automatic down-sampling)
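Automatic down-sampling is what makes a 15-year retention policy affordable: old data is kept only at coarse resolution. The idea can be sketched as averaging raw samples into wider buckets, as an InfluxDB continuous query would do (a toy illustration, not InfluxDB's API; the bucket sizes are invented):

```python
def downsample(points, bucket_s=300):
    """Average raw (timestamp, value) samples into bucket_s-wide buckets,
    the way a retention policy plus a continuous query keeps old data cheap."""
    buckets = {}
    for ts, value in points:
        key = ts - ts % bucket_s
        buckets.setdefault(key, []).append(value)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

raw = [(0, 1.0), (60, 3.0), (300, 10.0)]    # 1-minute raw samples
coarse = downsample(raw, bucket_s=300)      # 5-minute averages
```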
  28. Elasticsearch (storage): distributed search and indexing engine; 3 clusters (syslog, service logs and metrics); stores time-series data with high-cardinality fields; ~100 TB total storage at a 1-month retention policy
  29. HDFS (storage): long-term data archive platform for Big Data analysis; data kept forever (or per GDPR agreement); compressed JSON / Parquet; partitioned by "date / producer / type"
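The "date / producer / type" partitioning can be sketched as a path-building helper (the path prefix and function name are illustrative, not the exact MONIT layout):

```python
from datetime import datetime, timezone

def partition_path(producer: str, dtype: str, ts_ms: int) -> str:
    """Build a date/producer/type partition path for the archive
    (the path template is illustrative, not the exact MONIT one)."""
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
    return f"/monit/archive/{day.isoformat()}/{producer}/{dtype}"

path = partition_path("myproducer", "mytype", 1546300800000)  # 2019-01-01 UTC
```

Partitioning this way lets analysis jobs prune by day and producer instead of scanning the whole archive.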
  30. Kafka Connect: a Kafka framework for exchanging data with other systems; supports a variety of connector types (HDFS, S3, Elasticsearch, InfluxDB, ...); a single KC cluster handles different connectors; resilient & scalable
  31. Kafka Connect cluster: 10 VMs, 44 topics, 880 tasks; writing to HDFS directly in Parquet (converting records from JSON); one connector per topic distributes well; compaction is required afterwards (too-small files are created when the in-memory buffers fill up)
  32. Visualization: Grafana is a "first-class citizen" (~1000 dashboards across > 20 organizations; users are in charge of creating their own); Kibana for data exploration (secured private endpoints for sensitive logs); SWAN notebooks for data analysis
  33. Monitoring of the Monitoring (MOM): a second data pipeline for monitoring our own infrastructure; all MONIT metrics and logs are sent to both flows; data is de-duplicated and merged at the storage level; MOM uses more external services to avoid replicating configuration problems (e.g. in Kafka)
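Storage-level de-duplication of the two flows can be sketched as keeping the first record seen for each key (a toy illustration; the key fields and record shapes are assumed, not the real MONIT schema):

```python
def deduplicate(records):
    """Merge the two flows by keeping the first record seen for each
    (producer, type, timestamp) key; a simple stand-in for the
    storage-level de-duplication described above."""
    seen, merged = set(), []
    for rec in records:
        key = (rec["producer"], rec["type"], rec["timestamp"])
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged

flow_a = [{"producer": "monit", "type": "metric", "timestamp": 1, "v": 10}]
flow_b = [{"producer": "monit", "type": "metric", "timestamp": 1, "v": 10},
          {"producer": "monit", "type": "metric", "timestamp": 2, "v": 11}]
merged = deduplicate(flow_a + flow_b)   # the duplicate at timestamp 1 is dropped
```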
  34. MOM Data Flow (diagram): MONIT metrics & logs enter both the MONIT pipeline (HTTP / JMS / JDBC / AVRO → KC → HDFS) and the external MOM pipeline (HTTP / AVRO)
  35. Lessons learned: the pipeline approach pays back (a reliable/resilient service; decoupling, buffering, stream processing); Kafka is a solid backbone for the system; connectors & storages are the most operationally expensive parts
  36. What are the next steps? Extend the Kafka Connect usage; run the connectors on Kubernetes (K8s); Spark on K8s for the processing platform; why not also look into KSQL?; looking into alternative time-series DBs
  37. MONIT keeps growing. In the last 12 months alone: +30% new data producers (180 total); +20% data volume per day (~3.2 TB/day total); +400% new dashboards (1000 total); an increase to ~1,000,000 queries/day ... more clients means new challenges!
  38. Summary: MONIT is a flexible, general-purpose monitoring infrastructure; easily implementable at a smaller scale; an approach that might serve other use cases outside the MONIT scope
  39. Thank you!
  40. Spare slides
  41. Alarms: local on the machine (simple thresholds / actuators); Grafana dashboard alarms; external (Spark, Kapacitor, custom sources, ...); integration with the ServiceNow ticketing system
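The simple local threshold alarms can be sketched as follows (a hypothetical illustration: the metric names, limits and the ServiceNow target field are invented for the example):

```python
def check_thresholds(metrics, limits):
    """Simple local threshold alarms: return an alarm record for every
    metric exceeding its limit (the ticket target field is illustrative)."""
    alarms = []
    for name, value in metrics.items():
        limit = limits.get(name)
        if limit is not None and value > limit:
            alarms.append({"metric": name, "value": value,
                           "limit": limit, "target": "ServiceNow"})
    return alarms

alarms = check_thresholds({"cpu_load": 9.5, "disk_used_pct": 42.0},
                          {"cpu_load": 8.0, "disk_used_pct": 90.0})
```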
