CERN, the European Organization for Nuclear Research, is the world's largest particle physics research centre. Experiments in high-energy physics are carried out there using the particle accelerator and other provided infrastructure. The investigations around the Large Hadron Collider (LHC) require extensive IT infrastructure to process the data generated by the collisions. Even the monitoring of the LHC itself depends on a complex infrastructure. CERN IT provides staff and users with many different services and is, above all, the main actor of the LHC Grid. The Grid is the globally distributed computing and storage network that supplies the capacity needed to analyse the volume of data collected by the accelerator. It consists of 200,000 cores distributed across 34 countries. All of these large computing centres require careful monitoring, but each has its own peculiarities, so different monitoring strategies and tools have to be applied. This talk presents the many approaches to this challenge, along with an outlook on planned future developments.
OSMC 2012 | Monitoring at CERN by Christophe Haen
1. Monitoring at CERN
Christophe Haen
christophe.haen@cern.ch
Université Blaise Pascal, Clermont-Ferrand cedex, FR
LHCb Online team (CERN), GENEVA, CH
17th October 2012
2. Disclaimer
Disclaimer
This presentation covers only a very small fraction of the monitoring done at CERN
It is not intended to be a benchmark or a comparison of the solutions
I am not an expert in all of the software presented
3. Plan
1 CERN
Overview
The LHC project
Global picture
2 Monitoring at CERN
LHC
Experiments
LHCb
ATLAS
CMS
Alice
IT
Grid
3 Outlook
4 Conclusion
4. What people know about CERN
Glass Cathedral
World destroyer machines
Crazy physicists
Monitoring at CERN 1 C. Haen
5. What CERN really is
CERN in a nutshell
European Organization for Nuclear Research
Founded in 1954, near Geneva, Switzerland
Research, Technology, Collaboration and Education
Provides infrastructures needed for high-energy physics (HEP)
research
One of the world’s largest and most respected centers for
scientific research
7. LHC
One Ring to rule them all
World’s largest and highest-energy particle accelerator
27 kilometers in circumference, ≈ 100 m underground, 14 TeV, 11k turns per second, 40 MHz collision rate, −271 °C...
Answer some of the
fundamental open
questions in physics
10. Alice
Alice
A Large Ion Collider Experiment
Quark-gluon plasma
26m long, 16m high, 16m wide, 10k tonnes
11. Atlas
Atlas
A Toroidal LHC ApparatuS
General-purpose detector: Higgs boson, extra dimensions, dark matter
46 m long, 25 m high and 25 m wide, 7k tonnes
12. CMS
CMS
Compact Muon Solenoid
General-purpose detector: same goals as Atlas but different technical solutions
21 m long, 15 m wide and 15 m high, 12 500 tonnes
13. LHCb
LHCb
Large Hadron Collider beauty experiment
Studies the origin of the difference between matter and
anti-matter
21m long, 10m high and 13m wide, 5600 tonnes
14. Worldwide LHC Computing Grid
Grid
Experiments produce 25
petabytes of data per year
100k processors, 170
sites, 36 countries
15. Global picture
In brief
LHC produces data
Experiments record data
Grid analyzes data
CERN operates all this
Monitoring!
All this requires IT infrastructures, and thus monitoring
16. CERN accelerator controls infrastructure
Accelerator controls infrastructure
Control the proper behavior of the accelerators
Large geographical distances
Big diversity of equipment
Equipment
1500 Servers
PLC (Programmable logic controller)
300 fan trays
VME crates
Mostly Linux, including real time
17. Diamon
Motivations
Provide the accelerator operators with precise and easy-to-use tools to monitor the behavior of the controls infrastructure
Allow easy access to diagnostic tools that provide more detail and help solve a possible problem
Requirements
Very low footprint on the client
Highly configurable interface
18. C2MON
Requirements
Stick to proven technologies
Open-source components when possible (always a basic open-source fallback option)
Choices
Java with the Spring framework
JMS middleware with ActiveMQ
Oracle database, but other DBs possible
iBATIS Java library for persistence
Wrapper around Ehcache for the server cache
Terracotta for the clustered setup
19. C2MON
C2MON
3-tier architecture
DAQ layer for a number of protocols/equipment
Core designed to run in a clustered setup
C2MON Client API for client communication
20. C2MON
DAQ
One DAQ per ’type of check’ (Equipment)
Many DAQs available (SNMP, SSH, CLIC...)
New checks using an existing DAQ can be added on the fly (metrics)
Reacts to configuration changes (restart/on the fly)
Filtering capabilities
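The filtering a DAQ performs can be pictured as a simple value deadband: only significant changes are forwarded to the server. The sketch below is an illustration of the idea, with a hypothetical class and threshold; it is not C2MON's actual filtering code.

```python
# Illustrative deadband filter, as a DAQ layer might use to suppress
# insignificant metric updates (hypothetical, not C2MON's implementation).

class DeadbandFilter:
    """Forward a value only if it differs from the last forwarded
    value by more than `deadband`."""

    def __init__(self, deadband):
        self.deadband = deadband
        self.last = None  # last value that was forwarded

    def accept(self, value):
        if self.last is None or abs(value - self.last) > self.deadband:
            self.last = value
            return True   # significant change: forward to the server
        return False      # within the deadband: filter out

f = DeadbandFilter(deadband=0.5)
readings = [20.0, 20.1, 20.4, 21.0, 21.2, 25.0]
forwarded = [r for r in readings if f.accept(r)]
# only 20.0, 21.0 and 25.0 get through
```

A real DAQ would combine this with time-based and repeated-value filtering, but the principle of reducing the data stream at the source is the same.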
21. C2MON
Core
Modular architecture (custom plugins)
Clustered or stand-alone configuration
New servers added/removed on the fly
Zero-downtime operation possible
Cached database backend
Client
Based on a common API
Many views provided (video stream, web...)
Replay functionality
22. Why do we need a computing infrastructure?
A few numbers
LHC crossing rate of
40 MHz
600 million events per
second
kB < event size < MB
⇒ Need to filter the events
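The need to filter follows directly from the numbers above. A back-of-envelope calculation using the slide's figures (600 million events per second, event sizes between roughly 1 kB and 1 MB):

```python
# Back-of-envelope arithmetic from the slide's numbers: the raw data
# rate is far beyond what could be recorded, hence the event filter.

EVENTS_PER_SECOND = 600e6
EVENT_SIZE_MIN = 1e3   # ~1 kB, lower bound from the slide
EVENT_SIZE_MAX = 1e6   # ~1 MB, upper bound from the slide

raw_rate_min = EVENTS_PER_SECOND * EVENT_SIZE_MIN  # bytes per second
raw_rate_max = EVENTS_PER_SECOND * EVENT_SIZE_MAX  # bytes per second

print(f"raw rate: {raw_rate_min/1e9:.0f} GB/s to {raw_rate_max/1e12:.0f} TB/s")
# hundreds of GB/s to hundreds of TB/s of raw data
```

Even the lower bound, hundreds of gigabytes per second, dwarfs what any storage system could absorb continuously, so only a small fraction of events can be kept.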
23. LHCb
Environment
2k servers
200 switches
400 embedded processors
Linux (≈ 90%) & Windows
Farm
Diskless servers for processing: farm nodes
1 control server for ≈ 30 nodes (PXE boot, DHCP, NFS...)
≈ 60 control servers
24. LHCb
Setup
Icinga 1.7.2
ido2db with MySQL
mod_gearman 1.3.8
Setup details
60 mod_gearman workers
NRPE, NSClient++, SNMP
A few custom checks (GPFS, filesystem speed,...)
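A custom check like the ones mentioned above follows the usual Nagios/Icinga plugin convention used with NRPE: print one status line (optionally with performance data after `|`) and exit with 0 (OK), 1 (WARNING) or 2 (CRITICAL). The usage metric and thresholds below are hypothetical, not one of LHCb's actual checks.

```python
# Minimal sketch of a Nagios/Icinga-style check plugin: status line plus
# perfdata, and the standard 0/1/2 return codes. The checked quantity
# and thresholds are hypothetical examples.

OK, WARNING, CRITICAL = 0, 1, 2

def check_usage(percent_used, warn=80, crit=90):
    """Return (exit_code, status_line) in plugin convention."""
    perfdata = f"usage={percent_used}%;{warn};{crit}"
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL - usage {percent_used}% | {perfdata}"
    if percent_used >= warn:
        return WARNING, f"WARNING - usage {percent_used}% | {perfdata}"
    return OK, f"OK - usage {percent_used}% | {perfdata}"

code, message = check_usage(percent_used=85)
print(message)
# a real NRPE plugin would finish with: sys.exit(code)
```

Because both NRPE and mod_gearman workers only look at the output line and the exit code, a check written this way works unchanged in either transport.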
25. ATLAS
Environment
3k hosts with up to 40 checks
Similar environment as LHCb
No system monitoring of the frontend boards
Current setup (system monitoring only)
80 independent, customized Nagios 2.5 instances
Single MySQL cluster (NDOUtils) and RRD storage (NAS)
Custom web interface for the overall status
NRPE, IPMI sensors, SNMP
1 year of DB: 8.5 GB
1 year of RRD: 18 GB
26. ATLAS
Requirements
Keep (some) configuration compatibility
Keep using the custom ConfDB for config generation
Prospects
Ganglia
Icinga + mod gearman
27. CMS
Environment
3k hosts
170 switches
No diskless machines
Current setup (system monitoring only)
1 central Icinga instance
1 gearman worker
1 PNP4Nagios server
NRPE, check_multi
90k checks every 2 minutes
JSON output used by custom scripts
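Consuming JSON check output in a custom script might look like the sketch below. The record format here is hypothetical; it only illustrates the point of machine-readable output: aggregating many check results programmatically instead of scraping text.

```python
# Sketch of a custom script consuming JSON-formatted check results and
# reducing them to a worst-state summary. The record layout is a
# hypothetical example, not CMS's actual format.

import json

raw = """[
  {"host": "node001", "check": "load", "state": 0},
  {"host": "node001", "check": "disk", "state": 1},
  {"host": "node002", "check": "load", "state": 0}
]"""

results = json.loads(raw)
worst = max(r["state"] for r in results)          # 0=OK, 1=WARNING, 2=CRITICAL
failing = [r for r in results if r["state"] > 0]  # everything not OK

print(f"worst state: {worst}, failing checks: {len(failing)}")
```

With 90k checks every 2 minutes, this kind of programmatic aggregation is what keeps a single central instance usable.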
34. LEMON
LHC Era Monitoring
Part of the ELFms tool suite (Extremely Large Fabric
management system)
For Linux systems
Server/client based
Monitoring agent on each monitored node
Server centralizes all the data (push/pull)
Lemon-web, lemon-cli
Alarm system
Rule engine
35. LEMON
Client
Agent: starts and configures sensors
Sensor: implements metric classes
Metric class: a given measurement (e.g. CPU load)
Metric instance: a given measurement in a given configuration
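The class/instance distinction above can be sketched as follows: one metric class describes a kind of measurement, and each instance binds it to a concrete configuration. The names are illustrative only, not LEMON's actual API.

```python
# Illustrative sketch of LEMON's metric class vs. metric instance
# distinction; class and attribute names are hypothetical.

class MetricClass:
    """A kind of measurement, e.g. 'CPU load' or 'disk usage'."""
    def __init__(self, name, read):
        self.name = name
        self.read = read  # callable producing one sample for a config

class MetricInstance:
    """The same measurement bound to one concrete configuration."""
    def __init__(self, metric_class, config):
        self.metric_class = metric_class
        self.config = config

    def sample(self):
        return self.metric_class.read(self.config)

# one metric class, two instances with different configurations
disk_usage = MetricClass(
    "disk.usage",
    read=lambda cfg: {"path": cfg["path"], "used_pct": 42},  # stub reader
)
root_fs = MetricInstance(disk_usage, {"path": "/"})
data_fs = MetricInstance(disk_usage, {"path": "/data"})
```

This separation is what lets 250 metric classes fan out into 1k unique metrics across 11k monitored entities: the measurement logic is written once per class.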
37. LEMON
LEMON in the computer center
11k entities monitored
250 metric classes
1k unique metrics
1.7 million distinct pieces of information
1 year of DB: 6 TB
8 years of RRD files: 45 GB
38. WLCG
WLCG
Worldwide LHC Computing Grid
35 countries over the globe
170 sites
Very heterogeneous environment
39. Service Availability Monitoring
Fully distributed monitoring framework
Based on open source systems
High scalability
Advanced notification and reporting system
Web interface and REST API
43. Agile Project
Why
CERN data center is reaching its limits
Custom tools are high-maintenance
2015: 15k servers, 300k VMs needed
Solution: adopt standard tools
45. Atlas Automated and Intelligent Assistant
Assistant
Based on Esper
Complex Event Processing
Pattern recognition
Time-based event correlation
SQL-like language
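A time-based correlation rule of the kind Esper expresses declaratively in its SQL-like EPL can be sketched imperatively. The hand-rolled sliding window below is only an illustration of the concept, with hypothetical event shapes and thresholds; real Esper rules are written as EPL statements, not Python.

```python
# Illustrative sliding-window correlation: alert when several error
# events from the same host cluster in a short time window. Event
# format and thresholds are hypothetical.

from collections import deque

def correlate(events, window=10.0, threshold=3):
    """Events are (timestamp, host, kind) tuples. Return (host, ts)
    pairs where >= `threshold` errors from one host fall within
    `window` seconds."""
    recent = {}   # host -> deque of recent error timestamps
    alerts = []
    for ts, host, kind in sorted(events):
        if kind != "error":
            continue
        q = recent.setdefault(host, deque())
        q.append(ts)
        while q and ts - q[0] > window:
            q.popleft()  # drop errors that left the time window
        if len(q) >= threshold:
            alerts.append((host, ts))
    return alerts

events = [(0, "node1", "error"), (2, "node1", "error"), (4, "node2", "ok"),
          (5, "node1", "error"), (30, "node1", "error")]
# three node1 errors within 10 s trigger an alert at t=5; the isolated
# error at t=30 does not
```

A CEP engine adds exactly this kind of windowing and pattern matching as a declarative query language, so operators describe the pattern instead of coding the bookkeeping.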
46. LHCb expert system
Caution
Still experimental
Phronesis
Linux systems only
Diagnostics &
Recovery
Automatic
dependency
discovery
Reinforcement
learning
Experience sharing
48. Conclusion
To conclude...
Only a small fraction of the monitoring at CERN was presented
Many aspects were not addressed
A single tool cannot do everything
Aggregate tools and results
Use the results
More smartness
Participate in the community
50. Acknowledgment
Many Thanks to...
Pedro Andrade, Sergio Ballestrero, Alastair Bland, Felix Ehm, Ivan
Fedorko, Thorsten Kollegger, Wojciech Lapka, Olivier Raginel,
Adriana Telesca and Falco Vennedey for the interesting discussions
and materials.
Jörg Wenninger for the picture from the Jura.
51. Sources
Sources
Alice HLT monitoring:
http://iopscience.iop.org/1742-6596/331/5/052003/
http://cdsweb.cern.ch/record/1454269
Atlas monitoring:
http://cdsweb.cern.ch/record/1455464
http://cdsweb.cern.ch/record/1450129
C2MON :
http://wikis.cern.ch/display/C2MON/C2MON+Home
CERN Agile:
https://twiki.cern.ch/twiki/pub/Main/TimBellPresentationList/20120524 CERN Data Centre Evolution v2.ppt
CERN experiments:
http://public.web.cern.ch/public/en/lhc/LHCExperimentsen.html
CERN IT department:
http://information-technology.web.cern.ch/
CMS system monitoring:
CHEP 2012 proceedings : Health And Performance Monitoring Of The Online Computer Cluster Of CMS,
G. Bauer
Esper:
http://esper.codehaus.org/
Ganglia:
http://ganglia.sourceforge.net/
Icinga:
https://www.icinga.org/