Monitoring at CERN
Christophe Haen
christophe.haen@cern.ch
Université Blaise Pascal, CLERMONT-FERRAND cedex, FR
LHCb Online team (CERN), GENEVA, CH
17th October 2012
Disclaimer
This presentation covers only a very small fraction of the
monitoring done at CERN
It is not intended to be a benchmark or a comparison of the
solutions
I am not an expert in all of the software
Plan
1 CERN
Overview
The LHC project
Global picture
2 Monitoring at CERN
LHC
Experiments
LHCb
ATLAS
CMS
Alice
IT
Grid
3 Outlook
4 Conclusion
What people know about CERN
Glass Cathedral
World destroyer machines
Crazy physicists
Monitoring at CERN 1 C. Haen
What CERN really is
CERN in a nutshell
European Organization for Nuclear Research
Founded in 1954, near Geneva, Switzerland
Research, technology, collaboration and education
Provides infrastructures needed for high-energy physics (HEP)
research
One of the world’s largest and most respected centers for
scientific research
LHC
One Ring to rule them all
World’s largest and highest-energy particle accelerator
27 kilometers in circumference, ≈ 100 m underground, 14 TeV, 11k turns per second, 40 MHz collision rate, -271 °C...
Answer some of the fundamental open questions in physics
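The "11k turns per second" figure can be cross-checked from the circumference alone. The sketch below is only a back-of-the-envelope check and assumes the protons travel at essentially the speed of light and that the ring is exactly 27 km long (the rounded slide value); the function name is illustrative.

```python
# Back-of-the-envelope check of the slide's "11k turns per second".
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
RING_CIRCUMFERENCE_M = 27_000         # rounded slide value, an assumption


def turns_per_second(circumference_m: float,
                     speed_m_per_s: float = SPEED_OF_LIGHT_M_PER_S) -> float:
    """Revolutions per second for a particle moving at (nearly) light speed."""
    return speed_m_per_s / circumference_m


if __name__ == "__main__":
    print(f"{turns_per_second(RING_CIRCUMFERENCE_M):,.0f} turns per second")
```

With these inputs the result lands a little above 11,000 turns per second, which agrees with the rounded number on the slide.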
Four main experiments
Alice
A Large Ion Collider Experiment
Quark-gluon plasma
26m long, 16m high, 16m wide, 10k tonnes
Atlas
A Toroidal LHC ApparatuS
General-purpose detector : Higgs boson, extra dimensions, dark matter
46 m long, 25 m high and 25 m wide, 7k tonnes
CMS
Compact Muon Solenoid
General-purpose detector : same goal as Atlas but different
technical solutions
21 m long, 15 m wide and 15 m high, 12 500 tonnes
LHCb
Large Hadron Collider beauty experiment
Studies the origin of the difference between matter and
anti-matter
21m long, 10m high and 13m wide, 5600 tonnes
Worldwide LHC Computing Grid
Grid
Experiments produce 25 petabytes of data per year
100k processors, 170 sites, 36 countries
Global picture
In brief
LHC produces data
Experiments record data
Grid analyzes data
CERN operates all this
Monitoring!
All this requires IT infrastructures, and thus monitoring
CERN accelerator’s controls infrastructure
Control the proper behavior of the accelerators
Large geographical distances
Big diversity of equipment
Equipment
1500 Servers
PLC (Programmable logic controller)
300 fan trays
VME crates
Mostly Linux, including real time
Diamon
Motivations
Provide the accelerator operators with precise and easy to use
tools to monitor the behavior of the Controls Infrastructure
Allow easy access to diagnostic tools that provide more
details and help to solve potential problems
Requirements
Very low footprint on the client
Highly configurable interface
C2MON
Requirements
Stick to proven technologies
Open-source resources when possible (always basic
open-source fall-back option)
Choices
Java Spring
JMS middleware with ActiveMQ
Oracle database, but other databases are possible
iBatis Java library for persistence
Wrapper of Ehcache for the server cache
Terracotta for cluster setup
C2MON
3-tier architecture
DAQ for a number of protocols/equipment
Core designed to run in a clustered setup
C2MON Client API for client communication
DAQ
One DAQ per ’type of check’ (Equipment)
Many DAQ available (SNMP, SSH, CLIC...)
New checks (metrics) using an existing DAQ can be added on the fly
Reacts to configuration changes (restart or on the fly)
Filtering capabilities
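The slide does not say which filters the DAQ layer offers. As an illustration only, one common kind of acquisition-side filter is a value deadband: an update is forwarded to the server tier only when it differs enough from the last forwarded value. The sketch below is plain Python, not C2MON code; the class name, the tag names and the threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DeadbandFilter:
    """Forward a value only if it moved by more than `deadband`
    since the last value that was forwarded for the same tag."""
    deadband: float
    _last_sent: Dict[str, float] = field(default_factory=dict)

    def should_forward(self, tag: str, value: float) -> bool:
        last = self._last_sent.get(tag)
        if last is None or abs(value - last) > self.deadband:
            self._last_sent[tag] = value  # remember what the server last saw
            return True
        return False  # change too small: dropped at the DAQ level


if __name__ == "__main__":
    flt = DeadbandFilter(deadband=0.5)
    for reading in (10.0, 10.2, 10.6, 10.9):
        print(reading, flt.should_forward("crate.temperature", reading))
```

Filtering at the acquisition layer keeps small fluctuations from ever reaching the messaging middleware, which is one plausible reason to place it in the DAQ rather than in the core.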
Core
Modular architecture (custom plugins)
Cluster or stand-alone configuration
New server added/removed on the fly
Zero downtime possible
Database backend cached
Client
Based on a common API
Many views provided (video stream, web...)
Replay functionality
Why do we need a computer infrastructure
A few numbers
LHC crossing rate of 40 MHz
600 million events per second
kb < Event size < Mb
⇒ Need to filter the events
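To see why filtering is unavoidable, the numbers above can be turned into a raw data rate. This is a rough sketch: the slide's "kb" and "Mb" are read here as kilobytes and megabytes (decimal), which is an assumption, and the function names are illustrative.

```python
EVENTS_PER_SECOND = 600_000_000   # slide value
CROSSING_RATE_HZ = 40_000_000     # slide value (40 MHz)


def events_per_crossing(events_per_s: float = EVENTS_PER_SECOND,
                        crossing_rate_hz: float = CROSSING_RATE_HZ) -> float:
    """Average number of events per bunch crossing."""
    return events_per_s / crossing_rate_hz


def raw_rate_bytes_per_s(event_size_bytes: int,
                         events_per_s: int = EVENTS_PER_SECOND) -> int:
    """Unfiltered data rate if every event were recorded."""
    return event_size_bytes * events_per_s


if __name__ == "__main__":
    print(f"{events_per_crossing():.0f} events per crossing")
    # Assumed bounds on the event size: 1 kB and 1 MB.
    for size in (1_000, 1_000_000):
        rate_gb = raw_rate_bytes_per_s(size) / 1e9
        print(f"{size:>9,} B/event -> {rate_gb:,.0f} GB/s")
```

Under these assumptions the unfiltered stream would sit between roughly 600 GB/s and 600 TB/s, far beyond what can be written to storage.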
LHCb
Environment
2k servers
200 switches
400 embedded processors
Linux (≈ 90%) & Windows
Farm
Diskless server for processing : farm node
1 control server for ≈ 30 nodes (PXE boot, DHCP, NFS...)
≈ 60 control servers
Setup
Icinga 1.7.2
Ido2db with MySQL
mod_gearman 1.3.8
Setup details
60 mod_gearman workers
NRPE, NSClient++, SNMP
Few custom checks (GPFS, filesystem speed,...)
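The custom checks themselves are not shown in the talk. As a minimal sketch of what such a check could look like, the script below follows the usual Nagios/Icinga plugin convention: one line of output, optional performance data after a `|`, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The disk-usage metric, the thresholds and the function names are assumptions chosen for illustration; they are not the LHCb GPFS or filesystem-speed checks.

```python
import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a usage percentage onto a plugin status code."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0):
    """Return (exit_code, output_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    status = evaluate(used_pct, warn_pct, crit_pct)
    # Text before '|' is for humans, text after it is performance data.
    line = (f"DISK {LABELS[status]} - {path} is {used_pct:.1f}% full"
            f" | used_pct={used_pct:.1f}%;{warn_pct};{crit_pct};0;100")
    return status, line


if __name__ == "__main__":
    code, text = check_disk()
    print(text)
    sys.exit(code)
```

A script of this shape can be run locally through an agent such as NRPE, since the scheduler only looks at the exit code and the printed line.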
ATLAS
Environment
3k hosts with up to 40 checks
Environment similar to LHCb
No system monitoring of the frontend boards
Current Setup (System monitoring only)
80 independent, customized Nagios 2.5 instances
Single MySQL cluster (ndoutils) and RRD storage (NAS)
Custom web interface for overall status
NRPE, IPMI sensors, SNMP
1 year DB : 8.5Gb
1 year RRD : 18 Gb
Requirements
Keep (some) configuration compatibility
Keep using the custom ConfDB for config generation
Prospects
Ganglia
Icinga + mod_gearman
CMS
Environment
3k hosts
170 switches
No diskless machine
Current Setup (System monitoring only)
1 central Icinga instance
1 gearman worker
1 PNP4Nagios server
NRPE, check_multi
90k checks every 2 minutes
Use of JSON output for custom scripts
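The load on the single Icinga instance follows directly from the figures above. The sketch below just does the arithmetic with the slide values (90k checks, a 2-minute period, 3k hosts); it assumes the checks are spread evenly, which the talk does not state.

```python
def checks_per_second(total_checks: int, period_s: float) -> float:
    """Average check rate the scheduler must sustain."""
    return total_checks / period_s


def checks_per_host(total_checks: int, hosts: int) -> float:
    """Average number of checks configured per host."""
    return total_checks / hosts


if __name__ == "__main__":
    print(checks_per_second(90_000, 2 * 60), "checks per second")
    print(checks_per_host(90_000, 3_000), "checks per host")
```

That is an average of 750 check results per second and about 30 checks per host, assuming an even spread.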
Alice HLT
Environment
220 servers
63 switches
FPGAs
GPUs
Monitoring
Ganglia
SysMES (home-made system)
Ganglia
Distributed monitoring system
Groups of clusters
Unicast or multicast communication
RRDtool
gmond, gmetad, ganglia-web
SysMES
Multi-layered, scalable, decentralized, fault-tolerant, dynamic framework
Industry standards : XML, EJB, JBoss, CIM...
Inventory module
Monitoring
Rule based tool set
IT
IT department
Databases, virtual machines, storage, mail, web, etc.
Linux, Windows, Mac
9k servers : 17k NICs, 70k cores, 70k disks, 205 Tb of memory
2600+ switches
3.5 MW facility
LEMON
LHC Era Monitoring
Part of the ELFms tool suite (Extremely Large Fabric
management system)
For Linux systems
Server/client based
Monitoring agent on each monitored node
Server centralizes all the data (push/pull)
Lemon-web, lemon-cli
Alarm system
Rule engine
Client
Agent : starts and configures sensors
Sensor : implements metric classes
Metric class : a given measure (e.g. CPU load)
Metric instance : a given measure in a given configuration
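The agent / sensor / metric class / metric instance vocabulary can be made concrete with a small data model. This is only a sketch of the relationships described above, written in plain Python; it is not LEMON code, and every class, method and parameter name (including the stand-in `fake_cpu_load` measurement) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class MetricClass:
    """A kind of measurement, e.g. 'CPU load'."""
    name: str
    measure: Callable[..., float]


@dataclass
class MetricInstance:
    """A metric class bound to one concrete configuration."""
    metric_id: int
    metric_class: MetricClass
    config: Dict[str, Any] = field(default_factory=dict)

    def sample(self) -> float:
        return self.metric_class.measure(**self.config)


@dataclass
class Sensor:
    """A sensor implements one or more metric classes."""
    name: str
    metric_classes: Dict[str, MetricClass] = field(default_factory=dict)

    def implement(self, metric_class: MetricClass) -> None:
        self.metric_classes[metric_class.name] = metric_class


class Agent:
    """The agent starts sensors and configures metric instances on them."""

    def __init__(self) -> None:
        self.sensors: Dict[str, Sensor] = {}
        self.instances: List[MetricInstance] = []

    def start_sensor(self, sensor: Sensor) -> None:
        self.sensors[sensor.name] = sensor

    def configure(self, metric_id: int, sensor: str, metric_class: str,
                  **config: Any) -> MetricInstance:
        cls = self.sensors[sensor].metric_classes[metric_class]
        instance = MetricInstance(metric_id, cls, config)
        self.instances.append(instance)
        return instance

    def collect(self) -> Dict[int, float]:
        """Sample every configured metric instance once."""
        return {i.metric_id: i.sample() for i in self.instances}


def fake_cpu_load(cpu: int) -> float:
    """Stand-in measurement so the example is deterministic."""
    return 0.25 * (cpu + 1)


if __name__ == "__main__":
    sensor = Sensor("system")
    sensor.implement(MetricClass("cpu_load", fake_cpu_load))
    agent = Agent()
    agent.start_sensor(sensor)
    agent.configure(1, "system", "cpu_load", cpu=0)
    agent.configure(2, "system", "cpu_load", cpu=1)
    print(agent.collect())
```

Here one metric class ("cpu_load") yields two metric instances, one per CPU, which mirrors the class versus instance distinction on the slide.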
LEMON in the computer center
11k entities monitored
250 metric classes
1k unique metrics
1.7 million distinct pieces of information
1 year DB : 6Tb
8 years of RRD files : 45 Gb
WLCG
Worldwide LHC Computing Grid
35 countries around the globe
170 sites
Very heterogeneous environment
Service Availability Monitoring
Fully distributed monitoring framework
Based on open source systems
High scalability
Advanced notification and reporting system
Web interface and REST API
Multiple components
Aggregated topology database
Profile database
Result database
Messaging : ActiveMQ
Probes
Monitoring : Nagios
Setup
45 Nagios instances
800k records per day
6 months of history : 800 Gb
Agile Project
Why
CERN data center is reaching its limits
Custom tools are high maintenance
2015 : 15k servers, 300k VMs needed
Solution : move to standard tools
Monitoring
>30 monitoring applications
40k producers
280Gb per day
Technologies
Producers : e.g. LEMON
Aggregation : Apollo
Processing : HBase
Future architecture
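The volume figures above translate into fairly modest averages per producer. The sketch below does that arithmetic; it reads the slide's "280Gb" as 280 gigabytes (decimal) and assumes the load is spread evenly over producers and over the day, neither of which the talk states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_producer_per_day(total_bytes_per_day: float, producers: int) -> float:
    """Average daily volume emitted by one producer."""
    return total_bytes_per_day / producers


def average_rate_bytes_per_s(total_bytes_per_day: float) -> float:
    """Average ingest rate over a whole day."""
    return total_bytes_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    daily = 280 * 10**9  # assumed: 280 GB per day, decimal gigabytes
    print(bytes_per_producer_per_day(daily, 40_000) / 1e6, "MB per producer per day")
    print(round(average_rate_bytes_per_s(daily) / 1e6, 2), "MB/s on average")
```

Under these assumptions each producer averages about 7 MB per day, and the aggregation layer sees a little over 3 MB/s on average; peaks may of course be much higher.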
Atlas Automated and Intelligent Assistant
Assistant
Based on Esper
Complex Event Processing
Pattern recognition
Time-based event correlation
SQL-like language
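To give a feel for time-based event correlation, the sketch below implements one simple rule in plain Python: raise an alert when events of a given kind arrive from at least `min_hosts` distinct hosts within a sliding time window. It is not Esper and not the ATLAS assistant; the rule, the class names and the event fields are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, NamedTuple, Optional, Set


class Event(NamedTuple):
    timestamp: float  # seconds
    host: str
    kind: str


class WindowCorrelator:
    """Alert when `min_hosts` distinct hosts report `kind` within `window_s`."""

    def __init__(self, kind: str, window_s: float, min_hosts: int) -> None:
        self.kind = kind
        self.window_s = window_s
        self.min_hosts = min_hosts
        self._events: Deque[Event] = deque()

    def feed(self, event: Event) -> Optional[Set[str]]:
        """Return the set of affected hosts if the rule fires, else None."""
        if event.kind != self.kind:
            return None
        self._events.append(event)
        # Slide the window: drop events older than window_s.
        while event.timestamp - self._events[0].timestamp > self.window_s:
            self._events.popleft()
        hosts = {e.host for e in self._events}
        return hosts if len(hosts) >= self.min_hosts else None


if __name__ == "__main__":
    rule = WindowCorrelator("disk_error", window_s=60, min_hosts=3)
    stream = [
        Event(0, "node-a", "disk_error"),
        Event(10, "node-b", "disk_error"),
        Event(100, "node-c", "disk_error"),
        Event(110, "node-d", "disk_error"),
        Event(120, "node-e", "disk_error"),
    ]
    for ev in stream:
        print(ev.timestamp, rule.feed(ev))
```

A CEP engine lets such rules be declared in a query language instead of hand-written loops, which is the appeal of the SQL-like approach mentioned above.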
LHCb expert system
Caution
Still experimental
Phronesis
Linux systems only
Diagnostics & Recovery
Automatic dependency discovery
Reinforcement learning
Experience sharing
[Figures: experience sharing; learning speed comparison]
Conclusion
To conclude...
Only a small fraction of monitoring at CERN presented
Many aspects were not addressed
A single tool cannot do everything
Aggregate tools and results
Use the results
Make monitoring smarter
Participate in the community
Questions
Acknowledgment
Many Thanks to...
Pedro Andrade, Sergio Ballestrero, Alastair Bland, Felix Ehm, Ivan
Fedorko, Thorsten Kollegger, Wojciech Lapka, Olivier Raginel,
Adriana Telesca and Falco Vennedey for the interesting discussions
and materials.
Jorg Wenninger for the picture from the Jura.
Sources
Alice HLT monitoring:
http://iopscience.iop.org/1742-6596/331/5/052003/
http://cdsweb.cern.ch/record/1454269
Atlas monitoring:
http://cdsweb.cern.ch/record/1455464
http://cdsweb.cern.ch/record/1450129
C2MON :
http://wikis.cern.ch/display/C2MON/C2MON+Home
Cern Agile:
https://twiki.cern.ch/twiki/pub/Main/TimBellPresentationList/20120524 CERN Data Centre Evolution v2.ppt
Cern experiments:
http://public.web.cern.ch/public/en/lhc/LHCExperimentsen.html
Cern IT department:
http://information-technology.web.cern.ch/
CMS system monitoring:
CHEP 2012 proceedings : Health And Performance Monitoring Of The Online Computer Cluster Of CMS,
G. Bauer
Esper:
http://esper.codehaus.org/
Ganglia:
http://ganglia.sourceforge.net/
Icinga:
https://www.icinga.org/
LEMON:
http://lemon-monitoring.web.cern.ch/
nagios:
http://www.nagios.org/
Phronesis:
CHEP 2012 proceedings : Artificial intelligence in the service of system administrators, C. Haen
SAM :
http://tomtools.cern.ch/confluence/display/SAMWEB/Home
http://gridmonitoring.cern.ch/mywlcg
SysMES:
http://wiki.kip.uni-heidelberg.de/ti/SysMES/index.php/Main Page
Project Based Learning (A.I).pptx detail explanation
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 

OSMC 2012 | Monitoring at CERN by Christophe Haen

  • 1. Monitoring at CERN. Christophe Haen (christophe.haen@cern.ch); Université Blaise Pascal, CLERMONT-FERRAND cedex, FR; LHCb Online team (CERN), GENEVA, CH. 17th October 2012.
  • 2. Disclaimer: This presentation covers only a very small fraction of the monitoring done at CERN. It is not intended to be a benchmark or a comparison of the solutions. I am not an expert in all the software.
  • 3. Plan: 1. CERN (Overview; The LHC project; Global picture). 2. Monitoring at CERN (LHC; Experiments: LHCb, ATLAS, CMS, Alice; IT; Grid). 3. Outlooks. 4. Conclusion.
  • 4. What people know about CERN: glass cathedral; world-destroyer machines; crazy physicists.
  • 5. What CERN really is. CERN in a nutshell: European Organization for Nuclear Research; founded in 1954, near Geneva, Switzerland; research, technology, collaboration and education; provides the infrastructure needed for high-energy physics (HEP) research; one of the world's largest and most respected centers for scientific research.
  • 6. What CERN really is
  • 7. LHC, One Ring to rule them all: world's largest and highest-energy particle accelerator; 27 kilometers in circumference, ≈ 100 m underground, 14 TeV, 11k turns per second, 40 MHz collision rate, -271 °C...; aims to answer some of the fundamental open questions in physics.
  • 10. Alice (A Large Ion Collider Experiment): quark-gluon plasma; 26 m long, 16 m high, 16 m wide, 10k tonnes.
  • 11. Atlas (A Toroidal LHC ApparatuS): general-purpose detector (Higgs boson, extra dimensions, dark matter); 46 m long, 25 m high and 25 m wide, 7k tonnes.
  • 12. CMS (Compact Muon Solenoid): general-purpose detector with the same goal as Atlas but different technical solutions; 21 m long, 15 m wide and 15 m high, 12 500 tonnes.
  • 13. LHCb (Large Hadron Collider beauty experiment): studies the origin of the difference between matter and anti-matter; 21 m long, 10 m high and 13 m wide, 5600 tonnes.
  • 14. Worldwide LHC Computing Grid: experiments produce 25 petabytes of data per year; 100k processors, 170 sites, 36 countries.
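To put the 25 petabytes per year in perspective, the figure can be converted into an average sustained rate. This is only a back-of-the-envelope sketch: it assumes decimal petabytes and a 365-day year, neither of which is stated on the slide, and the function name is illustrative.

```python
PETABYTE = 10**15                   # assumption: decimal petabytes
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def average_rate_bytes_per_s(petabytes_per_year: float) -> float:
    """Average sustained rate, in bytes per second, for a yearly data volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR


rate = average_rate_bytes_per_s(25)
print(f"{rate / 1e6:.0f} MB/s on average")
```

Real traffic is bursty, so peak rates would be higher than this average.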
  • 15. Global picture. In brief: LHC produces data; experiments record data; Grid analyzes data; CERN operates all this. Monitoring! All this requires IT infrastructure, and thus monitoring.
  • 16. CERN accelerator controls infrastructure: controls the proper behavior of the accelerators; large geographical distances; big diversity of equipment. Equipment: 1500 servers; PLC (programmable logic controller); 300 fan trays; VME crates; mostly Linux, including real time.
  • 17. Diamon. Motivations: provide the accelerator operators with precise and easy-to-use tools to monitor the behavior of the controls infrastructure; allow easy access to diagnostic tools that provide more details and help to solve a possible problem. Requirements: very low footprint on the client; highly configurable interface.
  • 18. C2MON. Requirements: stick to proven technologies; open-source resources when possible (always a basic open-source fall-back option). Choices: Java Spring; JMS middleware with ActiveMQ; Oracle database, but other DBs possible; iBatis Java library for persistence; wrapper of Ehcache for the server cache; Terracotta for cluster setup.
  • 19. C2MON 3-tier architecture: DAQ for a number of protocols/equipment; core designed to run in a clustered setup; C2MON Client API for client communication.
  • 20. C2MON DAQ: one DAQ per "type of check" (equipment); many DAQs available (SNMP, SSH, CLIC...); a new check using an existing DAQ can be added on the fly (metric); reacts to configuration changes (restart/on the fly); filtering capabilities.
  • 21. C2MON. Core: modular architecture (custom plugins); cluster or stand-alone configuration; servers added/removed on the fly; zero downtime possible; cached database backend. Client: based on a common API; many views provided (video stream, web...); replay functionality.
  • 22. Why do we need a computer infrastructure? A few numbers: LHC crossing rate of 40 MHz; 600 million events per second; kB < event size < MB ⇒ need to filter the events.
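The need for filtering follows from multiplying the event rate by the event size. The sketch below assumes event sizes of exactly 1 kB and 1 MB as the two ends of the range (the slide only gives orders of magnitude); the function name is illustrative.

```python
EVENTS_PER_SECOND = 600e6  # 600 million events per second, from the slide


def raw_throughput(event_size_bytes: float) -> float:
    """Bytes per second that would be produced if every event were recorded."""
    return EVENTS_PER_SECOND * event_size_bytes


low = raw_throughput(1e3)   # assumption: 1 kB per event
high = raw_throughput(1e6)  # assumption: 1 MB per event
print(f"{low / 1e9:.0f} GB/s to {high / 1e12:.0f} TB/s without filtering")
```

Under these assumptions the unfiltered stream would be in the range of hundreds of gigabytes to hundreds of terabytes per second, which is why events have to be filtered before being recorded.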
  • 23. LHCb. Environment: 2k servers; 200 switches; 400 embedded processors; Linux (90%) & Windows. Farm: diskless servers for processing (farm nodes); 1 control server for ≈ 30 nodes (PXE boot, DHCP, NFS...); ≈ 60 control servers.
  • 24. LHCb. Setup: Icinga 1.7.2; ido2db with MySQL; mod_gearman 1.3.8. Setup details: 60 mod_gearman workers; NRPE, NSClient++, SNMP; a few custom checks (GPFS, filesystem speed, ...).
  • 25. ATLAS. Environment: 3k hosts with up to 40 checks; similar environment to LHCb; no system monitoring of the frontend boards. Current setup (system monitoring only): 80 independent, customized Nagios 2.5 instances; single MySQL cluster (NDOUtils) and RRD storage (NAS); custom web interface for overall status; NRPE, IPMI sensors, SNMP; 1 year of DB: 8.5 GB; 1 year of RRD: 18 GB.
  • 26. ATLAS. Requirements: keep (some) configuration compatibility; keep using the custom ConfDB for config generation. Prospects: Ganglia; Icinga + mod_gearman.
  • 27. CMS. Environment: 3k hosts; 170 switches; no diskless machines. Current setup (system monitoring only): 1 central Icinga instance; 1 gearman worker; 1 PNP4Nagios server; NRPE, check_multi; 90k checks every 2 minutes; use of JSON output for custom scripts.
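As an illustration of a custom check script with JSON output, here is a minimal sketch. The exit codes 0 to 3 are the standard Nagios/Icinga plugin states; the JSON layout, the thresholds and the function names are assumptions made for this example and are not taken from the CMS setup.

```python
import json

# Standard Nagios/Icinga plugin states (used as process exit codes).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_load(load: float, warn: float, crit: float) -> int:
    """Map a measured load value onto a plugin state using two thresholds."""
    if load >= crit:
        return CRITICAL
    if load >= warn:
        return WARNING
    return OK


def render(check: str, state: int, value: float) -> str:
    """Serialize the result as JSON (hypothetical layout, for illustration only)."""
    return json.dumps(
        {"check": check, "state": STATE_NAMES[state], "value": value},
        sort_keys=True,
    )


state = evaluate_load(2.5, warn=4.0, crit=8.0)
print(render("load", state, 2.5))
```

In a real plugin the script would end by exiting with `state` as its exit code, so that the monitoring core can pick up the result.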
  • 28. Alice HLT. Environment: 220 servers; 63 switches; FPGAs; GPUs. Monitoring: Ganglia; SysMES (home-made system).
  • 29. Ganglia: distributed monitoring system; groups of clusters; unicast or multicast communication; RRDtool; gmond, gmetad, ganglia-web.
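The round-robin storage idea behind RRDtool can be sketched with a fixed-size buffer: once all slots are used, each new sample overwrites the oldest one, so the archive never grows. This is only a conceptual illustration with a hypothetical class; it is not the RRDtool API, and it leaves out the consolidation of samples that RRDtool also performs.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store: a new sample evicts the oldest once the archive is full."""

    def __init__(self, slots: int) -> None:
        self._values = deque(maxlen=slots)

    def update(self, value: float) -> None:
        self._values.append(value)

    def values(self) -> list:
        return list(self._values)


rra = RoundRobinArchive(slots=3)
for sample in [0.5, 0.7, 0.9, 1.1]:
    rra.update(sample)
print(rra.values())  # the first sample (0.5) has been overwritten
```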
  • 31. SysMES: multi-layered, scalable, decentralized, fault-tolerant, dynamic framework; industry standards: XML, EJB, JBoss, CIM...; inventory module; monitoring; rule-based tool set.
  • 32. IT department: databases, virtual machines, storage, mail, web, etc.; Linux, Windows, Mac; 9k servers: 17k NICs, 70k cores, 70k disks, 205 TB of memory; 2600+ switches; 3.5 MW facility.
  • 33. IT department
  • 34. LEMON (LHC Era Monitoring): part of the ELFms tool suite (Extremely Large Fabric management system); for Linux systems; server/client based; monitoring agent on each monitored node; server centralizes all the data (push/pull); lemon-web, lemon-cli; alarm system; rule engine.
  • 35. LEMON client. Agent: starts and configures sensors. Sensor: implements metric classes. Metric class: a given measure (e.g. CPU load). Metric instance: a given measure in a given configuration.
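The distinction between a metric class and a metric instance can be sketched as a small data model. All names, identifiers and parameters below are hypothetical and chosen for illustration; they do not reflect LEMON's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricClass:
    """A kind of measurement, e.g. CPU load."""
    name: str


@dataclass(frozen=True)
class MetricInstance:
    """A metric class bound to one concrete configuration."""
    metric_class: MetricClass
    metric_id: int   # hypothetical identifier
    params: tuple    # hypothetical configuration, as (key, value) pairs

    def describe(self) -> str:
        settings = ", ".join(f"{key}={value}" for key, value in self.params)
        return f"{self.metric_class.name}[{self.metric_id}]({settings})"


cpu_load = MetricClass("cpu_load")
instances = [
    MetricInstance(cpu_load, 1, (("period_s", 60),)),
    MetricInstance(cpu_load, 2, (("period_s", 300),)),
]
for instance in instances:
    print(instance.describe())
```

Here one metric class (CPU load) yields two metric instances that differ only in their sampling period.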
  • 37. LEMON in the computer center: 11k entities monitored; 250 metric classes; 1k unique metrics; 1.7 million distinct pieces of information; 1 year of DB: 6 TB; 8 years of RRD files: 45 GB.
  • 38. WLCG (Worldwide LHC Computing Grid): 35 countries over the globe; 170 sites; very heterogeneous environment.
  • 39. Service Availability Monitoring: fully distributed monitoring framework; based on open-source systems; high scalability; advanced notification and reporting system; web interface and REST API.
  • 41. Multiple components: aggregated topology database; profile database; result database; messaging: ActiveMQ; probes; monitoring: Nagios.
  • 42. Setup: 45 Nagios instances; 800k records per day; 6 months of history: 800 GB.
  • 43. Agile Project. Why: the CERN data center is reaching its limits; custom tools are high-maintenance; 2015: 15k servers, 300k VMs needed. Solution: become standard.
  • 44. Agile Project. Monitoring: >30 monitoring applications; 40k producers; 280 GB per day. Technologies: producers (e.g. LEMON); aggregation: Apollo; processing: HBase. Future architecture.
  • 45. Atlas Automated and Intelligent Assistant: based on Esper; complex event processing; pattern recognition; time-based event correlation; SQL-like language.
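Time-based event correlation can be illustrated with a sliding window over a stream of events. The sketch below is plain Python, not Esper's query language, and the rule it implements (alert when several distinct hosts report the same event type within a time window) as well as the host and event names are assumptions made for the example.

```python
from collections import deque
from typing import Deque, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, host, event type)


def correlate(events: List[Event], event_type: str,
              window_s: float, min_hosts: int) -> List[float]:
    """Return the timestamps at which at least `min_hosts` distinct hosts
    reported `event_type` within the last `window_s` seconds."""
    recent: Deque[Event] = deque()
    alerts: List[float] = []
    for ts, host, kind in sorted(events):
        if kind != event_type:
            continue
        recent.append((ts, host, kind))
        # Drop events that have fallen out of the sliding time window.
        while recent and ts - recent[0][0] > window_s:
            recent.popleft()
        if len({h for _, h, _ in recent}) >= min_hosts:
            alerts.append(ts)
    return alerts


events = [
    (0.0, "node01", "nfs_timeout"),
    (5.0, "node02", "nfs_timeout"),
    (8.0, "node03", "nfs_timeout"),
    (100.0, "node01", "nfs_timeout"),
]
print(correlate(events, "nfs_timeout", window_s=10.0, min_hosts=3))
```

Three hosts reporting the same timeout within ten seconds point to a shared cause, whereas the isolated event at 100 s does not trigger an alert.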
  • 46. LHCb expert system. Caution: still experimental. Phronesis: Linux systems only; diagnostics & recovery; automatic dependency discovery; reinforcement learning; experience sharing.
  • 47. LHCb expert system. Experience sharing: learning speed comparison.
  • 48. Conclusion. To conclude: only a small fraction of the monitoring at CERN was presented; many aspects were not addressed; a single tool cannot do everything; aggregate tools and results; use the results; more smartness; participate in the community.
  • 50. Acknowledgment. Many thanks to Pedro Andrade, Sergio Ballestrero, Alastair Bland, Felix Ehm, Ivan Fedorko, Thorsten Kollegger, Wojciech Lapka, Olivier Raginel, Adriana Telesca and Falco Vennedey for the interesting discussions and materials, and to Jorg Wenninger for the picture from the Jura.
  • 51. Sources. Alice HLT monitoring: http://iopscience.iop.org/1742-6596/331/5/052003/ , http://cdsweb.cern.ch/record/1454269. Atlas monitoring: http://cdsweb.cern.ch/record/1455464 , http://cdsweb.cern.ch/record/1450129. C2MON: http://wikis.cern.ch/display/C2MON/C2MON+Home. CERN Agile: https://twiki.cern.ch/twiki/pub/Main/TimBellPresentationList/20120524 CERN Data Centre Evolution v2.ppt. CERN experiments: http://public.web.cern.ch/public/en/lhc/LHCExperimentsen.html. CERN IT department: http://information-technology.web.cern.ch/. CMS system monitoring: CHEP 2012 proceedings, "Health And Performance Monitoring Of The Online Computer Cluster Of CMS", G. Bauer. Esper: http://esper.codehaus.org/. Ganglia: http://ganglia.sourceforge.net/. Icinga: https://www.icinga.org/.
  • 52. Sources. LEMON: http://lemon-monitoring.web.cern.ch/. Nagios: http://www.nagios.org/. Phronesis: CHEP 2012 proceedings, "Artificial intelligence in the service of system administrators", C. Haen. SAM: http://tomtools.cern.ch/confluence/display/SAMWEB/Home , http://gridmonitoring.cern.ch/mywlcg. SysMES: http://wiki.kip.uni-heidelberg.de/ti/SysMES/index.php/Main Page.