James Casey, CERN, IT-GT-TOM 1st ROC LA Workshop, 6th October 2010 Grid Infrastructure Monitoring
WLCG Grid Infrastructure Monitoring

An overview of the WLCG monitoring toolset, based on Nagios, ActiveMQ, Django and MySQL, for monitoring the grid used by the Large Hadron Collider.

Published in: Technology
  1. 1. James Casey, CERN, IT-GT-TOM 1st ROC LA Workshop, 6th October 2010 Grid Infrastructure Monitoring
  2. 2. Tools for WLCG Monitoring <ul><li>WLCG provides a set of tools for operational monitoring and management </li></ul><ul><li>Aim is to </li></ul><ul><ul><li>Enable sites to operate a reliable infrastructure </li></ul></ul><ul><ul><li>Report on the reliability and usage to WLCG users </li></ul></ul><ul><li>Many of the tools developed previously within EGEE/OSG </li></ul><ul><ul><li>Now operated by EGI.eu, OSG, other NGIs </li></ul></ul>
  3. 3. Tools <ul><li>GOCDB </li></ul><ul><ul><li>Configuration management </li></ul></ul><ul><li>SAM/Nagios </li></ul><ul><ul><li>Checking the operational status of resources </li></ul></ul><ul><li>Gstat </li></ul><ul><ul><li>Information system monitoring and reporting </li></ul></ul><ul><li>Gridview </li></ul><ul><ul><li>Availability/reliability calculation and reporting </li></ul></ul><ul><li>Gridmap </li></ul><ul><ul><li>High level views of the infrastructure </li></ul></ul>
  4. 4. Tools <ul><li>I will talk about most of the previous tools </li></ul><ul><ul><li>SAM/Nagios, Gstat, Gridview, Gridmap </li></ul></ul><ul><li>Others exist too </li></ul><ul><ul><li>Accounting – APEL </li></ul></ul><ul><ul><li>VO Cards - CIC Portal </li></ul></ul><ul><ul><li>1st-line support - Operations Dashboard </li></ul></ul><ul><li>And in OSG </li></ul><ul><ul><li>OIM, MyOSG, Gratia, … </li></ul></ul>
  5. 5. Open-source at the core – Avoid NIH! <ul><li>All these tools depend on common low-level components </li></ul><ul><ul><li>Nagios – an open-source monitoring system </li></ul></ul><ul><ul><li>Apache ActiveMQ – an open-source messaging system </li></ul></ul><ul><li>We use many other open-source components when developing these tools </li></ul><ul><ul><li>Python, Django, jQuery, RRD, Google Charts </li></ul></ul><ul><li>A short detour on what they are and why we use them </li></ul>
  6. 6. Nagios <ul><li>What is Nagios? </li></ul><ul><ul><li>open source monitoring framework </li></ul></ul><ul><ul><li>highly flexible with advanced features </li></ul></ul><ul><ul><li>widely used & actively developed </li></ul></ul><ul><li>Why do we need it? </li></ul><ul><ul><li>Many tests need to be scheduled for execution </li></ul></ul><ul><ul><li>avoid development & maintenance of home-grown tools </li></ul></ul><ul><ul><li>provide solution that site admins are familiar with </li></ul></ul><ul><ul><ul><li>Nagios is a standard monitoring component at many sites </li></ul></ul></ul>
  7. 7. Nagios Architecture <ul><li>Nagios Core </li></ul><ul><ul><li>Scheduler: Runs checks at a predefined interval </li></ul></ul><ul><li>Plugins </li></ul><ul><ul><li>Scripts used to check particular pieces of functionality </li></ul></ul><ul><li>Web interface </li></ul><ul><li>Powerful notification system </li></ul><ul><ul><li>E-mail, SMS, Pager, … </li></ul></ul><ul><li>All parts are pluggable and extensible </li></ul>
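To make the "Plugins" bullet concrete, here is a minimal sketch of a Nagios-style check in Python. It relies only on the standard plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one line of status text on stdout). The check itself (`check_queue.py`, counting waiting jobs against thresholds) is a hypothetical example and not one of the WLCG probes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a Nagios state and a one-line status message."""
    if value is None:
        return UNKNOWN, "QUEUE UNKNOWN - no measurement available"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"QUEUE {LABELS[state]} - {value} jobs waiting"


def main(argv):
    # Hypothetical usage: check_queue.py <jobs_waiting> <warn> <crit>
    try:
        value, warn, crit = (int(arg) for arg in argv[1:4])
    except ValueError:
        print("QUEUE UNKNOWN - usage: check_queue.py <jobs_waiting> <warn> <crit>")
        return UNKNOWN
    state, message = evaluate(value, warn, crit)
    print(message)  # Nagios shows the first line of output in its web interface
    return state    # a real plugin would pass this to sys.exit()
```

When run as a script, the plugin would end with `sys.exit(main(sys.argv))`, and the Nagios scheduler would then turn the exit code into a service state and, if configured, a notification.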
  8. 8. Nagios Web Interface
  9. 9. Site Nagios – CE Tests
  10. 10. Messaging <ul><li>What is a messaging system? </li></ul><ul><ul><li>Method of communication between applications </li></ul></ul><ul><ul><li>Standardized, asynchronous and scalable communication between distributed entities </li></ul></ul><ul><ul><li>Reliable network of brokers that provides guaranteed delivery of messages </li></ul></ul><ul><ul><li>Messaging is for applications what IM is for people </li></ul></ul><ul><ul><li>Mainly acts as an integration framework between many separate applications </li></ul></ul>
  11. 11. JMS messaging models
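JMS defines two messaging models: point-to-point queues and publish/subscribe topics. The sketch below shows how their delivery semantics differ. `ToyBroker` is an in-memory stand-in written for illustration only; it is not the JMS API or an ActiveMQ client, and the destination names are made up.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a broker; not a real JMS or ActiveMQ client."""

    def __init__(self):
        self.queues = {}       # queue name -> pending messages
        self.subscribers = {}  # topic name -> list of callbacks

    # Point-to-point (queue): each message is consumed by exactly one receiver.
    def send(self, queue, message):
        self.queues.setdefault(queue, deque()).append(message)

    def receive(self, queue):
        pending = self.queues.get(queue)
        return pending.popleft() if pending else None

    # Publish/subscribe (topic): each message is delivered to every subscriber.
    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers.get(topic, []):
            callback(message)


broker = ToyBroker()

# Queue: two receive calls compete, so the message is handed out only once.
broker.send("results.queue", "CE test: OK")
first = broker.receive("results.queue")
second = broker.receive("results.queue")

# Topic: both subscribers get their own copy of the message.
site_view, region_view = [], []
broker.subscribe("results.topic", site_view.append)
broker.subscribe("results.topic", region_view.append)
broker.publish("results.topic", "SE test: CRITICAL")
```

Here `first` holds the message and `second` is `None`, while `site_view` and `region_view` each contain the published result. A topic is the natural fit when several consumers need the same stream of test results.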
  12. 12. Why messaging? <ul><li>Why do we need it? </li></ul><ul><ul><li>Interaction between distributed monitoring components </li></ul></ul><ul><ul><li>Standard interfaces enable easy integration of monitoring software </li></ul></ul><ul><ul><li>Scalable </li></ul></ul><ul><ul><ul><li>Main use-cases are in finance for high message rate (> 1M/sec) reliable multicast, e.g. on a trading floor </li></ul></ul></ul><ul><ul><li>Reliable </li></ul></ul><ul><ul><li>Distributed </li></ul></ul><ul><ul><li>Messaging is a pre-existing grid-scale technology </li></ul></ul>
  13. 13. Implementation details <ul><li>FUSE Message Broker </li></ul><ul><ul><li>based on Apache ActiveMQ </li></ul></ul><ul><li>Good performance characteristics </li></ul><ul><ul><li>1K – 20K messages per second depending on features used </li></ul></ul><ul><li>Distributed network of 4 brokers, hosted by EGI.eu </li></ul><ul><ul><li>CERN, Croatia, Greece </li></ul></ul><ul><ul><li>Provides reliability and locality </li></ul></ul>
  14. 14. Vendor tests From “Optimizing FUSE Message Broker” - http://open.iona.com/resources/collateral/#whitepapers
  15. 15. CERN openlab tests
  16. 16. Messaging is a key technology for WLCG <ul><li>WLCG Experiments are buying into messaging </li></ul><ul><ul><li>ATLAS DDM </li></ul></ul><ul><ul><ul><li>Moving to a production messaging service </li></ul></ul></ul><ul><ul><li>Ganga </li></ul></ul><ul><ul><li>VO Job monitoring </li></ul></ul><ul><ul><li>Alice data transfers </li></ul></ul><ul><li>dCache can use it for distributed pools </li></ul><ul><ul><li>Developments by NDGF </li></ul></ul><ul><li>CERN Beams use it for monitoring in the control room </li></ul>
  17. 17. SAM and Nagios <ul><li>Service Availability Monitoring (SAM) </li></ul><ul><ul><li>A distributed monitoring system </li></ul></ul><ul><ul><li>Based on open-source components </li></ul></ul><ul><ul><ul><li>Nagios for test execution </li></ul></ul></ul><ul><ul><ul><li>ActiveMQ for communication via messaging </li></ul></ul></ul><ul><ul><li>With custom visualization </li></ul></ul><ul><ul><ul><li>MyEGEE/MyWGI/MyWLCG/... </li></ul></ul></ul><ul><li>Aims to test all resources on the production grid </li></ul><ul><ul><li>And provides data to other components for availability and reliability calculation </li></ul></ul>
  18. 18. Nagios at a Site <ul><li>Simplest model </li></ul><ul><ul><li>A site wants fabric monitoring for the grid services </li></ul></ul><ul><li>Download ‘EGEE-Nagios’ meta-package </li></ul><ul><li>Configure it as a site Nagios </li></ul><ul><ul><li>Point at your site BDII </li></ul></ul><ul><ul><li>Give it a certificate & email of local administrator </li></ul></ul><ul><li>Nagios will now test all resources at your site </li></ul><ul><ul><li>Mails admin list on errors </li></ul></ul><ul><ul><li>Provides web interface for more details </li></ul></ul><ul><ul><li>Detailed low-level tests for all services </li></ul></ul>
  19. 19. Nagios Web Interface
  20. 20. Site Nagios – CE Tests
  21. 21. Nagios at the region <ul><li>An NGI or ROC monitors all its sites </li></ul><ul><ul><li>“Simulates user actions via the public interfaces” </li></ul></ul><ul><ul><li>At a higher level than the site monitoring </li></ul></ul><ul><li>Allows regional operations to help manage the site </li></ul><ul><li>Feeds into availability calculations </li></ul><ul><li>Feeds back into the site monitoring </li></ul><ul><ul><li>You see the view the ROC has of you </li></ul></ul><ul><ul><li>And it can trigger local alerts into the operational process </li></ul></ul>
  22. 22. Architecture - Regions
  23. 23. Architecture
  24. 24. Current Status <ul><li>27 national level Nagios servers </li></ul><ul><ul><li>Should grow out to full WLCG scale in next few months </li></ul></ul><ul><li>Clients distributed across 40 countries </li></ul><ul><li>315 sites </li></ul><ul><li>5K services </li></ul><ul><li>500,000 test results/day </li></ul><ul><li>5 consumers of full data stream to database for analysis and post processing </li></ul>
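As a rough back-of-the-envelope check, the figures on this slide can be turned into average rates. The sketch uses only the numbers quoted above (500,000 test results/day, 5K services) and assumes the load is spread evenly over the day, which real traffic will not be.

```python
RESULTS_PER_DAY = 500_000
SERVICES = 5_000
SECONDS_PER_DAY = 24 * 60 * 60

avg_rate = RESULTS_PER_DAY / SECONDS_PER_DAY      # results per second
per_service_per_day = RESULTS_PER_DAY / SERVICES  # results per service per day
minutes_between = 24 * 60 / per_service_per_day   # average gap per service

print(f"{avg_rate:.2f} results/s on average")
print(f"{per_service_per_day:.0f} results per service per day")
print(f"one result per service every {minutes_between:.1f} minutes")
```

This works out to roughly 5.8 results per second on average, about 100 results per service per day, or one result per service every 14.4 minutes. That average sits well below the 1K–20K messages per second quoted for the broker.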
  25. 25. MyEGI homepage
  26. 26. MyEGI heatmap view
  27. 27. MyEGI Services view
  28. 28. MyEGI service status drilldown
  29. 29. Computation of Availability Metrics <ul><li>Gridview computes Service Availability Metrics per VO using SAM test results </li></ul><ul><li>Computed Metrics include </li></ul><ul><ul><li>Service Status, Availability, Reliability </li></ul></ul><ul><li>All Metrics are computed </li></ul><ul><ul><li>per Service Instance, per Service (e.g. CE) for a site </li></ul></ul><ul><ul><li>per Site, Aggregate of all Tier-1/0 sites </li></ul></ul><ul><li>Various periodicities like Hourly, Daily, Weekly and Monthly </li></ul><ul><li>Also shows: </li></ul><ul><ul><li>statistics of data transfers, FTS file transfers, jobs running </li></ul></ul>
  30. 30. Metric Computation Example <ul><li>Consider Status for a day as </li></ul><ul><ul><li>UP – 12 Hrs </li></ul></ul><ul><ul><li>Scheduled Down – 6 Hrs </li></ul></ul><ul><ul><li>Unknown – 6 Hrs </li></ul></ul><ul><li>Availability Graphs (1st bar in Graph) would show </li></ul><ul><ul><li>Availability (Green) – 50 % </li></ul></ul><ul><ul><li>Sch. Down (Yellow) – 25 % </li></ul></ul><ul><ul><li>Unknown (Grey) – 25 % </li></ul></ul><ul><li>Reliability Graph (1st bar in Graph) would show 100% </li></ul><ul><li>Reliability = Availability(Green) / (Availability(Green) + Unscheduled Downtime(Red)) </li></ul><ul><li>Reliability is not affected by Scheduled Downtime or Unknown Interval </li></ul>Sample Reliability Graph Sample Availability Graph
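The worked example above can be reproduced in a few lines of Python. This is a sketch of the two formulas as stated on the slide, not Gridview's actual code. The function names are made up, and returning `None` when there is no up or unscheduled-down time is an assumption about an edge case the slide does not cover.

```python
def availability(up, sched_down, unsched_down, unknown):
    """Fraction of the whole period during which the service was up."""
    total = up + sched_down + unsched_down + unknown
    return up / total


def reliability(up, unsched_down):
    """Up time relative to up plus unscheduled down time.

    Scheduled downtime and unknown intervals do not enter the formula.
    """
    if up + unsched_down == 0:
        return None  # assumption: undefined when nothing was measured
    return up / (up + unsched_down)


# The day from the slide: 12 h up, 6 h scheduled down, 6 h unknown.
slide_availability = availability(up=12, sched_down=6, unsched_down=0, unknown=6)
slide_reliability = reliability(up=12, unsched_down=0)

# A hypothetical day in which the 6 unknown hours were unscheduled downtime instead.
other_availability = availability(up=12, sched_down=6, unsched_down=6, unknown=0)
other_reliability = reliability(up=12, unsched_down=6)
```

The slide's day gives an availability of 0.5 and a reliability of 1.0. In the hypothetical second day the availability is still 0.5, but the reliability drops to 12/18, about 0.67, because unscheduled downtime does count against it.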
  31. 31. Service & Site Service Status Calculation <ul><li>Definitions </li></ul><ul><ul><li>Service = a service type (e.g. CE, SE, sBDII, ...) </li></ul></ul><ul><ul><li>Service instance (si) = (service, node) combination </li></ul></ul><ul><li>Test Results: test status per (test, si, vo) </li></ul><ul><li>Service Instance Status: aggregate test status per (si, vo), ANDing, considering only critical tests for a VO </li></ul><ul><ul><li>Service marked as scheduled down (sd) → sd </li></ul></ul><ul><ul><li>All test statuses are ok → up </li></ul></ul><ul><ul><li>At least one test status is down (failed) → down </li></ul></ul><ul><ul><li>No test status down and at least one test status is unknown → unknown </li></ul></ul><ul><li>Service Status: aggregate service instance status for site services per (site, service, vo), ORing </li></ul><ul><ul><li>At least one service instance status up → up </li></ul></ul><ul><ul><li>No instance up and at least one is sd → sd </li></ul></ul><ul><ul><li>No instance up or sd and at least one instance is down → down </li></ul></ul><ul><ul><li>All instances are unknown → unknown </li></ul></ul><ul><li>Site Status: aggregate site service status per (site, vo), ANDing </li></ul><ul><ul><li>All service statuses up → up </li></ul></ul><ul><ul><li>At least one service status down → down </li></ul></ul><ul><ul><li>No service down and at least one is sd → sd </li></ul></ul><ul><ul><li>No service down or sd and at least one is unknown → unknown </li></ul></ul><ul><li>https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf </li></ul>
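The three aggregation steps on this slide (AND over critical tests, OR over service instances, AND over services) can be sketched as follows. This is an illustrative reading of the rules, not the Gridview implementation. Function names and the example hosts are hypothetical, and treating an empty input as `unknown` is an assumption.

```python
UP, DOWN, SD, UNKNOWN = "up", "down", "sd", "unknown"


def instance_status(test_statuses, scheduled_down=False):
    """AND the critical test results ('ok', 'down', 'unknown') of one instance."""
    if scheduled_down:
        return SD
    if not test_statuses:
        return UNKNOWN  # assumption: no results at all counts as unknown
    if "down" in test_statuses:
        return DOWN
    if "unknown" in test_statuses:
        return UNKNOWN
    return UP  # every critical test is ok


def service_status(instance_statuses):
    """OR the instances of one service type at a site: one good instance is enough."""
    for status in (UP, SD, DOWN):
        if status in instance_statuses:
            return status
    return UNKNOWN  # all instances unknown (or none listed, by assumption)


def site_status(service_statuses):
    """AND the services of a site: one bad service is enough."""
    if not service_statuses:
        return UNKNOWN  # assumption
    for status in (DOWN, SD, UNKNOWN):
        if status in service_statuses:
            return status
    return UP  # every service is up


# Hypothetical site with two CE instances and one SE instance.
ce01 = instance_status(["ok", "ok"])
ce02 = instance_status(["ok", "down"])
se01 = instance_status(["ok", "unknown"])
ce = service_status([ce01, ce02])
se = service_status([se01])
site = site_status([ce, se])
```

In this example the CE service is `up` because one of its two instances is up, the SE service is `unknown`, and the site as a whole is therefore `unknown`.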
  32. 32. Gridview – Site availability details
  33. 33. Gridview – ROC report
  34. 34. Gridview – ROC Drilldown
  35. 35. Gridview – Data Transfers
  36. 36. Gstat – Information System visualization <ul><li>Information system contains the middleware view of the infrastructure </li></ul><ul><li>Main usage: </li></ul><ul><ul><li>Service Discovery – what is there? </li></ul></ul><ul><ul><li>Installed Capacity – how much is there ? </li></ul></ul><ul><ul><li>VO Views – what can a VO use ? </li></ul></ul><ul><li>Gstat provides visual representation of this </li></ul><ul><ul><li>Management tool for NGI/WLCG managers </li></ul></ul><ul><ul><li>Debugging tool for site admins </li></ul></ul>
  37. 37. Gstat – LDAP Browser
  38. 38. Gstat – ROC Summary
  39. 39. Gstat - Site View drilldown
  40. 40. WLCG Topology view
  41. 41. GridMap Visualization <ul><li>Idea </li></ul><ul><ul><li>visualize the Grid by using Treemaps </li></ul></ul><ul><ul><li>(Grid + Treemap = GridMap) </li></ul></ul><ul><li>Example GridMap </li></ul><ul><ul><li>Rectangles are sites, grouped into regions </li></ul></ul><ul><ul><li>Size of rectangle is e.g. size of site (#CPUs), #running jobs, ... </li></ul></ul>
  42. 42. GridMap Visualization <ul><li>Idea </li></ul><ul><ul><li>visualize the Grid by using Treemaps </li></ul></ul><ul><ul><li>(Grid + Treemap = GridMap) </li></ul></ul><ul><li>Example GridMap </li></ul><ul><ul><li>Colour of rectangle is e.g. SAM status of site / service, availability of site / service, ... </li></ul></ul><ul><ul><li>Legend: ok, degraded, down </li></ul></ul>
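The treemap idea (rectangle size from a metric such as #CPUs, colour from status) can be sketched with a simple two-level "slice" layout: regions split the width, and sites split the height within their region. This is not the layout algorithm GridMap uses. The region and site names, CPU counts and colour palette are all made up for illustration.

```python
COLOURS = {"ok": "green", "degraded": "orange", "down": "red"}  # assumed palette


def slice_layout(items, x, y, width, height, horizontal=True):
    """Split a rectangle among (name, size) items in proportion to size."""
    total = sum(size for _, size in items)
    rects, offset = {}, 0.0
    for name, size in items:
        share = size / total
        if horizontal:
            rects[name] = (x + offset, y, width * share, height)
            offset += width * share
        else:
            rects[name] = (x, y + offset, width, height * share)
            offset += height * share
    return rects


def gridmap(regions, width=100.0, height=60.0):
    """Return {site: ((x, y, w, h), colour)} for {region: {site: (cpus, status)}}."""
    region_sizes = [(region, sum(cpus for cpus, _ in sites.values()))
                    for region, sites in regions.items()]
    region_rects = slice_layout(region_sizes, 0.0, 0.0, width, height)
    cells = {}
    for region, sites in regions.items():
        rx, ry, rw, rh = region_rects[region]
        site_sizes = [(site, cpus) for site, (cpus, _) in sites.items()]
        site_rects = slice_layout(site_sizes, rx, ry, rw, rh, horizontal=False)
        for site, rect in site_rects.items():
            cells[site] = (rect, COLOURS[sites[site][1]])
    return cells


# Hypothetical input: two regions, three sites.
cells = gridmap({
    "RegionA": {"SiteA1": (300, "ok"), "SiteA2": (100, "degraded")},
    "RegionB": {"SiteB1": (100, "down")},
})
```

Each site's rectangle area is proportional to its CPU count, so the largest site dominates the picture while the colour still flags the small failing one. Practical treemaps usually use a layout that keeps rectangles closer to square, which this sketch does not attempt.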
  43. 43. Multiple Views <ul><li>GridMaps can be used for top-level, geographical and VO views </li></ul><ul><li>Diagram labels: VO Views (cross-location); Top-level View; Geographical Views (Federation, Partner, Site, etc.); Next level of GridMaps; Large-scale Federated Grid Services Infrastructure; Global GridMap; Application Domain GridMap; Local GridMap (three shown); Alert; Corrective action; effect </li></ul>
  44. 44. Trends <ul><li>Trends can be understood by looking at a sequence of GridMaps </li></ul><ul><li>Site Availability over time: GridMaps for 25, 24, 23, 22, 21 and 20 Sep 2010 </li></ul>
  45. 45. More Views <ul><li>Correlations of metrics can be discovered by switching between different views </li></ul><ul><li>Site Availability from different VO perspectives: LHCb, CMS, Atlas, Alice, OPS (sites without colour do not support the VO) </li></ul><ul><li>Status of different Site Services: site BDII, SRM, SE, CE, Overall Site </li></ul>
  46. 46. Summary <ul><li>Wide range of tools available for you </li></ul><ul><li>Aim is to help you to manage your site </li></ul><ul><li>Integrates well with the gLite middleware and WLCG operational processes </li></ul><ul><li>The future leads towards better integrated portals for complete monitoring of your systems </li></ul><ul><ul><li>All open source </li></ul></ul><ul><ul><li>Contributions always welcome! </li></ul></ul>
  47. 47. Links <ul><ul><li>Nagios </li></ul></ul><ul><ul><ul><li>https://nagios.roc-la.org/nagios/ </li></ul></ul></ul><ul><ul><li>MyEGEE </li></ul></ul><ul><ul><ul><li>https://nagios.roc-la.org/myegee/ </li></ul></ul></ul><ul><ul><li>Gstat </li></ul></ul><ul><ul><ul><li>https://gstat-prod.cern.ch/ </li></ul></ul></ul><ul><ul><ul><li>https://gstat-wlcg.cern.ch/apps/topology/ </li></ul></ul></ul><ul><ul><li>Gridview Availability </li></ul></ul><ul><ul><ul><li>http://gridview.cern.ch/GRIDVIEW/same_index.php </li></ul></ul></ul><ul><ul><ul><li>https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf </li></ul></ul></ul><ul><ul><li>Gridmap </li></ul></ul><ul><ul><ul><li>http://gridmap.cern.ch/gm/ </li></ul></ul></ul>
  48. 48. SAM Demo <ul><li>Watch our demo: </li></ul><ul><ul><li>http://tinyurl.com/EgeeSAM (YouTube) </li></ul></ul><ul><ul><li>http://www.youtube.com/watch?v=PADq2x8q0kw </li></ul></ul>
  49. 49. Questions ?
