HEPiX 2003

  1. The new Fabric Management Tools in Production at CERN
     Thorsten Kleinwort, CERN IT/FIO
     HEPiX Autumn 2003, TRIUMF, Vancouver, Monday, October 20, 2003
  2. Contents
     - Introduction to CERN’s Fabric Management: Concepts
     - Framework for CERN’s Fabric Management: Tools
       - Configuration Mgmt
       - Software Mgmt
       - State Mgmt
       - Monitoring
  3. Concepts: The Node
     - The node is the manageable unit:
     - Autonomous:
       - Local configuration files
       - Programs work locally
       - No external dependencies
       - No remote management scripts
     - Adheres to the LSB (Linux Standard Base):
       - Init scripts in /etc/init.d/ start daemons
       - Logfile directory /var/log, logrotate
       - Config directory /etc
       - (System) programs in /(s)bin, /usr/(s)bin
  4. Concepts: Node -> Cluster
     - Nodes with the same functionality form a cluster (but not necessarily with the same HW)
     - Management tools enforce a uniform setup
     - Cluster size varies:
       - LXBATCH > 1000 nodes
       - LXPLUS ~ 70 nodes
       - LXMASTER (batch master) = 2 nodes
     - Critical servers are replaced by service clusters with redundant nodes
  5. Concepts: Principles
     - Software installs/updates through RPM
     - Configuration through one tool
     - Configuration information through one interface
     - Configuration information stored centrally
     - Installation, configuration and maintenance automated, but steerable
     - Reproducibility
  6. Framework
     [Diagram: on the node, a Mon Agent, a Cfg Agent (with a Config Cache) and a SW Agent (with a SW Cache) talk to the central Monitoring Manager, Config Manager, SW Manager, Hardware Manager and State Manager.]
  7. Framework
     [Diagram: the same picture with the configuration services named: CDB as the central Config Manager and CCM as the local Config Cache; SW Agent, Mon Agent, SW Manager, SW Cache, Monitoring Manager, Hardware Manager and State Manager as before.]
  8. Configuration (CDB & CCM)
     - CDB (Configuration Data Base):
     - Development of EU Data Grid (WP4)
     - CDB is the central configuration data base
     - Now ~ 1500 nodes, ~ 15 clusters
     - ~ 3200 configuration templates describe the nodes
     - Creates one (XML) profile per node
     - All information needed to install & run the nodes is now included
     - Currently 2 Linux versions: RH 7.3 & ES 2.1
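The per-node XML profile idea can be sketched as follows. This is an illustrative example only: the element names (profile, cluster, os, package) and attribute layout are hypothetical, not the actual CDB schema.

```python
# Sketch: parsing a hypothetical per-node XML profile of the kind CDB
# generates (element and attribute names are invented for illustration).
import xml.etree.ElementTree as ET

profile_xml = """
<profile name="lxbatch001">
  <cluster>LXBATCH</cluster>
  <os>RH7.3</os>
  <packages>
    <package name="openssh" version="3.6.1"/>
    <package name="lsf" version="5.1"/>
  </packages>
</profile>
"""

root = ET.fromstring(profile_xml)
node_name = root.get("name")                       # "lxbatch001"
cluster = root.findtext("cluster")                 # "LXBATCH"
# Map each declared package to its desired version:
packages = {p.get("name"): p.get("version") for p in root.iter("package")}
```

A local agent can consume such a profile entirely offline, which fits the "autonomous node" concept: all the information needed to configure the node travels with the node.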
  9. CDB (cont’d)
     - Additional information to be added (merged from other sources):
       - State information (-> SMS)
       - Monitoring information (-> MSA)
       - Vendor/contract/purchase information:
         - Encryption needed to store sensitive data
     - New, high-level interfaces are provided:
       - “Add/Rename Node”
       - Change node state
  10. CDB (cont’d)
     - Local caching on the node via CCM (Configuration Cache Manager):
       - In test phase, deployed on a few nodes
       - Runs a local daemon, which is notified when the node’s configuration information changes
       - Avoids load peaks on the CDB web servers
     - Besides the XML profiles, a new SQL interface:
       - Allows SQL queries on CDB
       - Needed for the cross-machine view (e.g. give me all nodes that belong to cluster X)
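The cross-machine query that the per-node XML profiles cannot answer is exactly what the SQL interface enables. A minimal sketch, using an in-memory SQLite table with invented table and column names (the real CDB schema is not shown in the slides):

```python
# Hypothetical sketch of a cross-machine view query:
# "give me all nodes that belong to cluster X".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, cluster TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [
    ("lxbatch001", "LXBATCH"),
    ("lxbatch002", "LXBATCH"),
    ("lxplus001", "LXPLUS"),
])

rows = conn.execute(
    "SELECT name FROM nodes WHERE cluster = ? ORDER BY name", ("LXBATCH",)
).fetchall()
names = [r[0] for r in rows]   # all LXBATCH nodes
```

With only per-node profiles, answering this question would mean opening every profile; one SQL query over a central view replaces that scan.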
  11. Framework
     [Diagram: the SW Agent is now concrete: SPMA on the node, fetching from the central SWRep via a local SWRep Cache; Cfg Agent, Mon Agent, CDB, CCM, Monitoring Manager, Hardware Manager and State Manager as before.]
  12. Software distribution (SPMA & SWRep)
     - SPMA (Software Package Management Agent):
     - Development of EU Data Grid (WP4)
     - The tool to install all software on the nodes:
       - Uses RPM for SW distribution on Linux
       - A version for the Solaris PKG package manager exists
     - We install between 700 and 1000 RPMs per node
     - Based on RPMT (an enhancement of RPM)
     - Crucial part of the framework
  13. SPMA (cont’d)
     - SPMA runs on every node (on demand)
     - Can manage either a subset or all packages:
       - We manage all packages on all clusters but one, which is for development
       - Missing packages are added, and
       - Unknown packages are removed
     - The package list is created from CDB, but SPMA is independent of CDB
     - SPMA supports rolling back package versions
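The "missing packages are added, unknown packages are removed" behaviour is a desired-state reconciliation, which reduces to a set difference between the package list from CDB and what is installed. A minimal sketch with example package names (not CERN's actual list):

```python
# Sketch of desired-state package reconciliation as the slide describes:
# compare the desired list (from CDB) against the installed list.
desired = {"openssh-3.6.1", "lsf-5.1", "kernel-2.4.20"}
installed = {"openssh-3.5.0", "lsf-5.1", "leftover-tool-1.0"}

to_add = sorted(desired - installed)      # missing packages are added
to_remove = sorted(installed - desired)   # unknown packages are removed
```

Because the diff is computed against the full installed set, any package not in the desired list is treated as unknown and removed, which is what enforces the uniform cluster setup.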
  14. SPMA & SWRep
     - SWRep (Software Repository):
     - Client-server tool suite for the storage of software packages
     - Universal:
       - Linux RPM / Solaris PKG
       - Multiple versions: RH 7.3, RH ES 2.1, RH 10
     - Management interface:
       - ACL mechanism for adding packages
       - Package list automatically kept up to date in CDB
  15. SPMA & SWRep (cont’d)
     - Addresses scalability:
     - HTTP as the SW distribution protocol
     - Load-balanced server cluster
     - Each SPMA run is randomly delayed within a 10-minute window
     - Pre-caching of SW packages on the node is possible
     - Currently installed on 1500 nodes
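The random delay is the simplest of these scalability measures and is worth spelling out: if 1500 nodes all fetched packages at the same instant, the repository servers would see a spike; spreading the starts uniformly over 10 minutes flattens it. A minimal sketch (function name and seeding are illustrative, not SPMA's actual code):

```python
# Sketch of the load-smoothing idea: each node waits a random amount
# of time within a 10-minute window before contacting SWRep.
import random

WINDOW_SECONDS = 10 * 60  # the 10-minute window from the slide

def spma_start_delay(rng=None):
    """Return a per-node start delay in [0, 600) seconds."""
    rng = rng or random.Random()
    return rng.uniform(0, WINDOW_SECONDS)

delay = spma_start_delay(random.Random(42))
```

With ~1500 nodes over 600 seconds, the expected arrival rate at the servers drops to a few requests per second instead of one burst.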
  16. Framework
     [Diagram: the Cfg Agent is now concrete: NCM on the node, next to SPMA and the Mon Agent; CDB, CCM, SWRep, SWRep Cache, Monitoring Manager, Hardware Manager and State Manager as before.]
  17. Configuration Tool (NCM)
     - NCM (Node Configuration Manager):
     - Local configuration tool
     - EU Data Grid (WP4) development
     - The first components have been (re-)written and are being tested on production nodes
     - Uses CDB for configuration information
     - Has had its first public release:
       - We have to transform all our SUE features into NCM components (~ 50)
       - The plan is to do this while migrating to the next Linux release
  18. Framework
     [Diagram: the monitoring side is now concrete: MSA on the node reporting to OraMon; SPMA, NCM, CDB, CCM, SWRep, SWRep Cache, Hardware Manager and State Manager as before.]
  19. Monitoring (MSA & OraMon)
     - LEMON (LHC Era Monitoring):
     - EU Data Grid (WP4) development
     - Client (MSA):
       - ~ 100 metrics are measured
       - Deployed on > 1500 nodes (more than are currently managed by CDB)
       - Configuration still to be put into CDB
     - Server (OraMon):
       - ORACLE database as back end
       - Stores current values as well as history
       - User API (in C, Perl, PHP, Tcl) in test phase
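The "current values as well as history" storage model can be sketched with a plain dictionary standing in for the ORACLE back end. The metric name and function names are illustrative, not OraMon's actual API:

```python
# Sketch of the current-value-plus-history idea behind OraMon:
# each metric keeps a time-ordered series, and "current" is its tail.
from collections import defaultdict

history = defaultdict(list)   # metric name -> [(timestamp, value), ...]

def record(metric, timestamp, value):
    """Append one sample; history is kept, not overwritten."""
    history[metric].append((timestamp, value))

def current(metric):
    """The current value is simply the most recent sample."""
    return history[metric][-1][1]

record("load_avg", 1000, 0.7)
record("load_avg", 1060, 1.2)
```

Keeping the full series rather than just the latest value is what allows trend queries later, at the cost of the database growing with every sample from ~100 metrics on >1500 nodes.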
  20. Framework
     [Diagram: the state side is now concrete: HMS and SMS replace the Hardware Manager and State Manager; SPMA, NCM, MSA, CDB, CCM, OraMon, SWRep and SWRep Cache as before.]
  21. State Management (SMS & HMS)
     - LEAF (LHC Era Automated Fabric):
     - HMS (Hardware Management System) controls & tracks:
       - Node installation
       - Node move & reinstall (rename)
       - Node retirement
       - Node repairs (vendor calls)
     - Remedy workflow application
     - Will interface to CDB
  22. HMS & SMS
     - SMS (State Management System):
     - Allows setting node states (in CDB)
     - Validates state transitions
     - Handles new machine arrivals (~ 400 in November)
     - Uses SOAP to interface to CDB
     - Working prototype
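State-transition validation is the core of SMS: a requested state change is only accepted if it is legal from the node's current state. A minimal sketch; the state names and the allowed-transition table are invented for illustration and are not SMS's actual model:

```python
# Hypothetical sketch of SMS-style state-transition validation.
# Each state maps to the set of states it may legally move to.
ALLOWED = {
    "new": {"installing"},
    "installing": {"production"},
    "production": {"maintenance", "retired"},
    "maintenance": {"production", "retired"},
}

def validate(current_state, target):
    """Return the target state if the transition is legal, else raise."""
    if target not in ALLOWED.get(current_state, set()):
        raise ValueError(f"illegal transition {current_state} -> {target}")
    return target
```

Centralising the table in one place means a new machine arrival (state "new") cannot jump straight to "production" by accident, no matter which tool requests the change.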
  23. Tools
     [Diagram: the complete framework (SPMA, NCM and MSA on the node; CDB, CCM, OraMon, SWRep, SWRep Cache, HMS and SMS centrally) = QUATTOR + LEMON + LEAF]
  24. Tools: Examples
     - Batch system LSF:
       - Upgrade 4.2 -> 5.1 on > 1000 nodes within 15 min, without stopping batch (with pre-caching)
     - Kernel upgrade:
       - SPMA can handle multiple versions of the same package:
       - Allows the installation of a new kernel and the reboot into it to be separated in time
     - Security upgrades:
       - All security upgrades are done by SPMA (~ once a week):
         - SSH security upgrade
         - KDE upgrade (~ 400 MB per node)
  25. References
     - EU Data Grid: http://www.eu-datagrid.org
     - EDG WP4: http://cern.ch/hep-proj-grid-fabric
     - QUATTOR: http://quattor.org
     - LEMON: http://cern.ch/lemon
     - LEAF: http://cern.ch/leaf
     - CERN IT/FIO: http://cern.ch/it-div-fio