Session 33 - Production Grids


  • Slide note (gLite middleware diagram legend): yellow – gLite components; green – externally supported components, gLite consortium
1. Overview of Production Grids
   Steven Newhouse

2. Contents
   • Open Science Grid
   • DEISA
   • NAREGI
   • Nordic DataGrid Facility
   • EGEE
   • TeraGrid
   • EGI

3. Open Science Grid
   Ruth Pordes
4. Open Science Grid
   • Consortium: >100 member organizations contributing resources, software, applications, and services.
   • Project: funded by DOE and NSF to deliver to the OSG Consortium over 5 years (2006-2011), 33 FTEs. VO science deliverables are OSG's milestones.
   • Collaboratively focused: partnerships, international connections, multidisciplinary work.
   • Satellites – independently funded projects contributing to the OSG Consortium program and vision:
     - CI-Team User and Campus Engagement
     - VOSS study of Virtual Organizations
     - CILogon integration of end-point Shibboleth identity management into the OSG infrastructure
     - funding for students to attend the International Summer School for Grid Computing 2009
5. OSG & Internet2 Work Closely with Universities
   (TeraGrid'09, Jun. 23, 2009 – Paul Avery)
   • ~100 compute resources, ~20 storage resources
   • ~70 modules in the software stack
   • ~35 user communities (VOs)
   • 600,000-900,000 CPU-hours/day, 200K-300K jobs/day, >2000 users
   • ~5 other infrastructures, ~25 resource sites
6. Users
   Nearly all applications are high-throughput; a small number of users are starting MPI production use.
   • Major accounts: US LHC, LIGO
     - ATLAS and CMS: >3000 physicists each
     - US ATLAS & US CMS Tier-1, 17 Tier-2s, and a new focus on Tier-3s (~35 today, expect ~70 in a year)
     - ALICE task force to show usability of the OSG infrastructure for their applications
     - LIGO Einstein@Home
   • US physics community: Tevatron (CDF & D0, FNAL and remote sites); other Fermilab users (neutrino, astro, simulation, theory); STAR; IceCube
   • Non-physics: ~6% of usage; ~25 single PIs or small groups from biology, molecular dynamics, chemistry, weather forecasting, mathematics, protein prediction
   • Campus infrastructures: ~7, including universities and labs
7. Non-physics use is highly cyclic
8. Operations
   • All hardware is contributed by members of the Consortium
   • Distributed operations infrastructure including security, monitoring, registration, and accounting services
   • Central ticketing system; 24x7 problem reporting and triaging at the Grid Operations Center
   • Distributed set of Support Centers as the first line of support for VOs, services (e.g. software), and sites
   • Security incident response teams include Site Security Administrators and VO Security Contacts
   • Software distribution, patches (security), and updates
   • Targeted production, site, and VO support teams
9. OSG Job Counts (2008-9)
   100M jobs cumulative; 300K jobs/day
10. Software
    • OSG Virtual Data Toolkit (VDT): a packaged, tested, distributed, and supported software stack used by multiple projects – OSG, EGEE, NYSGrid, TG, APAC, NGS
    • ~70 components covering Condor, Globus, security infrastructure, data movement, storage implementations, job management and scheduling, network monitoring tools, validation and testing, monitoring/accounting/information services, and needed utilities such as Apache and Tomcat
    • Server, User Client, and Worker-Node/Application Client releases
    • Built and regression tested using the U. of Wisconsin-Madison Metronome system
    • Pre-release testing on 3 "VTB" sites – UofC, LBNL, Caltech; post-release testing of major releases on the Integration Testbed
    • Distributed team at U. of Wisconsin, Fermilab, and LBNL
    • Improved support for incremental upgrades in the OSG 1.2 release (summer '09)
    • OSG configuration and validation scripts are distributed to use the VDT
    • OSG does not develop software except for tools and contributions (extensions) to external software projects delivering to OSG stakeholder requirements
    • Identified liaisons provide bi-directional support and communication between OSG and external software provider projects
    • The OSG Software Tools Group oversees all software developed within the project
    • Software vulnerability and auditing processes are in place
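Since nearly all OSG applications are high-throughput and Condor is a core VDT component, a job is typically described in a submit file. A minimal sketch follows; the script name and job count are illustrative, not taken from the slides:

```
# Hypothetical HTCondor submit description for a high-throughput run
universe   = vanilla
executable = analyze.sh          # illustrative worker script
arguments  = $(Process)          # each instance receives its job index
output     = job.$(Process).out
error      = job.$(Process).err
log        = cluster.log
queue 100                        # 100 independent jobs in one cluster
```

Submitting this with `condor_submit` expands it into 100 queued jobs that Condor matches against available worker nodes.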
11. VDT Progress (1.10.1 just released)
    ~70 components
12. Partnerships and Collaborations
    • Partnerships with network fabric and identity service providers – ESnet, Internet2
    • Continuing bridging work with EGEE, SURAgrid, and TeraGrid
    • ~17 points of contact/collaboration with EGEE and WLCG
    • Partnership statement for EGI/NGIs
    • Emerging collaborations with TG on workforce training, software, and security
    • Creator (co-sponsor) of the successful international e-weekly Science Grid This Week
    • Co-sponsor of the ISSGC'09 school
    • Member of the Production Infrastructure Policy Group (OGF-affiliated)
13. Community Collaboratories
14. DEISA – Advancing Science in Europe
    H. Lederer, A. Streit, J. Reetz – DEISA (July 2009)
    RI-222919

15. DEISA consortium and partners
    • Eleven supercomputing centres in Europe: BSC, CSC, CINECA, ECMWF, EPCC, FZJ, HLRS, IDRIS, LRZ, RZG, SARA
    • Four associated partners: CEA, CSCS, JSCC, KTH
    • DEISA2 is co-funded by the European Commission under contract RI-222919
16. Infrastructure and Services
    • HPC infrastructure with heterogeneous resources – state-of-the-art supercomputers:
      - Cray XT4/5, Linux
      - IBM Power5/Power6, AIX/Linux
      - IBM BlueGene/P, Linux
      - IBM PowerPC, Linux
      - SGI Altix 4700 (Itanium2 Montecito), Linux
      - NEC SX8/9 vector systems, Super-UX
    • More than 1 PetaFlop/s of aggregated peak performance
    • Dedicated network: 10 Gb/s links provided by GÉANT2 and the NRENs
    • Continental shared high-performance filesystem (GPFS-MC, IBM)
    • HPC systems are owned and operated by the national HPC centres; DEISA services are layered and operated on top
    • Fixed fractions of the HPC resources are dedicated to DEISA
    • Europe-wide coordinated expert teams for operation, technology development, and application enabling and support
17. HPC Resource Usage
    • HPC applications from various scientific fields – astrophysics, earth sciences, engineering, life sciences, materials sciences, particle physics, plasma physics – require capability computing facilities (low-latency, high-throughput interconnects) and often application enabling and support
    • Resources are granted through the DEISA Extreme Computing Initiative (DECI, annual calls):
      - DECI call 2008: 42 proposals accepted, 50 million CPU-h granted*
      - DECI call 2009: 75 proposals currently under review, more than 200 million CPU-h requested*
      (* normalized to IBM P4+)
    • Over 160 universities and research institutes from 15 European countries, with co-investigators from four other continents, have already benefited
    • Virtual science community support:
      - 2008: EFDA, EUFORIA, VIROLAB
      - 2009: EFDA, EUFORIA, ENES, LFI-PLANCK, VPH/VIROLAB, VIRGO
18. Middleware
    Various services are provided at the middleware layer:
    • DEISA Common Production Environment (DCPE): a homogeneous software environment layer for the heterogeneous HPC platforms
    • High-performance data stage-in/out to GPFS: GridFTP
    • Workflow management: UNICORE
    • Job submission: UNICORE; WS-GRAM (optional); interactive use of local batch systems; remote job submission between IBM P6/AIX systems (LL-MC)
    • Monitoring system: INCA
    • Unified AAA: distributed LDAP and resource-usage databases
    Only a few software components are developed within DEISA; the focus is on technology evaluation, deployment, and operation. Bugs are reported to the software maintainers.
19. Standards
    DEISA has a vital interest in the standardization of interfaces to HPC services: job submission, job and workflow management, data management, data access and archiving, networking, and security (including AAA).
    • DEISA supports OGF standardization groups:
      - JSDL-WG and OGSA-BES for job submission
      - UR-WG and RUS-WG for accounting
      - DAIS for data services
      - engagement in the Production Grid Infrastructure WG
    • DEISA collaborates on standardization with other projects:
      - the GIN community
      - the Infrastructure Policy Group (DEISA, EGEE, TeraGrid, OSG, NAREGI), whose goal is seamless interoperation of the leading grid infrastructures worldwide: authentication, authorization, accounting (AAA); resource allocation policies; portal/access policies
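The JSDL documents standardized by the JSDL-WG are plain XML, so a minimal job description can be sketched with the Python standard library alone. The namespace URIs are the published OGF ones; the executable and arguments below are placeholders:

```python
# Sketch of a minimal JSDL (Job Submission Description Language) document,
# the OGF job-submission standard the slide refers to. Values are hypothetical.
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

def make_job(executable, *args):
    """Build a bare-bones JSDL JobDefinition for a POSIX application."""
    job = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
    for a in args:
        ET.SubElement(posix, f"{{{POSIX}}}Argument").text = a
    return job

doc = make_job("/bin/echo", "hello", "grid")
xml_text = ET.tostring(doc, encoding="unicode")
```

A BES-style service would accept such a document at its CreateActivity operation; a real submission would add resource requirements and data-staging elements, which are omitted here.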
20. Status of CSI Grid (NAREGI)
    Kento Aida, National Institute of Informatics

21. Overview
    • Current status: pilot operation started in May 2009
    • Organization: computer centers at 9 universities act as resource providers; the National Institute of Informatics is the network provider (SINET3) and GOC
    • Funding: the organizations' own funding

22. Operational Infrastructure
23. Middleware
    • NAREGI middleware Ver. 1.1.3
    • Developer: National Institute of Informatics
    • Platforms: CentOS 5.2 + PBS Pro 9.1/9.2; OpenSUSE 10.3 + Sun Grid Engine v6.0
24. Nordic DataGrid Facility
    Michael Gronager
25. NDGF Organization
    • A co-operative Nordic data and computing grid facility
    • Nordic production grid, leveraging national grid resources
    • Common policy framework for the Nordic production grid
    • Joint Nordic planning and coordination
    • Operates the Nordic storage facility for major projects
    • Co-ordinates & hosts major eScience projects (e.g., the Nordic WLCG Tier-1)
    • Contributes to grid middleware and develops services
    • NDGF 2006-2010: funded (2 M€/year) by the national research councils (NOS-N) of the Nordic countries (DK, FI, IS, NO, SE)
34. NDGF Facility – 2009Q1

35. NDGF People – 2009Q2
36. Application Communities
    • WLCG – the Worldwide LHC Computing Grid
    • Bio-informatics sciences
    • Screening of reservoirs suitable for CO2 sequestration
    • Computational chemistry
    • Material science
    • And the more horizontal: common Nordic user administration, authentication, authorization & accounting
46. Operations
    • Operation team of 5-7 people, distributed over the Nordics
    • Collaboration between NDGF, SNIC, and NUNOC
    • An expert is available 365 days a year; 24x7 coverage by the regional REN
    • Runs rCOD + ROC for the Nordic + Baltic region, and the distributed sites (T1, T2s)
    • Sysadmins are well known by the operation team; continuous chatroom meetings
56. Middleware
    • Philosophy: we need tools to run an e-Infrastructure; tools cost money or in-kind effort, and in-kind means open-source tools – hence we contribute to the things we use:
      - dCache (storage) – a DESY, FNAL, NDGF ++ collaboration
      - ARC (computing) – a collaboration between Nordic, Slovenian, and Swiss institutes
      - SGAS (accounting) and Confusa (client certificates from IdPs)
      - BDII, WMS, SAM, AliEn, Panda – gLite/CERN tools
      - MonAmi, Nagios (monitoring)
66. NDGF now and in the future
    • The e-Infrastructure as a whole is important
    • Resources count capacity and capability computing and different network and storage systems
    • The infrastructure must support different access methods (grid, ssh, application portals, etc.) – note that the average grid use of shared resources is only around 10-25%
    • Uniform user management, identity, access, accounting, policy enforcement, and resource allocation and sharing – independent of access method, for all users

Enabling Grids for E-Science
   Steven Newhouse
72. Enabling Grids for E-Science
    (Project Status – Bob Jones – EGEE-III First Review, 24-25 June 2009)
    • Serves new scientific communities and established user communities through networking, middleware, and service activities
    • Duration: 2 years
    • Total budget: staff ~47 M€, hardware ~50 M€; EC contribution: 32 M€
    • Total effort: 9132 person-months (~382 FTE)
73. Project Overview
    (Technical Status – Steven Newhouse – EGEE-III First Review, 24-25 June 2009)
    • 17,000 users; 136,000 logical CPUs (cores); 25 PB disk; 39 PB tape
    • 12 million jobs/month (+45% in a year)
    • 268 sites (+5% in a year)
    • 48 countries (+10% in a year)
    • 162 VOs (+29% in a year)
74. Supporting Science
    • Disciplines: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high-energy physics, life sciences, multimedia, material sciences
    • End-user activity: 13,000 end-users in 112 VOs (+44% users in a year)
    • 23 core VOs (a core VO has >10% of the usage within its science cluster)
    • Proportion of HEP usage: ~77%
78. Operations
    • Monitored 24x7 on a regional basis
    • Central help desk for all issues, filtered to regional and specialist support units
79. gLite Middleware
    [Architecture diagram: user access via the User Interface; Security Services (Virtual Organisation Membership Service, proxy server with X.509 attributes, SCAS, gLExec, LCAS & LCMAPS, SAML authorization service, Hydra); Information Services (BDII with GLUE 2.0, MON); General Services (Workload Management Service, Logging & Bookkeeping Service, File Transfer Service, LHC File Catalogue, AMGA); Compute Elements (CREAM and LCG-CE with JSDL & BES interfaces, BLAH, worker nodes with gLExec) and Storage Elements (Disk Pool Manager, dCache, SRM, DMI) over the physical resources. Components are marked as either EGEE-maintained or external.]
80. TeraGrid
    Daniel S. Katz
81. TeraGrid Overview
    • TeraGrid is run by 11 resource providers (RPs) and integrated by the Grid Infrastructure Group (GIG, at the University of Chicago)
    • The TeraGrid Forum (made up of these 12 entities) decides policy by consensus (elected chair: John Towns, NCSA)
    • Funding comes as separate awards from the National Science Foundation to the 12 groups; the GIG sub-awards integration funding to the 11 RPs and some additional groups
    • Resources (distributed at the 11 RPs across the United States, connected by 10 Gbps paths):
      - 14 HPC systems (1.6 PFlops, 310 TB memory)
      - 1 HTC pool (105,000 CPUs)
      - 7 storage systems (3.0 PB on-line, 60 PB off-line)
      - 2 visualization systems (128 tightly integrated CPUs, 14,000 loosely coupled CPUs)
      - special-purpose systems (GPUs, FPGAs)
82. Applications Community
    Primarily HPC usage, but growing use of science gateways and workflows; lesser HTC usage. (Usage charts from 2006 and 2008.)
83. Operations Infrastructure
    • Many services hold this all together, keeping most things looking like one system to the users: allocations, helpdesk, accounting, web site, portal, security, data movement, information services, resource catalog, science gateways, etc.
    • In progress, but not yet in production: a single global file system, and identity management integrated with universities
    • Services are supported by the GIG; resources are supported by the RPs
84. Middleware
    • The Coordinated TeraGrid Software Stack is made of kits; all but one (Core Integration) are optional for RPs
    • Kits define a set of functionality and provide an implementation
    • Optional kits: Data Movement, Remote Login Capability, Science Workflow Support, Parallel Application Capability, Remote Compute, Application Development and Runtime Support Capability, Metascheduling Capability, Data Movement Servers Capability, Data Management Capability, Data Visualization Support, Data Movement Clients Capability, Local Resource Provider HPC Software, Wide Area GPFS File Systems, Co-Scheduling Capability, Advance Reservation Capability, Wide Area Lustre File Systems, Science Gateway Kit
    • (Status chart: kits across the top, resources along the left side; yellow means the kit is installed, white means it is not)
    • Some kits are now being rolled out (Science Gateway) and will become more widely used; some have limited functionality (Data Visualization Support) that only makes sense on some resources
85. TeraGrid
    • TeraGrid considers itself the world's largest open scientific computing infrastructure
    • Usage is free; allocations are peer-reviewed and available to all US researchers and their collaborators
    • TeraGrid is a platform on which others can build: application developers, science gateways
    • TeraGrid is a research project: learning how to do distributed, collaborative science on a continental-scale, federated infrastructure, and how to run multi-institution shared infrastructure
86. Common Characteristics
    • Operating a production grid requires WORK: monitoring, reporting, chasing, ...
    • There is no 'off the shelf' software solution: plenty of components exist, but they need verified assembly
    • There is no central control: distributed expertise leads to distributed teams
    • Resources are federated, so ownership lies elsewhere: the grid itself owns no hardware resources
    • All of it is driven by delivering to user communities
87. The Future in Europe: EGI
    • EGI: European Grid Initiative
    • The result of the EGI Design Study, a 2-year project to build community consensus
    • Move from project-based to sustainable funding, and leverage other sources of funding
    • Build on the national grid initiatives (NGIs), providing the European 'glue' around independent NGIs
88. The EGI Actors
    [Diagram: research teams and research institutes on top, EGI coordinating the National Grid Initiatives (NGI1, NGI2, ..., NGIn), and the resource centres underneath.]

89. EGI and NGI Tasks
    [Diagram: EGI tasks, NGI international tasks, and NGI national tasks distributed across the NGIs.]

90. Differences between EGEE & EGI
    [Diagram: EGI coordinating NGI operations, Specialised Support Centres, and the European Middleware Initiative (EMI).]
91. Middleware
    • EGI will release the UMD (Unified Middleware Distribution): the components needed to build a production grid
    • Initial main providers: ARC, gLite & UNICORE; the components are expected to evolve over time
    • Interfaces have been defined to enable multiple providers
    • The EMI project, formed from ARC, gLite & UNICORE, supports, maintains & harmonises the software and introduces and develops standards
92. Current Status
    • 8th July: EGI Council meeting – confirmation of the interim director; editorial team established for the EC proposals
    • 30th July: EC call opens
    • 1st October: financial contributions to the EGI Collaboration due
    • October/November: EGI established
    • 24th November: EC call closes
    • December 2009/January 2010: startup phase
    • Winter 2010: negotiation phase for the EGI projects
    • 1st May 2010: EGI projects launched
    (Plans for Year II – Steven Newhouse – EGEE-III First Review, 24-25 June 2009)
102. European Future?
    • Sustainability: the e-Infrastructure is vital and will underpin many research activities
    • The activity has to be driven by active stakeholders