Introduction to ECP’s newest Focus Area, Hardware and Integration (HI)

The mission of ECP is to deliver a capable exascale computing ecosystem, which means one that is affordable, usable, and useful. Reaching that goal requires deploying and integrating software and hardware R&D products into the high-performance computing environments at US Department of Energy (DOE) facilities.

Partnership with DOE facilities is essential to ECP’s success, and the Hardware and Integration (HI) focus area provides the integrating element. Through strong collaboration with the facilities, HI assists the Application Development (AD) focus area efforts targeted at and supported on pre-exascale and exascale systems.

HI also helps in the creation of a production exascale software environment. So that AD and Software Technology focus area developers can create exascale production software efficiently, HI ensures they have access to the compute systems and productivity tools they need. Further, HI fosters the acceleration of critical early hardware technologies for DOE exascale systems.

  1. 1. exascaleproject.org Introduction to ECP’s newest Focus Area, Hardware and Integration (HI) ECP Annual Meeting Terri Quinn, Director, Hardware and Integration Susan Coghlan, Deputy Director, Hardware and Integration Knoxville, Tennessee Feb 5-9, 2018
  2. 2. 2 Relevant Pre-Exascale and Exascale Systems for ECP. Pre-Exascale Systems: 2013: Titan (ORNL, Cray/NVidia K20, Open), Mira (Argonne, IBM BG/Q, Open), Sequoia (LLNL, IBM BG/Q, Secure); 2016: Cori (LBNL, Cray/Intel Xeon/KNL, Open), Theta (Argonne, Intel/Cray KNL, Open), Trinity (LANL/SNL, Cray/Intel Xeon/KNL, Secure); 2018: Summit (ORNL, IBM/NVidia P9/Volta, Open), Sierra (LLNL, IBM/NVidia P9/Volta, Secure); 2020: NERSC-9 (LBNL, TBD, Open), Crossroads (LANL/SNL, TBD, Secure). Exascale Systems, 2021-2023: A21 (Argonne, Intel/Cray, TBD, Open), Frontier (ORNL, TBD, Open), El Capitan (LLNL, TBD, Secure)
  3. 3. 3 HI enables integration of ECP’s products into HPC environments at the Facilities to realize ECP’s primary objectives and the Facilities' objectives. ECP products: Applications, Software, Early HW R&D. Facilities: ASCR and NNSA HPC Facilities. Hardware and Integration: Facility resource utilization; Training and productivity; SW deployment at Facilities; Application integration at Facilities; HW evaluation; PathForward. US vendor system offerings. HI is the result of a maturing of ECP’s thinking about the end game. AD and ST are our customers. Facilities are our customers.
  4. 4. 4 Facility ECP Engagement Plans. A strategic approach to: • Defining an engagement between ECP and the Facilities, including the Facilities' acquisition of pre-exascale and exascale systems • Purpose is to accelerate progress toward exascale • At the same time mitigating risks associated with such an ambitious goal • Achieving common objectives, eliminating duplication • Joint plans with commitments and activities • Rules of engagement, addressing a strategy for setting priorities, communication, and conflict resolution. Provide a high-level framework to enable actionable collaboration opportunities that are of mutual benefit to both the Facilities and ECP. Mission need. Objectives.
  5. 5. 5 HI is organized into six technical areas, and in every area we are partnering with AD, ST, and the Facilities. PathForward: Critical early vendor HW R&D for multiple exascale-capable system designs. HW evaluation: HW evaluations and expertise to influence R&D for HPC systems and advise and inform ECP and Facility staff about the system characteristics. Application integration: Facility support for ECP application development efforts to port and optimize for exascale or pre-exascale systems. Software deployment: Facility support for deploying ECP SW at the Facilities and integrating with each Facility’s exascale SW ecosystem. Facility resource utilization: Management of and reporting on compute resources made available to ECP through the Facilities. Training and productivity: Specialized training, lessons learned, and best practices for AD and ST teams and in partnership with Facilities.
  6. 6. 6 Meet the HI Leadership Team. Susan Coghlan, Hardware and Integration Deputy Director: Susan is a respected computer scientist and leader in high performance computing at ANL, with over 20 years as an extreme-scale supercomputer and systems integration expert. She is the Argonne Project Director for CORAL. Bronis de Supinski, PathForward (2.4.1): 5 years as the CTO for a Facility (Livermore Computing, LLNL). Simon Hammond, HW Evaluation (2.4.2): 7 years in scalable computer architecture and app performance at a Facility (ACES, Sandia). Judy Hill, Application Integration at Facilities (2.4.3): 9 years as a computational scientist and currently managing a user program for ASCR Leadership Computing Facilities (ALCF and OLCF). Dave Montoya, Software Deployment at Facilities (2.4.4): 26 years of experience with SW development and deployment at a Facility (ACES, LANL). Julia White, Facility Resource Utilization (2.4.5): 7 years of experience managing user programs at the ASCR Leadership Computing Facilities (ALCF and OLCF). Ashley Barker, Training and Productivity (2.4.6): 8 years as a group leader of user assistance and outreach at a Facility (OLCF).
  7. 7. 7 2.4.1 PathForward (Bronis de Supinski). Primary Objective: • Accelerate critical early hardware R&D leading to 3–5 viable exascale system designs for DOE Facilities
  8. 8. 8 PathForward funds 6 US HPC companies to accelerate technologies to maximize the energy efficiency and overall performance of future supercomputers: • Advanced Micro Devices (AMD) • Cray Inc. (CRAY) • Hewlett Packard Enterprise (HPE) • International Business Machines (IBM) • Intel Corp. (Intel) • NVIDIA Corp. (NVIDIA). Program details: • 3-year program ending early in 2020 • Total value of the work is ~$430M, including the companies’ contributions • Examples of work funded include: a) innovative memory architectures b) higher-speed interconnects c) improved system reliability d) approaches for increasing computing power without prohibitive increases in energy demand
  9. 9. 9 2.4.2 Hardware Evaluation (Simon Hammond). Primary Objective: • Analyze application and associated SW performance on architectures to guide HW and SW design. Secondary Objectives: • Serve as a conduit and outreach channel for hardware questions from the broader ECP project • Support DOE facilities, procurements, and operations • Ensure DOE gets the best possible hardware options for ECP and beyond
  10. 10. 10 2.4.2 Hardware Evaluation – Project Components. Hardware Evaluation (HE) covers five key areas (“working groups”): • Node-Level Simulation: Processor, Pipelines, Threading, Caches, Coherency, Network-on-Chip, Network Interfaces • Interconnect Modeling: Network Topologies, Congestion, Quality-of-Service, Silicon Photonics • Memory Technologies: Memory Media, Parallelism, Controllers, Non-Volatile, Coherency, Scratchpads • Analytic Modeling: High-Level Architecture Balance, “Ops-to-Bytes”, Instruction Mix, Branching, Vectorization • Abstract Machines and High-Level System Models: Outreach, Cross-ECP Communication, Non-NDA Models, High-Level Descriptions (outreach/cross-project responsibilities here)
  11. 11. 11 2.4.2 Hardware Evaluation – Project Support Role. Goal: Ensure DOE/ECP and vendors get the best possible support and outcomes for PathForward and Exascale procurements. • Collection of AD behavior data (e.g., traces, models) • Prototyping / emulation opportunities for some ECP ST projects • Cross-vendor assessment / comparison • DOE-based generation of models to replicate vendor research outcomes
  12. 12. 12 2.4.2 Hardware Evaluation – Outreach. Improve Awareness of Hardware Systems: • Abstract Machine Models (NDA-free overview of exascale hardware trends) • Hardware Evaluation teams at every one of the E6 laboratories to support ECP • Specialists across DOE to call upon • Engagement with ST to build best practices for future hardware systems. On ECP Confluence.
  13. 13. 13 2.4.2 Hardware Evaluation – Facilities and Procurements. Goals and Objectives: Exascale procurements will probably be some of the most complicated the DOE Facilities have ever run. Ensure we have quantitative data to drive the best possible outcomes for ECP. The Hardware Evaluation team will support DOE by providing performance analysis tools in-house. Examples: • “What is the impact on ECP application performance of buying a fat-tree network instead of a 7D torus?” • “Is it better to have 2x capacity or 2x performance for an Exascale memory system?”
  14. 14. 14 • Accelerate application readiness for exascale architectures through strong collaboration and partnership with DOE Facilities experts 2.4.3 Application Integration at Facilities Primary Objective Judy Hill HI engages Facility expertise to execute ECP scope Top priority for each application • Porting/optimizing help • HPC environment help AD, ST, Facilities Pre-exascale systems (intermediate stage for some apps) AD/ST/HI funded Facility staff working on porting and optimizing for their target system A21 Frontier El Capitan Form teams with AD and ST projects Application Development projects Software Technology projects AD and ST
  15. 15. 15 Management Team from the Facilities with strong application development experience Judy Hill, Frontier Application Integration (2.4.3.2) 9 years as a computational scientist at ORNL, a liaison in the OLCF’s Application Readiness effort for Summit, and currently managing a user program for ASCR Leadership Computing Facilities (ALCF and OLCF) Katherine Riley, A21 Application Integration (2.4.3.1) ALCF Director of Science with over 10 years of both facility and application readiness expertise for Intrepid, Mira, and A21 at the ALCF Deborah Bard, Pre-exascale System Application Integration (2.4.3.3) Big Data Architect with three years of experience at NERSC, strong application readiness efforts focused on NERSC’s Cori We are a resource to the ECP application and software teams for connections and touchpoints into the ASCR Computing Facilities.
  16. 16. 16 Application Integration at the Facilities: Accelerating application readiness for the exascale architectures. ECP Application Development Effort: Augment AD efforts with additional facilities expertise. DOE Compute Facilities: Provide AD efforts with access to (1) Facilities Vendor COEs and (2) production/development computing resources via ECP allocation program. • Planned Approach: Leverage DOE Facilities Application Readiness expertise (from OLCF CAAR, ALCF ESP, NERSC NESAP programs) by providing: – Facilities computational scientists and performance engineering expertise to AD teams – Access to Facilities Vendor Centers of Excellence
  17. 17. 17 Initial Effort: Listening Tour to the ASCR Computing Facilities • Common threads from these discussions: – Best Practices from the Application Readiness Efforts include • Close interaction with architecture experts through dungeon sessions and other vendor engagements • Application readiness staff should be embedded in the Facilities Application Readiness efforts • Post-docs are integral to the success of these projects • Facility staff and post-docs are challenging to recruit; Facilities depend on the application readiness projects to “bring people forward”. – Facility observations include • Facilities have a lot of expertise developing FOMs that are measurable and defendable • Underlying software technologies are often in the critical path to application performance. Understanding the software stack on which an application depends (and the underlying ST milestones) is necessary.
  18. 18. 18 2.4.4 Software Deployment at Facilities (Dave Montoya). Goals and Objectives: The overarching goal of this software integration effort is to work closely with the Facilities to establish the appropriate ECP software stack, utilizing ECP ST, Site SW, and Vendor integration, to meet the needs of their users targeting an exascale environment. Objectives: • Establish a well-oiled integration process utilizing Continuous Integration • Pull together cross-organization teams that understand the SW capabilities available through ECP and Facility Sites, and the need to integrate with Vendor SW stacks. Needed: • Software characterization process to better understand the capabilities and issues across software sources • Close partnerships between ECP ST, Facilities, and Vendors with the common vision of a successful ECP environment. End State: The integration of ECP software, vendor-provided software, and facility-based software environments to establish supported software stacks that meet application needs and provide for optimized facility operation.
  19. 19. 19 NNSA-ACES NNSA-LC NERSC OLCF ALCF Vendors – HW / SW ECP / Facility Software Integration Relationships Integration of ECP ST products to the sites: • Application dependencies • Continuous Integration - architectures, facilities • Vendor Integration • Testing in production env • Needs from sites and ST feedback Integration of Site / ECP ST products with vendor SW stack: • Operational SW stack • Support structure • Key Vendor integration role • Need solid understanding on ECP SW • Feedback to ECP ST • Busy – HPC Centers to build, users to support
  20. 20. 20 Continuous Integration project • Work with ECP-ST/AD and Facilities to select vendor integration solution • Target Use Cases – Test beds for architectural integration (based at sites) – Site integration testing targeting production environments • Facility based team to integrate system and process. - possibly Vendors. Challenges to work out: • Security integration • Scheduling process • Target test resources • Intrusive testing • Broad SW ecosystem needs • Integration into site production • Local CI integration across projects • Site CI support
  21. 21. 21 Software Characterization Site Stack Characterization Communication Feedback Process • ECP-HI/Facility team to work with ECP-ST to identify and integrate SW stacks • Characterization of Site SW stack / vendor stack concerns • Map ECP Applications -> ECP ST products -> Site ecosystem • SC – ECP App Integration | NNSA – COE’s, other • Deployment: Integrate software products between ECP, Facilities, and possibly Vendors • Coalition across sites to put this together • Unique to sites / common across sites • SW stack makeup • Interface points and process • Vendor Integration • ECP ST Capability • Containerization • Integration into site production • Ecosystem health • Feedback and requests Integration and Deployment of SW project challenges to work out:
  22. 22. 22 Next Steps. Objectives: • Establish a well-oiled integration process utilizing Continuous Integration • Pull together cross-organization teams that understand the SW capabilities available through ECP and Facility Sites, and the need to integrate with Vendor SW stacks. Needed: • Software characterization process to better understand the capabilities and issues across software sources – Facility Resourced Team • Cross-org team to tackle Continuous Integration deployment • Close partnerships between ECP ST, Facilities, and Vendors with the common vision of a successful ECP environment. Challenges and Opportunities: • Continuous Integration comes with feature development contracts, site challenges that include security and resource availability, and multiple existing deployments • Facility Sites are resource constrained, busy building HPC centers. Vendor integration of SW environment requires time • Effort and communication are needed to understand the ECP ST products • Vendors are willing to work to ease software stack integration – could this mean Vendors could work together?
  23. 23. 23 2.4.5 Facility Resource Utilization (Julia White). Primary Objective: • Manage the allocation of awards of computer time to ECP subprojects, report accomplishments, and provide feedback to ECP leadership and Facilities • ECP Resource Allocation Council • Composed of representatives from the ECP L2s and DOE HPC Facilities, the Council meets once a month to discuss requests and track usage (e.g., Titan, Mira, Theta, Cori) • If you have a need for compute time or resources, talk with your L3 or L2 lead, or email: ECPresourceallocation@exascaleproject.org
  24. 24. 24 2.4.6 Developer Training and Productivity (Ashley Barker). Primary Objective: Provide training on key ECP technologies, accelerate the software development cycle, and optimize the productivity of application and software developers. Approach: • Collaborate with the DOE computing facilities and the ECP project teams to determine the most impactful training and productivity activities • Develop and deliver training through a variety of activities such as seminars, webinars, deep dive workshops and lectures, hackathons, and tutorials • Disseminate and transfer knowledge, lessons learned, and best practices across ECP project teams and to the broader HPC community • Develop, promote, and apply productivity tools and methodologies that aid developers • Use performance metrics and a benchmark strategy to improve overall project productivity. The T&P project is split into two efforts: • ECP Training (WBS 2.4.6.01) • ECP Project Developer Productivity (WBS 2.4.6.02) (aka IDEAS-ECP) • Led by Mike Heroux and Lois Curfman McInnes. 1. Where can you find the training announcements and/or archived material from past events? 2. How can you communicate your training needs? 3. How can you offer training on your content?
  25. 25. 25 Communicating ECP T&P Events • ECP training events are communicated through a variety of ways: – Advertised on the external ECP website – ECP newsletter – Communicated via monthly email to members of the ECP project – Added to ECP calendar – Shared through Confluence blog posts – Shared with facilities and other training contacts – IDEAS-ECP events are also emailed to the IDEAS-ECP mailing list • We upload presentations and videos to the ECP website when it makes sense. We also upload videos to the ECP YouTube channel. • In addition, IDEAS-ECP events are also archived on https://ideas-productivity.org. • The training to-date has been a mix of tutorials, webinars, workshops, and hands-on deep dives. ECP Public Website: https://www.exascaleproject.org/
  26. 26. 26 Selected ECP T&P Activities To-Date (Date: Event, # Registered / # Attended). FY17 Q2: Introduction to ECP Project Mgmt Tools (Confluence & JIRA), N/A / N/A. FY17 Q3: Python in HPC, 210 / 132; OpenMP Tutorial, 113 / 60. FY17 Q4: Argonne Training Program on Extreme-Scale Computing 2017, 168 (applied) / 71 (selected); Intermediate Git, 149 / 78; Using the Roofline Model and Intel Advisor, 113 / 71; Barely Sufficient Project Management, 190 / 122; Scalable Node Programming with OpenACC, 75 / 46. FY18 Q1: HipChat Webinar, N/A / N/A; Managing Defects in HPC Software Development, 135 / 63; TAU Performance System Webinar, 53 / 31; Better Scientific Software (BSSw) Webinar, 206 / 108. • Over 780 people have participated in ECP Training activities to date • ECP team members have led additional tutorials and workshops at events such as SC, ISC, and PASC • Have created documentation and best practices covering topics like ECP project management tools. Public training materials are archived on the ECP External Website: http://exascaleproject.org Internal training materials are found on the ECP Confluence Knowledgebase site: https://confluence.exascaleproject.org/display/KB
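The "over 780 people" figure on the slide above can be cross-checked against the per-event attendance numbers. The sketch below is only a tally of the values listed on the slide; the slide does not say how the total was derived, so treating it as a plain sum of the "# Attended" column (with the ATPESC "selected" count standing in for attendance, and N/A events left out) is an assumption.

```python
# Attendance figures transcribed from the "Selected ECP T&P Activities To-Date" slide.
# Events listed as N/A on the slide are omitted. For ATPESC 2017 the slide reports
# "71 (selected)" rather than an attendance count; using it here is an assumption.
attended = {
    "Python in HPC": 132,
    "OpenMP Tutorial": 60,
    "ATPESC 2017 (selected)": 71,
    "Intermediate Git": 78,
    "Using the Roofline Model and Intel Advisor": 71,
    "Barely Sufficient Project Management": 122,
    "Scalable Node Programming with OpenACC": 46,
    "Managing Defects in HPC Software Development": 63,
    "TAU Performance System Webinar": 31,
    "Better Scientific Software (BSSw) Webinar": 108,
}

# Plain sum of the attendance column.
total_attended = sum(attended.values())
print(total_attended)  # 782
```

Under that assumption the listed events add up to 782, which is consistent with the "over 780" statement. The sum counts attendances, not unique individuals.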
  27. 27. 27 2018 Activities Planned To-Date • Performance Portability with Kokkos Bootcamp, January 16-18, 2018 – https://www.exascaleproject.org/event/kokkosbc/ • IDEAS-ECP Webinar: Bringing Best Practices to a Long-Lived Production Code, January 17, 2018 – https://www.exascaleproject.org/event/bp4long-lived_codes/ • 30 unique tutorials are taking place this week at the ECP All-Hands Meeting – Feedback is highly desired • Next IDEAS-ECP webinar is Feb 28 on the topic of Jupyter Notebooks and HPC. – https://www.exascaleproject.org/event/jupyter/ – March Webinar: Eclipse – April Webinar: Software Citation. Webinar materials are available on the ECP website at https://www.exascaleproject.org
  28. 28. 28 Argonne Training Program on Extreme-Scale Computing (ATPESC 2018), extremecomputingtraining.anl.gov. Call closes Feb 28, 2018! Who? Seeking all doctoral students, postdocs, and scientists interested in conducting CS&E research on large-scale computers. What? An intensive two-week program on HPC methodologies applicable to both current and future supercomputers. When? July 29 - August 10, 2018. Where? Q Center, St Charles, IL (USA). How to apply? Visit ATPESC website - statement of purpose, CV & letter of recommendation needed. 60-70 participants. 100 h of lectures & hands-on. $0, no cost to attend: no fees to participate; domestic airfare, meals and lodging provided. $1.25M 2016-2018.
  29. 29. 29 FY18 - FY19 Plans • FY18, FY19 – Work with AD, ST, HI, and facilities to determine the highest priority training needs for the upcoming FY – Using the information above, develop training plan for the next FY – Work with ECP teams, facilities, laboratories, vendors, and the broader HPC community to offer at minimum two training events per quarter in each FY – Make training materials available online whenever possible – Study your feedback from this meeting to improve tutorials for next year
  30. 30. 30 IDEAS-ECP: Advancing Software Productivity for Exascale Applications (www.ideas-productivity.org). Software Challenges: Exploit massive on-node concurrency and handle disruptive architectural changes while working toward predictive simulations that couple physics, scales, analytics, and more. Goal: Improve ECP developer productivity and software sustainability while ensuring continued scientific success. Strategy: In collaboration with ECP community: • Customize and curate methodologies for application productivity and sustainability • Create an ECP Application Development Kit of customizable resources for improving scientific software development • Partner with ECP teams on software improvements • Training and outreach in partnership with DOE facilities • Interview, analyze, prototype, test, revise, deploy. Repeat. • Realistic: There is a cost. – Startup: Overhead – Payoff: Best if soon, clear • Working model: – Help you be more productive via consulting, teaming. You own it. – Develop content for broad use, e.g., issue management strategies. [Figure: cost versus progress from start to finish, old process compared with new process]
  31. 31. 31 Productivity and Sustainability Improvement Plans (PSIPs): Framework for improving scientific software quality The IDEAS Productivity and Sustainability Improvement Planning (PSIP) process is an iterative, incremental, repeatable process for enabling ECP app teams to improve software quality and achieve science goals. Next steps: • Disciplined application of PSIP • Grow body of practice improvement resources Focus: Needs of aggregate teams: composed of multiple successful previously existing teams, where software is a primary means of collaboration. See PSIP-Tools repo: https://betterscientificsoftware.github.io/PSIP-Tools Progress Tracking Card: Test Coverage ECP app aggregate team Team F Team B Team C Team A Team D Team E Astro, CANDLE, EXAALT, NWChemEx, MARBL, QMCPACK, SPARC, MPICH, UnifyCR, VeloC
  32. 32. 32 WhatIs and HowTo docs. Motivation: CSE teams have a wide range of maturity in software engineering practices. Resources • What Is docs: 2-page characterizations of important topics for CSE software projects • How To docs: brief sketch of best practices – Emphasis on "bite-sized" topics enables software teams to consider improvements at a small but impactful scale • Current topics: (more under development) – What Is CSE Software Productivity? – What Is Software Configuration? – How to Configure Software – What Is Performance Portability? – How to Enable Performance Portability – What Is CSE Software Testing? – What Are Software Testing Practices? – How to Add and Improve Testing in a CSE Software Project – What Is Good Documentation? – How to Write Good Documentation – What Are Interoperable Software Libraries? – What Is Version Control? – How to Do Version Control with Git. Impact: Baseline nomenclature and foundation for next steps in software quality for CSE teams. https://ideas-productivity.org/resources/howtos
  33. 33. 33 Resources for software productivity & sustainability: key element of overall scientific productivity Tutorials • Enhancing Software Productivity with a Team of Teams Approach (Feb 8, 3:30 pm) Prior Sessions: • On-demand Learning for Better Scientific Software: How to Use Resources & Technology to Optimize your Productivity • What All Codes Should Do: Overview of Best Practices in HPC Software Development • Introduction to Git • Better (Small) Scientific Software Teams • Improving Reproducibility through Better Software Practices • Testing and Verification • Code Coverage and Continuous Integration • An Introduction to Software Licensing Webinar Series: Best Practices for HPC Software Developers • Jupyter Notebooks (Feb 2018) • Eclipse for HPC (March 2018) Prior Sessions: • Bringing Best Practices to a Long Lived Production Code • Introducing the Better Scientific Software Site • Managing Defects in HPC Software • Barely Sufficient Project Management: A few techniques for improving your scientific software efforts • Using the Roofline Model and Intel Advisor • Intermediate Git • Python in HPC And more resources … Coming soon to https://bssw.io … new web-based hub for scientific software improvement exchange Impact: Helping science teams to achieve: Better: Science, portability, robustness, composability Faster: Execution, development, dissemination Cheaper: Fewer staff hours and lines of code Slides and videos available via https://ideas-productivity.org/events And more …. Suggestions welcome!
  34. 34. 34 User Stories: Flexible means of conveying requirements garnered from interviews, PSIPs, and informal interactions. Action Tasks for User Story: ● Create a document that details "design patterns for Git workflows" ● Create Transmedia Learning Framework (TLF) template ● Develop curated content on BSSw ● Provide webinar: “On-demand Learning for Better Scientific Software: How to Use Resources & Technology to Optimize your Productivity" ● Provide assistance to ECP community with developing and using TLFs Training & Documentation As a casual user of GitHub, I want more GitHub tutorials and tips so that it becomes easier for me to recall the functionality. Operational As a participant in the CSE software engineering community, I want a documented process for contributing to the bssw.io website so that I can add my knowledge to the site in an efficient way. Software Integration & Testing As an application architect, I want to better understand version control capabilities that allow integration of independently developed components so that we can distribute a coherent software stack. As an ECP developer, I want documentation and training in setting up automated testing for my package as well as using my package testing within the ECP CI system so that I can reliably determine and regularly track that various of my package branches compile and minimal tests pass in all configurations and machines relevant to other ECP users Software Quality As a person responsible for software quality and correctness for my project, I want guidance on selecting and implementing coding standards so that we can make our code easy for everyone to read and understand. Practices & Standards Software Req. & Dev. 
As a software engineer in HPC, I want to connect test development to software design so that intertwined dependencies that get in the way of building stand-alone tests can be minimized. Interviews and PSIPs, User Stories, Develop material, Outreach and dissemination, Feedback and refinement.
  35. 35. exascaleproject.org Accelerating the development of a capable exascale computing ecosystem
