
Toward a National Research Platform

Invited Presentation
Open Science Grid All Hands Meeting
Salt Lake City, UT
March 20, 2018


Toward a National Research Platform

  1. "Toward a National Research Platform." Invited Presentation, Open Science Grid All Hands Meeting, Salt Lake City, UT, March 20, 2018. Dr. Larry Smarr, Director, California Institute for Telecommunications and Information Technology; Harry E. Gruber Professor, Dept. of Computer Science and Engineering, Jacobs School of Engineering, UCSD. http://lsmarr.calit2.net
  2. 30 Years Ago (1985/86), NSF Brought a DOE HPC Center Model to University Researchers: NCSA Was Modeled on LLNL; SDSC Was Modeled on MFEnet.
  3. I-WAY: Information Wide Area Year, Supercomputing '95, UIC
     • The First National Telecom-Interconnected 155 Mbps Research Network
       – 65 Science Projects
       – Into the San Diego Convention Center
     • I-WAY Featured:
       – Networked Visualization Applications
       – Large-Scale Immersive Displays
       – I-Soft Programming Environment
       – Led to the Globus Project
     http://archive.ncsa.uiuc.edu/General/Training/SC95/GII.HPCC.html
     See talk by: Brian Bockelman
  4. NSF's PACI Program Was Built on the vBNS to Prototype America's 21st Century Information Infrastructure: The PACI Grid Testbed, National Computational Science (1997). The vBNS led to the key role of Miron Livny & Condor.
  5. UCSD Has Been Working Toward PRP for Over 15 Years: NSF OptIPuter, Quartzite, and Prism Awards (PI Smarr, 2002-2009; PI Papadopoulos, 2004-2007; PI Papadopoulos, 2013-2015). Precursors to DOE Defining the Science DMZ in 2010.
  6. Based on Community Input and on ESnet's Science DMZ Concept, NSF Has Funded Over 100 Campuses to Build DMZs
     Map legend: Red = 2012 CC-NIE Awardees; Yellow = 2013 CC-NIE Awardees; Green = 2014 CC*IIE Awardees; Blue = 2015 CC*DNI Awardees; Purple = Multiple-Time Awardees
     Source: NSF. NSF Program Officer: Kevin Thompson
  7. Logical Next Step: The Pacific Research Platform Networks Campus DMZs to Create a Regional End-to-End Science-Driven "Big Data Superhighway" System
     NSF CC*DNI Grant, $5M, 10/2015-10/2020
     PI: Larry Smarr, UC San Diego Calit2
     Co-PIs:
     • Camille Crittenden, UC Berkeley CITRIS
     • Tom DeFanti, UC San Diego Calit2/QI
     • Philip Papadopoulos, UCSD SDSC
     • Frank Wuerthwein, UCSD Physics and SDSC
     Letters of Commitment from:
     • 50 Researchers from 15 Campuses
     • 32 IT/Network Organization Leaders
     NSF Program Officer: Amy Walton. Source: John Hess, CENIC
  8. Note That the OSG Cluster Map Has Major Overlap with the NSF-Funded DMZ (NSF CC* Grants) Map. Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  9. Bringing OSG Software and Services to a Regional-Scale DMZ. Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  10. Key PRP Innovation: UCSD-Designed FIONAs Solve the Disk-to-Disk Data Transfer Problem at Full Speed on 10/40/100G Networks
      Big Data Science Data Transfer Nodes (DTNs): Flash I/O Network Appliances (FIONAs)
      • FIONA PCs [a.k.a. ESnet DTNs]: ~$8,000 Big Data PC with:
        – 1 CPU
        – 10/40 Gbps Network Interface Cards
        – 3 TB SSDs or 100+ TB Disk Drives
      • Extensible for Higher Performance:
        – +NVMe SSDs for 100 Gbps Disk-to-Disk
        – +Up to 8 GPUs [4M GPU Core-Hours/Week]
        – +Up to 160 TB of Disk for Data Posting
        – +Up to 38 Intel CPUs
      • $700 10 Gbps FIONAs Being Tested
      • FIONettes are $270 FIONAs: 1 Gbps NIC with USB-3 for Flash Storage or SSD
      FIONAS: 10/40G, $8,000. FIONette: 1G, $250.
      Source: Phil Papadopoulos, SDSC; Tom DeFanti, Joe Keefe & John Graham, Calit2
  11. We Measure Disk-to-Disk Throughput with a 10GB File Transfer Using Globus GridFTP, 4 Times Per Day, in Both Directions, for All PRP Sites. From the Start of Monitoring (January 29, 2016) to July 21, 2017, We Grew from 12 DTNs to 24 DTNs Connected at 10-40G in 1.5 Years. Source: John Graham, Calit2/QI
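For flavor, below is a minimal sketch of the kind of timed 10 GB transfer such monitoring performs, assuming the classic globus-url-copy client with a few parallel streams; the hostnames, paths, and stream count are placeholders, and the actual PRP measurement harness and dashboards are not shown.

```python
#!/usr/bin/env python3
"""Minimal sketch of a scheduled disk-to-disk throughput probe.

Assumptions (not from the slide): the source/destination URLs and the use of
globus-url-copy with 4 parallel streams are illustrative only.
"""
import subprocess
import time

SRC = "file:///data/test10GB.dat"                        # hypothetical 10 GB test file
DST = "gsiftp://dtn.example-site.edu/data/test10GB.dat"  # hypothetical remote DTN

def timed_transfer(src: str, dst: str) -> float:
    """Run one GridFTP transfer and return the achieved throughput in Gbps."""
    start = time.time()
    subprocess.run(["globus-url-copy", "-p", "4", "-vb", src, dst], check=True)
    elapsed = time.time() - start
    bits = 10 * 8 * 1e9   # 10 GB payload expressed in bits
    return bits / elapsed / 1e9

if __name__ == "__main__":
    print(f"disk-to-disk throughput: {timed_transfer(SRC, DST):.2f} Gbps")
```

Running such a probe from cron in both directions between every pair of DTNs yields the mesh of measurements plotted on the slide.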
  12. PRP's First 2 Years: Connecting Multi-Campus Application Teams and Devices (application domains include Earth Sciences).
  13. PRP Over CENIC Couples the UC Santa Cruz Astrophysics Cluster to the LBNL NERSC Supercomputer. CENIC 2018 Innovations in Networking Award for Research Applications.
  14. A 100 Gbps FIONA at UCSC Allows Downloads to the UCSC Hyades Cluster from the LBNL NERSC Supercomputer for DESI Science Analysis
      • 300 images per night, 100 MB per raw image, 120 GB per night
      • 250 images per night, 530 MB per raw image, 800 GB per night
      Precursors to LSST and NCSA. NSF-Funded Cyberengineer Shaw Dong @UCSC Receiving FIONA, Feb 7, 2017.
      Source: Peter Nugent, LBNL, Professor of Astronomy, UC Berkeley
  15. Jupyter Has Become the Digital Fabric for Data Sciences: PRP Creates a UC-JupyterHub Backbone. Goal: Jupyter Everywhere. Source: John Graham, Calit2
  16. LHCOne Traffic Growth Is Large Now But Will Explode in 2026
      • 31 Petabytes in January 2018, a +38% Change Within the Last Year
      • The LHC Accounts for 47% of Total ESnet Traffic Today
      • Dramatic Data-Volume Growth Expected for the HL-LHC in 2026
      Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  17. Data Transfer Rates from a 40 Gbps DTN in the UCSD Physics Building, Across Campus on the PRISM DMZ, Then to Chicago's Fermilab Over CENIC/ESnet. Based on This Success, Würthwein Will Upgrade the 40G DTN to 100G for Bandwidth Tests and Kubernetes Integration with OSG, Caltech, and UCSC. Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  18. LHC Data Analysis Running on PRP. Two Projects:
      • OSG Cluster-in-a-Box for "T3"
      • Distributed XRootD Cache for "T2"
      Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  19. First Steps Toward Integrating OSG and PRP: Tier-3 "Cluster-in-a-Box". Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  20. PRP Distributed Tier-2 Cache Across Caltech & UCSD
      • Architecture: a top-level cache redirector sits above per-site redirectors at UCSD and Caltech, each fronting multiple cache servers; the caches plug into the Global Data Federation of CMS.
      • Applications can connect at the local or top-level cache redirector, so the system can be tested as individual or joint caches.
      • Provisioned pilot systems:
        – PRP UCSD: 9 systems, each with 12 x 2 TB SATA disks @ 10 Gbps
        – PRP Caltech: 2 systems, each with 30 x 6 TB SATA disks @ 40 Gbps
      • Production use (UCSD only): I/O in production is limited by the number of applications hitting the cache and their I/O patterns.
      Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
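To make the redirector hierarchy concrete, here is a minimal sketch of pulling a file through such a cache with the standard XRootD client tools; the redirector hostname and the /store path are hypothetical placeholders, and real CMS jobs normally open files through their analysis framework rather than copying them.

```python
#!/usr/bin/env python3
"""Minimal sketch of fetching a file through an XRootD cache redirector.

Assumptions (not from the slide): hostname and logical file name are
placeholders; requires the xrdcp command from the XRootD client tools.
"""
import subprocess

REDIRECTOR = "xcache-redirector.example.org:1094"  # hypothetical top-level cache redirector
LFN = "/store/data/example/file.root"              # hypothetical logical file name

def fetch_via_cache(lfn: str, dest: str = ".") -> None:
    # On a cache miss the redirector routes the request to a cache server,
    # which pulls the file from the CMS data federation and keeps a copy;
    # later reads of the same file are served from the local cache.
    url = f"root://{REDIRECTOR}/{lfn}"
    subprocess.run(["xrdcp", "-f", url, dest], check=True)

if __name__ == "__main__":
    fetch_via_cache(LFN)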
  21. Game Changer: Using Kubernetes to Manage Containers Across the PRP
      • "Kubernetes is a way of stitching together a collection of machines into, basically, a big computer." --Craig McLuckie, Google, now CEO and Founder of Heptio
      • "Everything at Google runs in a container." --Joe Beda, Google
      • "Kubernetes has emerged as the container orchestration engine of choice for many cloud providers, including Google, AWS, Rackspace, and Microsoft, and is now being used in HPC and Science DMZs." --John Graham, Calit2/QI, UC San Diego
      See talk by: Rob Gardner
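As a small illustration of the "one big computer" idea, the sketch below uses the official Kubernetes Python client to list every node the scheduler can place containers on, with its CPU and GPU capacity; it assumes you have a kubeconfig granting access to a PRP-style cluster, which is not something the slide provides.

```python
#!/usr/bin/env python3
"""Minimal sketch: enumerate the machines stitched together by Kubernetes.

Assumptions (not from the slide): a valid ~/.kube/config and the official
kubernetes Python client (pip install kubernetes).
"""
from kubernetes import client, config

def list_capacity() -> None:
    config.load_kube_config()       # reads ~/.kube/config
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        cap = node.status.capacity  # dict of schedulable resources per node
        print(node.metadata.name,
              "cpu:", cap.get("cpu"),
              "gpu:", cap.get("nvidia.com/gpu", "0"))

if __name__ == "__main__":
    list_capacity()
```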
  22. Distributed Computation on the PRP Nautilus HyperCluster: Coupling the SDSU Cluster and SDSC Comet Using Kubernetes Containers
      • Simulating the Injection of CO2 into Brine-Saturated Reservoirs: Poroelastic and Pressure-Velocity Fields Solved in Parallel with MPI, Using Domain Decomposition Across Containers
      • Domain: 0.5 km x 0.5 km x 17.5 m, with three sandstone layers separated by two shale layers
      • Developed and executed an MPI-based 100-year [CO2,aq] simulation on the PRP Kubernetes cluster in 4 days (snapshots shown at 25, 75, and 100 years)
      Source: Chris Paolini and Jose Castillo, SDSU
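To illustrate what "domain decomposition across containers" means in practice, here is a toy mpi4py sketch: each MPI rank (one per container) owns a slab of the grid and exchanges halo cells with its neighbors each step. It is a 1-D diffusion stand-in for the SDSU poroelastic/reactive-transport solver, and the grid size, step count, and boundary values are illustrative only.

```python
#!/usr/bin/env python3
"""Toy 1-D domain decomposition with halo exchange (run: mpirun -n 4 python decomp.py).

Assumptions (not from the slide): this stencil is a placeholder for the real
solver; requires numpy and mpi4py.
"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1000                      # global grid points (toy value)
local_n = N // size           # each rank owns one slab of the domain
u = np.zeros(local_n + 2)     # +2 ghost cells for halo exchange

for step in range(500):
    if rank == 0:
        u[1] = 1.0            # toy Dirichlet boundary on the left edge
    # exchange halo cells with neighboring ranks (containers on other nodes)
    if rank > 0:
        comm.Sendrecv(sendbuf=u[1:2], dest=rank - 1,
                      recvbuf=u[0:1], source=rank - 1)
    if rank < size - 1:
        comm.Sendrecv(sendbuf=u[-2:-1], dest=rank + 1,
                      recvbuf=u[-1:], source=rank + 1)
    # explicit diffusion update on the interior of this rank's slab
    u[1:-1] += 0.25 * (u[2:] - 2 * u[1:-1] + u[:-2])

local_max = np.array([u[1:-1].max()])
global_max = np.zeros(1)
comm.Reduce(local_max, global_max, op=MPI.MAX, root=0)
if rank == 0:
    print("max concentration proxy:", float(global_max[0]))
```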
  23. Rook Is Ceph Cloud-Native Object Storage 'Inside' Kubernetes. https://rook.io/ Source: John Graham, Calit2/QI. See talk by: Shawn McKee
  24. FIONA8: Adding GPUs to FIONAs Supports Data Science Machine Learning
      • Multi-Tenant Containerized GPU JupyterHub Running Kubernetes / CoreOS
      • Eight Nvidia GTX-1080 Ti GPUs, 32 GB RAM, 3 TB SSD, 40G & Dual 10G Ports, ~$13K
      Source: John Graham, Calit2
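For a user of such a multi-tenant GPU JupyterHub, a first sanity check is simply asking which GPUs the spawned container can see. The sketch below assumes the container image ships PyTorch with CUDA support, which the slide itself does not specify.

```python
#!/usr/bin/env python3
"""Minimal sketch: report the GPUs visible inside a JupyterHub-spawned container.

Assumption (not from the slide): PyTorch with CUDA is installed in the image.
"""
import torch

def report_gpus() -> None:
    if not torch.cuda.is_available():
        print("No GPU visible to this container")
        return
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # e.g. a GTX 1080 Ti with ~11 GiB of memory on a FIONA8
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")

if __name__ == "__main__":
    report_gpus()
```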
  25. Nautilus: A Multi-Tenant Containerized PRP HyperCluster for Big Data Applications, Running Kubernetes (CentOS 7) with Rook/Ceph Cloud-Native Storage and GPUs for Machine Learning (March 2018)
      • Cluster diagram sites: Calit2, SDSC, SDSU, Caltech, UCAR, UCI, UCR, USC, UCLA, Stanford, UCSB, UCSC, and Hawaii, with FIONA8 nodes, 40G SSD (3 TB) and 100G NVMe (6.4 TB) FIONAs, a 100G Epyc NVMe node, 100G Gold NVMe nodes, and sdx-controller / controller-0 management nodes
      • Rook/Ceph provides Block/Object/FS storage with a Swift API compatible with SDSC, AWS, and Rackspace
      Source: John Graham, Calit2/QI
  26. Running Kubernetes/Rook/Ceph on PRP Allows Us to Deploy a Distributed PB+ of Storage for Posting Science Data (March 2018)
      • Same Nautilus site layout as the previous slide, with 40G 160 TB storage FIONAs at UCAR, UCR, USC, UCLA, Stanford, UCSB, UCSC, and Hawaii alongside the 100G NVMe (6.4 TB) nodes at SDSU, Caltech, and UCSC
      • Rook/Ceph provides Block/Object/FS storage with a Swift API compatible with SDSC, AWS, and Rackspace
      Source: John Graham, UCSD
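As a concrete illustration of "posting science data" against the Swift-compatible object API mentioned on the last two slides, here is a minimal sketch using python-swiftclient; the gateway URL, credentials, container name, and file names are placeholders, not PRP values, and real users would obtain credentials from the cluster operators.

```python
#!/usr/bin/env python3
"""Minimal sketch: post a data file to a Rook/Ceph object store via its Swift API.

Assumptions (not from the slides): endpoint, credentials, and container are
placeholders; requires python-swiftclient (pip install python-swiftclient).
"""
import swiftclient

# Hypothetical Swift (v1 auth) endpoint exposed by the Ceph RADOS Gateway
conn = swiftclient.Connection(
    authurl="https://rgw.example.org:8080/auth/v1.0",
    user="science-project:datauser",
    key="replace-with-secret-key",
)

CONTAINER = "posted-datasets"  # hypothetical container name

def post_file(local_path: str, object_name: str) -> None:
    """Upload one file so collaborators at other sites can fetch it."""
    conn.put_container(CONTAINER)  # no-op if the container already exists
    with open(local_path, "rb") as fh:
        conn.put_object(CONTAINER, object_name, contents=fh)

if __name__ == "__main__":
    post_file("results.h5", "experiment-01/results.h5")
```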
  27. Collaboration Opportunity with OSG & PRP on Distributed Storage
      • OSG Is Operating a Distributed Caching CI; at Present, 4 Caches Provide Significant Use and Dominate the Total Data Volume Pulled Last Year (1.8 PB, 1.6 PB, 1.2 PB, and 210 TB)
      • PRP Kubernetes Infrastructure Could Either Grow Existing Caches by Adding Servers or Add Additional Locations
      • StashCache Users Include LIGO and DES
      See talks by: Alex Feltus, Derek Weitzel, Marcelle Soares-Santos
      Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
  28. New NSF CHASE-CI Grant Creates a Community Cyberinfrastructure: Adding a Machine Learning Layer Built on Top of the Pacific Research Platform. NSF Grant for a High-Speed "Cloud" of 256 GPUs for 30 ML Faculty and Their Students at 10 Campuses (Caltech, UCB, UCI, UCR, UCSD, UCSC, Stanford, MSU, UCM, SDSU) for Training AI Algorithms on Big Data. NSF Program Officer: Mimi McClure
  29. UCSD Adding >350 Game GPUs to Its Data Sciences Cyberinfrastructure, Devoted to Data Analytics and Machine Learning
      • SunCAVE: 70 GPUs
      • WAVE + Vroom: 48 GPUs
      • FIONAs with 8 Game GPUs: 88 GPUs for Students and 48 GPUs for OSG Applications
      • CHASE-CI Grant Provides 96 GPUs at UCSD for Training AI Algorithms on Big Data
  30. Next Step: Surrounding the PRP Machine Learning Platform with Clouds of GPUs and Non-Von Neumann Processors
      • Microsoft Installs Altera FPGAs into Bing Servers & 384 into TACC for Academic Access
      • CHASE-CI 64-TrueNorth Cluster
      • 64-bit GPUs: 4352x NVIDIA Tesla V100 GPUs
      See talk by: Hurtado Anampa
  31. PRP Is Partnering with NSF Grants Supporting Advanced Cyberinfrastructure Facilitators to Explore PRP Extension Toward an NRP
      • ACI-REF (Jim Bottum, Principal Investigator; Tom Cheatham, Chair of Campus PIs) is PRP-connected and has also spawned the 35-member Campus Research Computing Consortium (CaRCC), funded by the NSF as a Research Coordination Network (RCN)
      • CaRCC is dedicated to sharing best practices, expertise, and resources, enabling the advancement of campus-based research computing activities across the nation
      See talk by: Tom Cheatham
  32. Expanding to the Global Research Platform via CENIC/Pacific Wave, Internet2, and International Links
      • PRP's Current International Partners: Netherlands, Guam, Australia, Korea, Japan, Singapore
      • Korea Shows Distance Is Not the Barrier to Above-5 Gb/s Disk-to-Disk Performance
  33. The Second National Research Platform Workshop, Bozeman, MT, August 6-7, 2018
      • A follow-up FIONA workshop will be held as a lead-in to the 2nd NRP workshop in Bozeman, starting August 2nd. While the workshop will be open to the community, there is a specific focus on EPSCoR-affiliated and minority-serving institutions.
      • Co-Chairs: Larry Smarr, Calit2; Inder Monga, ESnet; Ana Hunsinger, Internet2
      • Local Host: Jerry Sheehan, MSU
  34. Our Support:
      • US National Science Foundation (NSF) awards CNS-0821155, CNS-1338192, CNS-1456638, CNS-1730158, ACI-1540112, & ACI-1541349
      • University of California Office of the President CIO
      • UCSD Chancellor's Integrated Digital Infrastructure Program
      • UCSD Next Generation Networking Initiative
      • Calit2 and Calit2 Qualcomm Institute
      • CENIC, Pacific Wave, and StarLight
      • DOE ESnet
