“A Campus-Scale High Performance Cyberinfrastructure
is Required for Data-Intensive Research”

Keynote Presentation
CENIC 2013, Held at Calit2@UCSD
March 11, 2013

Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor, Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
“Blueprint for the Digital University” -- Report of the UCSD
Research Cyberinfrastructure Design Team (April 2009)

• A Five Year Process Begins Pilot Deployment This Year
• No Data Bottlenecks -- Design for Gigabit/s Data Flows

See talk on RCI by Richard Moore, Today at 4pm

research.ucsd.edu/documents/rcidt/RCIDTReportFinal2009.pdf
Calit2 Sunlight OptIPuter Exchange
Connects 60 Campus Sites, Each Dedicated at 10Gbps

Maxine Brown, EVL, UIC -- OptIPuter Project Manager
Rapid Evolution of 10GbE Port Prices
Makes Campus-Scale 10Gbps CI Affordable

• Port Pricing is Falling (see the back-of-envelope sketch below)
• Density is Rising -- Dramatically
• Cost of 10GbE Approaching Cluster HPC Interconnects

Chart, 2005-2013 (approximate price per 10GbE port): Chiaro ~$80K/port (60 ports max);
Force 10 ~$5K/port (40 ports max); Arista ~$500/port (48 ports);
~$400/port for 48 ports today, with 576-port systems in 2013

Source: Philip Papadopoulos, SDSC/Calit2
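As a rough sanity check on the chart, the sketch below uses only the endpoint prices quoted on the slide ($80K/port in 2005, ~$400/port in 2013) and treats the vendor/year pairings as approximate; it computes the overall drop and the implied annual rate of decline.

```python
# Back-of-envelope: implied annual decline in 10GbE port price,
# using the endpoint figures from the slide ($80K/port in 2005,
# ~$400/port in 2013). Vendor/year pairings are approximate.

start_price, end_price = 80_000, 400   # USD per 10GbE port
years = 2013 - 2005

overall_drop = start_price / end_price                  # ~200x cheaper
annual_factor = (end_price / start_price) ** (1 / years)

print(f"Overall drop: {overall_drop:.0f}x over {years} years")
print(f"Implied annual price decline: {(1 - annual_factor):.0%} per year")
# -> roughly a 200x drop, i.e. about 48% cheaper every year
```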
Arista Enables SDSC’s Massively Parallel
 10G Switched Data Analysis Resource


Partnering Opportunities with NSF:
SDSC’s Gordon, Dedicated Dec. 5, 2011
• Data-Intensive Supercomputer Based on
  SSD Flash Memory and Virtual Shared Memory SW
  – Emphasizes MEM and IOPS over FLOPS
  – Supernode has Virtual Shared Memory:
     – 2 TB RAM Aggregate
     – 8 TB SSD Aggregate
  – Total Machine = 32 Supernodes
     – 4 PB Disk Parallel File System >100 GB/s I/O
• System Designed to Accelerate Access
  to Massive Datasets being Generated in
  Many Fields of Science, Engineering, Medicine,
  and Social Science

Source: Mike Norman and Allan Snavely, SDSC
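A quick bit of arithmetic from the per-supernode figures above gives the machine-wide memory hierarchy these bullets imply; this is a minimal sketch based only on the slide's numbers, and the exact production configuration may differ.

```python
# Rough totals implied by the slide's per-supernode figures:
# 2 TB RAM and 8 TB flash per supernode, 32 supernodes per machine.

ram_per_supernode_tb = 2
ssd_per_supernode_tb = 8
supernodes = 32

total_ram_tb = ram_per_supernode_tb * supernodes    # 64 TB DRAM machine-wide
total_ssd_tb = ssd_per_supernode_tb * supernodes    # 256 TB flash machine-wide

print(f"Aggregate RAM: {total_ram_tb} TB")
print(f"Aggregate SSD: {total_ssd_tb} TB")
```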
Gordon Bests the Previous Record
for I/O Operations per Second by 25x
Creating a “Big Data Freeway” System
Connecting Instruments, Computers, & Storage

PRISM@UCSD
Phil Papadopoulos, PI; Larry Smarr, co-PI
Start Date: 1/1/13

See talk on PRISM by Phil P., Tomorrow at 9am
Many Disciplines Beginning to Need
           Dedicated High Bandwidth on Campus
           How to Terminate a CENIC 100G Campus Connection


• Remote Analysis of Large Data Sets
   – Particle Physics
• Connection to Remote Campus Compute & Storage Clusters
   – Ocean Observatory
   – Microscopy and Next Gen Sequencers
• Providing Remote Access to Campus Data Repositories
   – Protein Data Bank and Mass Spectrometry
• Enabling Remote Collaborations
   – National and International
PRISM@UCSD Enables
Remote Analysis of Large Data Sets
CERN’s CMS Detector is
One of the World’s Most Complex Scientific Instruments




        See talk on LHC 100G Networks by Azher Mughal, Caltech
                            Today at 10am
CERN’s CMS Experiment
Generates Massive Amounts of Data
UCSD is a Tier-2 LHC Data Center




  Source: Frank Wuerthwein, Physics UCSD
Flow Out of CERN for CMS Detector
        Peaks at 32 Gbps!




   Source: Frank Wuerthwein, Physics UCSD
CMS Flow Into Fermi Lab
    Peaks at 10Gbps




Source: Frank Wuerthwein, Physics UCSD
CMS Flow into UCSD Physics
    Peaks at 2.4 Gbps




Source: Frank Wuerthwein, Physics UCSD
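To put the three peak rates above in campus-network terms, the short sketch below converts each into an approximate data volume per day if the peak were sustained; this is an idealized, illustrative calculation, not a measured transfer.

```python
# Illustrative volumes if the peak rates on the preceding slides were
# sustained: even the 2.4 Gbps flow into UCSD Physics fills terabytes
# per day, and the 32 Gbps CERN peak exceeds a single 10GbE link.

def tb_per_day(gbps: float) -> float:
    """Convert a sustained rate in Gb/s to TB transferred per day."""
    seconds_per_day = 86_400
    return gbps * seconds_per_day / 8 / 1_000   # Gb -> GB -> TB

for name, rate in [("CERN egress peak", 32.0),
                   ("Fermilab ingress peak", 10.0),
                   ("UCSD Physics ingress peak", 2.4)]:
    print(f"{name:26s} {rate:5.1f} Gbps  ~{tb_per_day(rate):6.0f} TB/day if sustained")
```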
The Open Science Grid
     A Consortium of Universities and National Labs




Open for all of science, including
biology, chemistry, computer science,
engineering, mathematics, medicine, and physics

                   Source: Frank Wuerthwein, Physics UCSD
Planning for Climate Change in California:
Substantial Shifts on Top of Already High Climate Variability

Dan Cayan
USGS Water Resources Discipline
Scripps Institution of Oceanography, UC San Diego

With much support from Mary Tyree, Mike Dettinger, Guido Franco, and other colleagues

Sponsors: California Energy Commission, NOAA RISA program, California DWR, DOE, NSF
UCSD Campus Climate Researchers Need to Download
Results from Remote Supercomputer Simulations

Figure: Greenhouse Gas Emissions and Concentration, CMIP3 GCMs

Source: Dan Cayan, SIO UCSD
Global to Regional Downscaling:
GCMs (~150 km) Downscaled to Regional Models (~12 km)

Many simulations: IPCC AR4 and IPCC AR5 GCMs have been downscaled
using statistical methods.

Increasing Volume of Climate Simulations

In comparison to the 4th IPCC (CMIP3) GCMs, the latest-generation CMIP5 models provide:
   • More Simulations
   • Higher Spatial Resolution
   • More Developed Process Representation
   • More Available Daily Output
(A rough grid-count comparison follows below.)

Source: Dan Cayan, SIO UCSD
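One rough way to see why downscaling inflates data volume: going from ~150 km to ~12 km grid spacing multiplies the number of horizontal grid cells per snapshot by roughly the square of the resolution ratio. This is a simplified count that ignores vertical levels and time resolution.

```python
# Rough factor by which grid-cell counts grow when ~150 km GCM output
# is downscaled to ~12 km regional grids (horizontal cells only).

gcm_res_km = 150
regional_res_km = 12

cells_factor = (gcm_res_km / regional_res_km) ** 2
print(f"~{cells_factor:.0f}x more horizontal grid cells per snapshot")
# -> ~156x, before counting more simulations and daily output (CMIP5)
```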
Average Summer Afternoon Temperature: GFDL A2 Scenario Downscaled to 1 km

Hugo Hidalgo, Tapash Das, Mike Dettinger
How Much California Snow Loss?
Initial Projections Indicate Substantial Reduction in Snow Water for the Sierra Nevada

Declining April 1 SWE:
   2050 median SWE ~ 2/3 of historical median
   2100 median SWE ~ 1/3 of historical median
PRISM@UCSD Enables
Connection to Remote Campus Compute & Storage Clusters
The OOI CI is Built on Dedicated 10GE
and Serves Research, Education, and the Public




         Source: Matthew Arrott, John Orcutt OOI CI
Reused Undersea Optical Cables
Form a Part of the Ocean Observatories




        Source: John Delaney UWash OOI
OOI CI is Built on Dedicated Optical Networks and
          Federal Agency & Commercial Clouds
  Source: John Orcutt,
Matthew Arrott, SIO/Calit2
OOI CI Team at Scripps Institution of Oceanography
 Needs Connection to Its Server Complex in Calit2
Ultra High Resolution Microscopy Images
Created at the National Center for Microscopy and Imaging Research
Microscopes Are Big Data Generators –
Driving Software & Cyberinfrastructure Development
Zeiss Merlin 3View w/ 32k x 32k Scanning and Automated Mosaicing:
   Current: 1-2 TB/week → soon 12 TB/week

JEOL-4000EX w/ 8k x 8k CCD, Automated Mosaicing, and Serial Tomography:
   Current: 1 TB/week

FEI Titan w/ 4k x 4k STEM, EELS, 4k x 3.5k DDD, 4k x 4k CCD,
Automated Mosaicing, and Multi-tilt Tomography:
   Current: 1 TB/week

200-500 TB/year Raw → >2 PB/year Aggregate

Source: Mark Ellisman, School of Medicine, UCSD
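To connect these instrument rates to the campus-network argument, the sketch below estimates how long moving a week of output takes at 1, 10, and 100 Gbps, assuming an idealized, fully dedicated link; protocol overhead and storage limits are ignored.

```python
# How long moving a week of microscope output takes at different link
# speeds, using the slide's projected 12 TB/week for the Zeiss Merlin.
# Idealized: assumes the link is fully dedicated to the transfer.

def hours_to_transfer(terabytes: float, gbps: float) -> float:
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9) / 3600

weekly_tb = 12
for link_gbps in (1, 10, 100):
    print(f"{weekly_tb} TB over {link_gbps:>3} Gbps: "
          f"~{hours_to_transfer(weekly_tb, link_gbps):.1f} hours")
# 1 Gbps: ~26.7 h (most of a day); 10 Gbps: ~2.7 h; 100 Gbps: ~0.3 h
```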
NIH National Center for Microscopy & Imaging Research:
Integrated Infrastructure of Shared Resources

Diagram: shared infrastructure linking scientific instruments, local School of Medicine
(SOM) infrastructure, and end-user workstations

Source: Steve Peltier, Mark Ellisman, NCMIR
Agile System that Spans Resource Classes
SDSC Gordon Supercomputer Analysis
of LS Gut Microbiome Displayed on Calit2 VROOM
                      See Live Demo
                  on Calit2 to CICESE 10G
                      Weds at 8:30am




           Calit2 VROOM-FuturePatient Expedition
PRISM@UCSD Enables
Providing Remote Access to Campus Data Repositories
Protein Data Bank (PDB) Needs
Bandwidth to Connect Resources and Users

• Archive of experimentally determined 3D structures of proteins, nucleic acids,
  and complex assemblies
• One of the largest scientific resources in the life sciences

Images: Hemoglobin; Virus

Source: Phil Bourne and Andreas Prlić, PDB
PDB Usage Is Growing Over Time

• More than 300,000 Unique Visitors per Month
• Up to 300 Concurrent Users
• ~10 Structures Downloaded per Second, 24/7/365
• Increasingly Popular Web Services Traffic

Source: Phil Bourne and Andreas Prlić, PDB
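As an order-of-magnitude check, ~10 downloads per second around the clock works out to roughly 300 million entry downloads per year, the same scale as the 2010 FTP totals on the next slide.

```python
# Order-of-magnitude check on the download rate quoted on the slide:
# ~10 structure downloads per second, around the clock.

downloads_per_second = 10
seconds_per_year = 365 * 24 * 3600

per_year = downloads_per_second * seconds_per_year
print(f"~{per_year/1e6:.0f} million entry downloads per year")
# -> ~315 million/year, the same order as the 2010 FTP totals
#    (159M + 34M + 16M ≈ 209M across the three sites)
```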
2010 FTP Traffic (entry downloads)

   RCSB PDB    159 million
   PDBe         34 million
   PDBj         16 million

Source: Phil Bourne and Andreas Prlić, PDB
PDB Plans to Establish Global Load Balancing

• Why is it Important?
   – Enables PDB to Better Serve Its Users by Providing
     Increased Reliability and Quicker Results
• How Will it be Done?
   – By More Evenly Allocating PDB Resources at Rutgers and
     UCSD
   – By Directing Users to the Closest Site
• Need High Bandwidth Between Rutgers & UCSD Facilities




               Source: Phil Bourne and Andreas Prlić, PDB
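The slides do not specify how directing users to the closest site will be implemented; one minimal illustration of the idea is to probe each mirror and pick the lowest round-trip time. The hostnames below are placeholders, not PDB's real endpoints, and production global load balancing is more commonly handled at the DNS/CDN layer.

```python
# Illustrative only: one simple way to send a user to the "closest"
# PDB mirror is to probe each site and pick the lowest round-trip time.
# The mirror hostnames below are hypothetical, not PDB's real setup.

import socket
import time

MIRRORS = ["rutgers.example.org", "ucsd.example.org"]   # placeholder names

def rtt_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Measure one TCP connect round trip to host:port, in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000

def closest_mirror(mirrors=MIRRORS) -> str:
    timings = {}
    for host in mirrors:
        try:
            timings[host] = rtt_ms(host)
        except OSError:
            continue                      # skip unreachable mirrors
    return min(timings, key=timings.get) if timings else mirrors[0]

if __name__ == "__main__":
    print("Send this user to:", closest_mirror())
```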
UCSD Center for Computational Mass Spectrometry
Becoming Global MS Repository

• ProteoSAFe: Compute-intensive discovery MS at the click of a button
• MassIVE: Repository and identification platform for all MS data in the world

Source: Nuno Bandeira, Vineet Bafna, Pavel Pevzner, Ingolf Krueger, UCSD
proteomics.ucsd.edu
Automation:
                    Do it Billions of Times
• Large Volumes of Data from Many Sources--Must Automate
   – Thousands of Users, Tens of Thousands of Searches
   – Multi-Omics: Proteomics, Metabolomics, Proteogenomics,
     Natural Products, Glycomics, etc.
• CCMS ProteoSAFe
   – Scalable: Distributed Computation over 1000s of CPUs (see the sketch after this list)
   – Accessible: Intuitive Web-Based User Interfaces
   – Flexible: Easy Integration of New Analysis Workflows
• Already Analyzed >1B Spectra in >26,000 Searches
  from >2,200 users
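The sketch below illustrates the fan-out pattern such a system depends on: split a large batch of spectra into chunks and score them on many CPUs in parallel. It is not CCMS's actual code, and the scoring function is a stand-in for a real spectrum-matching engine.

```python
# A minimal sketch (not ProteoSAFe's implementation) of chunked,
# parallel spectrum searching across many CPU workers.

from concurrent.futures import ProcessPoolExecutor

def search_chunk(spectra_chunk):
    """Placeholder search: return (spectrum_id, score) pairs."""
    return [(spec_id, hash(spec_id) % 1000) for spec_id in spectra_chunk]

def parallel_search(spectrum_ids, workers=8, chunk_size=1000):
    chunks = [spectrum_ids[i:i + chunk_size]
              for i in range(0, len(spectrum_ids), chunk_size)]
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for chunk_result in pool.map(search_chunk, chunks):
            results.extend(chunk_result)
    return results

if __name__ == "__main__":
    ids = [f"spectrum_{i}" for i in range(10_000)]
    print(len(parallel_search(ids)), "spectra scored")
```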
PRISM@UCSD Enables
Remote National and International Collaborations
Tele-Collaboration for Audio Post-Production
Realtime Picture & Sound Editing Synchronized Over IP




     Skywalker Sound@Marin   Calit2@San Diego
Tele-Collaboration
                for Cinema Post-Production




Disney + Skywalker Sound + Digital Domain + Laser Pacific
   NTT Labs + UCSD/Calit2 + UIC/EVL + Pacific Interface
Collaboration Between EVL’s CAVE2
and Calit2’s VROOM Over 10Gb Wavelength

Source: NTT Sponsored ON*VECTOR Workshop at Calit2, March 6, 2013
Calit2 is Linked to CICESE at 10G
Coupling OptIPortals at Each Site
            See Live Demo
        on Calit2 to CICESE 10G
            Weds at 8:30am
PRAGMA: A Practical Collaboration Framework

Build and Sustain Collaborations
Advance & Improve Cyberinfrastructure Through Applications

Source: Peter Arzberger, Calit2 UCSD