A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research


Keynote Presentation
CENIC 2013
Held at Calit2@UCSD


  1. “A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research.” Keynote Presentation, CENIC 2013, Held at Calit2@UCSD, March 11, 2013. Dr. Larry Smarr, Director, California Institute for Telecommunications and Information Technology; Harry E. Gruber Professor, Dept. of Computer Science and Engineering, Jacobs School of Engineering, UCSD
  2. “Blueprint for the Digital University” -- Report of the UCSD Research Cyberinfrastructure Design Team, April 2009. A Five-Year Process Begins Pilot Deployment This Year. No Data Bottlenecks -- Design for Gigabit/s Data Flows. See talk on RCI by Richard Moore today at 4pm. research.ucsd.edu/documents/rcidt/RCIDTReportFinal2009.pdf
  3. Calit2 Sunlight OptIPuter Exchange Connects 60 Campus Sites, Each Dedicated at 10Gbps. Maxine Brown, EVL, UIC, OptIPuter Project Manager
  4. Rapid Evolution of 10GbE Port Prices Makes Campus-Scale 10Gbps CI Affordable. Port Pricing is Falling and Density is Rising Dramatically; the Cost of 10GbE is Approaching Cluster HPC Interconnects. Over 2005-2013: ~$80K/port (Chiaro, 60 ports max), ~$5K/port (Force 10, 40 ports max), ~$500/port (Arista, 48 ports), ~$400/port (Arista, 48 ports today; 576 ports in 2013). Source: Philip Papadopoulos, SDSC/Calit2
  5. Arista Enables SDSC’s Massively Parallel 10G Switched Data Analysis Resource
  6. Partnering Opportunities with NSF: SDSC’s Gordon, Dedicated Dec. 5, 2011. Data-Intensive Supercomputer Based on SSD Flash Memory and Virtual Shared Memory SW: • Emphasizes MEM and IOPS over FLOPS • Supernode has Virtual Shared Memory: 2 TB RAM Aggregate, 8 TB SSD Aggregate • Total Machine = 32 Supernodes • 4 PB Disk Parallel File System, >100 GB/s I/O. System Designed to Accelerate Access to Massive Datasets Being Generated in Many Fields of Science, Engineering, Medicine, and Social Science. Source: Mike Norman, Allan Snavely, SDSC
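The per-supernode figures on this slide combine into machine-wide totals; a back-of-the-envelope sketch in Python (all input figures are the slide’s; only the multiplication is added):

```python
# Machine-wide totals for SDSC Gordon, from the per-supernode figures
# quoted on the slide (32 supernodes, 2 TB RAM and 8 TB SSD each).
SUPERNODES = 32
RAM_PER_SUPERNODE_TB = 2
SSD_PER_SUPERNODE_TB = 8

total_ram_tb = SUPERNODES * RAM_PER_SUPERNODE_TB   # 64 TB RAM machine-wide
total_ssd_tb = SUPERNODES * SSD_PER_SUPERNODE_TB   # 256 TB flash machine-wide

print(f"Machine-wide RAM: {total_ram_tb} TB")
print(f"Machine-wide SSD: {total_ssd_tb} TB")
```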
  7. Gordon Bests Previous Mega I/O per Second by 25x
  8. Creating a “Big Data Freeway” System Connecting Instruments, Computers, & Storage. Phil Papadopoulos, PI; Larry Smarr, co-PI. PRISM@UCSD, Start Date 1/1/13. See talk on PRISM by Phil P. tomorrow at 9am
  9. Many Disciplines Beginning to Need Dedicated High Bandwidth on Campus: How to Terminate a CENIC 100G Campus Connection. • Remote Analysis of Large Data Sets – Particle Physics • Connection to Remote Campus Compute & Storage Clusters – Ocean Observatory; Microscopy and Next-Gen Sequencers • Providing Remote Access to Campus Data Repositories – Protein Data Bank and Mass Spectrometry • Enabling Remote Collaborations – National and International
  10. PRISM@UCSD Enables Remote Analysis of Large Data Sets
  11. CERN’s CMS Detector is One of the World’s Most Complex Scientific Instruments. See talk on LHC 100G Networks by Azher Mughal, Caltech, today at 10am
  12. CERN’s CMS Experiment Generates Massive Amounts of Data
  13. UCSD is a Tier-2 LHC Data Center. Source: Frank Wuerthwein, Physics, UCSD
  14. Flow Out of CERN for CMS Detector Peaks at 32 Gbps! Source: Frank Wuerthwein, Physics, UCSD
  15. CMS Flow Into Fermilab Peaks at 10 Gbps. Source: Frank Wuerthwein, Physics, UCSD
  16. CMS Flow into UCSD Physics Peaks at 2.4 Gbps. Source: Frank Wuerthwein, Physics, UCSD
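At the peak rates quoted on these three slides, bulk transfer time follows directly from bandwidth; a sketch in Python (the 100 TB dataset size is a hypothetical example, not a figure from the slides; decimal TB and Gbps units assumed):

```python
# Time to move a dataset at the peak rates quoted on the slides.
# The 100 TB dataset size is illustrative only.
def transfer_hours(dataset_tb: float, rate_gbps: float) -> float:
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits (decimal units)
    seconds = bits / (rate_gbps * 1e9)    # bits / (bits per second)
    return seconds / 3600

for site, gbps in [("CERN outbound", 32.0),
                   ("Fermilab inbound", 10.0),
                   ("UCSD inbound", 2.4)]:
    print(f"100 TB at {gbps:5.1f} Gbps ({site}): {transfer_hours(100, gbps):6.1f} h")
```

At the 2.4 Gbps UCSD peak, 100 TB takes roughly four days of sustained transfer, which is the case for dedicated high-bandwidth campus paths.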
  17. The Open Science Grid: A Consortium of Universities and National Labs. Open for all of science, including biology, chemistry, computer science, engineering, mathematics, medicine, and physics. Source: Frank Wuerthwein, Physics, UCSD
  18. Planning for Climate Change in California: Substantial Shifts on Top of Already High Climate Variability. Dan Cayan, USGS Water Resources Discipline and Scripps Institution of Oceanography, UC San Diego, with much support from Mary Tyree, Mike Dettinger, Guido Franco, and other colleagues. Sponsors: California Energy Commission, NOAA RISA program, California DWR, DOE, NSF
  19. UCSD Campus Climate Researchers Need to Download Results from Remote Supercomputer Simulations. Greenhouse Gas Emissions and Concentration, CMIP3 GCMs. Source: Dan Cayan, SIO, UCSD
  20. Global to Regional Downscaling: GCMs (~150 km) downscaled to regional models (~12 km). Many simulations from IPCC AR4 and IPCC AR5 have been downscaled using statistical methods. Increasing Volume of Climate Simulations: in comparison to the 4th IPCC (CMIP3) GCMs, the latest-generation CMIP5 models provide more simulations, higher spatial resolution, more developed process representation, and more available daily output. Source: Dan Cayan, SIO, UCSD
  21. Average Summer Afternoon Temperature: GFDL A2 Downscaled to 1 km. Hugo Hidalgo, Tapash Das, Mike Dettinger
  22. How Much California Snow Loss? Initial projections indicate substantial reduction in snow water for the Sierra Nevada, with declining April 1 SWE: 2050 median SWE ~2/3 of historical median; 2100 median SWE ~1/3 of historical median
  23. PRISM@UCSD Enables Connection to Remote Campus Compute & Storage Clusters
  24. The OOI CI is Built on Dedicated 10GE and Serves Researchers, Education, and the Public. Source: Matthew Arrott, John Orcutt, OOI CI
  25. Reused Undersea Optical Cables Form a Part of the Ocean Observatories. Source: John Delaney, UWash, OOI
  26. OOI CI is Built on Dedicated Optical Networks and Federal Agency & Commercial Clouds. Source: John Orcutt, Matthew Arrott, SIO/Calit2
  27. OOI CI Team at Scripps Institution of Oceanography Needs Connection to Its Server Complex in Calit2
  28. Ultra High Resolution Microscopy Images Created at the National Center for Microscopy & Imaging Research
  29. Microscopes Are Big Data Generators, Driving Software & Cyberinfrastructure Development. Zeiss Merlin 3View w/ 32k x 32k Scanning and Automated Mosaicing: Current = 1-2 TB/week, soon 12 TB/week. JEOL-4000EX w/ 8k x 8k CCD, Automated Mosaicing, and Serial Tomography: Current = 1 TB/week. FEI Titan w/ 4k x 4k STEM, EELS, 4k x 3.5k DDD, 4k x 4k CCD, Automated Mosaicing, and Multi-tilt Tomography: Current = 1 TB/week. 200-500 TB/year Raw → >2 PB/year Aggregate. Source: Mark Ellisman, School of Medicine, UCSD
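The current per-instrument rates on this slide annualize roughly to the quoted 200-500 TB/year raw figure; a sketch in Python (instrument rates from the slide; the 52-week continuous duty cycle is an assumption for illustration):

```python
# Annualizing the per-instrument rates quoted on the slide.
# A 52-week duty cycle is assumed; real instruments run less than that.
WEEKS_PER_YEAR = 52

current_tb_per_week = {
    "Zeiss Merlin 3View": 2,   # slide quotes 1-2 TB/week now, 12 TB/week soon
    "JEOL-4000EX": 1,
    "FEI Titan": 1,
}

yearly_tb = {name: tb * WEEKS_PER_YEAR for name, tb in current_tb_per_week.items()}
total_tb = sum(yearly_tb.values())

for name, tb in yearly_tb.items():
    print(f"{name}: ~{tb} TB/year")
print(f"Combined: ~{total_tb} TB/year raw")
```

The combined ~208 TB/year sits at the low end of the slide’s 200-500 TB/year raw range; the Zeiss upgrade to 12 TB/week alone would add ~624 TB/year.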
  30. NIH National Center for Microscopy & Imaging Research: Integrated Infrastructure of Shared Resources (scientific instruments, shared infrastructure, local SOM infrastructure, end-user workstations). Source: Steve Peltier, Mark Ellisman, NCMIR
  31. Agile System that Spans Resource Classes
  32. SDSC Gordon Supercomputer Analysis of LS Gut Microbiome Displayed on Calit2 VROOM. Calit2 VROOM, Future Patient Expedition. See Live Demo on Calit2-to-CICESE 10G Weds at 8:30am
  33. PRISM@UCSD Enables Providing Remote Access to Campus Data Repositories
  34. Protein Data Bank (PDB) Needs Bandwidth to Connect Resources and Users. • Archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies • One of the largest scientific resources in the life sciences. (Images: virus, hemoglobin.) Source: Phil Bourne and Andreas Prlić, PDB
  35. PDB Usage Is Growing Over Time: • More than 300,000 Unique Visitors per Month • Up to 300 Concurrent Users • ~10 Structures Downloaded per Second, 24/7/365 • Increasingly Popular Web Services Traffic. Source: Phil Bourne and Andreas Prlić, PDB
  36. 2010 FTP Traffic: RCSB PDB, 159 million entry downloads; PDBe, 34 million entry downloads; PDBj, 16 million entry downloads. Source: Phil Bourne and Andreas Prlić, PDB
  37. PDB Plans to Establish Global Load Balancing. Why is it important? It enables PDB to better serve its users by providing increased reliability and quicker results. How will it be done? By more evenly allocating PDB resources at Rutgers and UCSD and by directing users to the closest site. This requires high bandwidth between the Rutgers & UCSD facilities. Source: Phil Bourne and Andreas Prlić, PDB
  38. UCSD Center for Computational Mass Spectrometry Becoming Global MS Repository. ProteoSAFe: compute-intensive discovery MS at the click of a button. MassIVE: repository and identification platform for all MS data in the world. Source: Nuno Bandeira, Vineet Bafna, Pavel Pevzner, Ingolf Krueger, UCSD. proteomics.ucsd.edu
  39. Automation: Do It Billions of Times. • Large Volumes of Data from Many Sources -- Must Automate: Thousands of Users, Tens of Thousands of Searches; Multi-Omics: Proteomics, Metabolomics, Proteogenomics, Natural Products, Glycomics, etc. • CCMS ProteoSAFe -- Scalable: Distributed Computation over 1000s of CPUs; Accessible: Intuitive Web-Based User Interfaces; Flexible: Easy Integration of New Analysis Workflows. • Already Analyzed >1B Spectra in >26,000 Searches from >2,200 Users
  40. PRISM@UCSD Enables Remote National and International Collaborations
  41. Tele-Collaboration for Audio Post-Production: Realtime Picture & Sound Editing Synchronized Over IP. Skywalker Sound@Marin, Calit2@San Diego
  42. Tele-Collaboration for Cinema Post-Production: Disney + Skywalker Sound + Digital Domain + Laser Pacific + NTT Labs + UCSD/Calit2 + UIC/EVL + Pacific Interface
  43. Collaboration Between EVL’s CAVE2 and Calit2’s VROOM Over 10Gb Wavelength. Source: NTT-Sponsored ON*VECTOR Workshop at Calit2, March 6, 2013
  44. Calit2 is Linked to CICESE at 10G, Coupling OptIPortals at Each Site. See Live Demo on Calit2-to-CICESE 10G Weds at 8:30am
  45. PRAGMA: A Practical Collaboration Framework. Build and Sustain Collaborations; Advance & Improve Cyberinfrastructure Through Applications. Source: Peter Arzberger, Calit2, UCSD