Overview of Cyberinfrastructure and the Breadth of Its Application
Geoffrey Fox
Computer Science, Informatics, Physics
Chair, Informatics Department
Director, Community Grids Laboratory and Digital Science Center
Indiana University, Bloomington IN 47404
(Presenter: Marlon Pierce)
[email_address]
http://www.infomall.org
[email_address]
Evolution of Scientific Computing, 1985-2010 (timeline figure, X-axis = time): Parallel Computing → Grids and Federated Computing → Scientific Enterprise Computing → Scientific Web 2.0 → Cloud Computing → Parallel Computing. Evidence of Intelligent Design? The Y-axis is whatever you want it to be.
What is High Performance Computing? The meaning was clear 20 years ago, when we were planning and starting the HPCC (High Performance Computing and Communication) Initiative. It meant parallel computing, and HPCC lasted for 10 years. As an outgrowth, NSF started funding supercomputer centers, and we debated vector versus "massively parallel" systems; data was not yet a driving issue. TeraGrid is the current incarnation. NSF subsequently established the Office of Cyberinfrastructure, a comprehensive approach to physical infrastructure, alongside the complementary NSF concept of "Computational Thinking". Everyone needs cyberinfrastructure. The core idea is always connecting resources through messages: MPI, JMS, XML, Twitter, etc.
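To make the "connecting resources through messages" idea concrete, here is a minimal point-to-point exchange. This is an illustrative sketch only, assuming mpi4py is installed and the script is launched under mpiexec with at least two ranks.

```python
# Minimal sketch of "connecting resources through messages" using MPI.
# Assumes mpi4py is installed; run with: mpiexec -n 2 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 sends a small Python object as a message to rank 1
    comm.send({"payload": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    # Rank 1 blocks until the message arrives, then prints it
    msg = comm.recv(source=0, tag=0)
    print("rank 1 received:", msg)
```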
TeraGrid High Performance Computing Systems, 2007-8: computational resources (figure; sizes approximate, not to scale). Sites shown: SDSC, TACC, NCSA, ORNL, PU, IU, PSC, NCAR, Tennessee, LONI/LSU, UC/ANL; 504 TF, growing to ~1 PF in 2008. Slide courtesy Tommy Minyard, TACC.
Resources for many disciplines! More than 120,000 processors in aggregate; resource availability grew during 2008 at unprecedented rates.
Large Hadron Collider, CERN, Geneva: 2008 start. 27 km tunnel in Switzerland & France; pp collisions at √s = 14 TeV, L = 10³⁴ cm⁻² s⁻¹. Experiments: CMS and ATLAS (pp, general purpose; HI), ALICE (HI), LHCb (B-physics), TOTEM. Physics goals: Higgs, SUSY, extra dimensions, CP violation, QG plasma, ... the unexpected. 5000+ physicists, 250+ institutes, 60+ countries. Challenges: analyze petabytes of complex data cooperatively; harness global computing, data, and network resources.
Linked Environments for Atmospheric Discovery (LEAD): Grid services, triggered by abnormal events and controlled by workflow, process real-time data from radar and high-resolution simulations for tornado forecasts. Typical graphical interface to service composition.
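As a toy illustration (not the LEAD implementation), the sketch below uses a made-up reflectivity threshold as the "abnormal event" and a chain of plain Python functions as the triggered "workflow".

```python
# Toy illustration of an event-triggered workflow (not the LEAD system):
# an "abnormal event" in a simulated radar feed triggers a chain of steps.
import random

REFLECTIVITY_THRESHOLD = 55.0  # hypothetical dBZ trigger level

def ingest_radar():
    # Stand-in for a real-time radar stream
    return [random.uniform(10, 70) for _ in range(100)]

def run_high_res_simulation(event_value):
    return f"forecast grid seeded with reflectivity {event_value:.1f} dBZ"

def publish_forecast(result):
    print("publishing:", result)

def workflow(scan):
    # The "workflow" chains analysis steps whenever the trigger fires
    for value in scan:
        if value > REFLECTIVITY_THRESHOLD:  # abnormal event detected
            publish_forecast(run_high_res_simulation(value))
            break

workflow(ingest_radar())
```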
Environmental Monitoring Cyberinfrastructure at Clemson
Forces on Cyberinfrastructure:  Clouds, Multicore, and Web 2.0
Gartner 2008 Technology Hype Curve: clouds, microblogs, and Green IT appear; basic Web Services, wikis, and SOA are becoming mainstream.
Gartner’s 2005 Hype Curve
Relevance of Web 2.0: Web 2.0 can help e-Research in many ways. Its tools (web sites) can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids. The popularity of Web 2.0 provides high quality technologies and software that (due to large commercial investment) can be very useful in e-Research and preferable to complex Grid or Web Service solutions. The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience. Cyberinfrastructure is the research analogue of major commercial initiatives, leading, for example, to important job opportunities for students!
Enterprise Approach | Web 2.0 Approach
JSR 168 Portlets | Google Gadgets, widgets, badges
Server-side integration and processing | AJAX, client-side integration and processing, JavaScript
SOAP | RSS, Atom, JSON
WSDL | REST (GET, PUT, DELETE, POST)
Portlet containers | Open Social containers (Orkut, LinkedIn, Shindig); Facebook; StartPages
User-centric gateways | Social networking portals
Workflow managers (Taverna, Kepler, XBaya, etc.) | Mash-ups
WS-Eventing, WS-Notification, enterprise messaging | Blogging and micro-blogging with REST, RSS/Atom, and JSON messages (Blogger, Twitter)
Semantic Web: RDF, OWL, ontologies | Microformats, folksonomies
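The Web 2.0 column boils down to plain HTTP plus JSON. A minimal sketch of that style is below; the endpoint URL and response fields are hypothetical, standing in for any REST resource exposed by a gateway.

```python
# Sketch of the Web 2.0 column: a plain REST GET returning JSON,
# in contrast to the WSDL/SOAP machinery of the enterprise column.
# The endpoint URL and the "status" field are hypothetical.
import json
import urllib.request

url = "https://api.example.org/gateways/jobs/42"   # hypothetical REST resource
with urllib.request.urlopen(url) as response:      # plain HTTP GET
    job = json.loads(response.read().decode("utf-8"))

print(job.get("status"))
```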
Cloud Computing: Infrastructure and Runtimes. Cloud infrastructure: outsourcing of servers, computing, data, file space, etc., handled through Web services that control virtual machine lifecycles. Cloud runtimes: tools for using clouds to do data-parallel computations, e.g. Apache Hadoop, Google MapReduce, Microsoft Dryad, and others. They were designed for information retrieval but are excellent for a wide range of machine learning and science applications (e.g. Apache Mahout), and may also be a good match for the 32-128 core computers expected in the next 5 years.
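The programming model these runtimes implement at scale can be shown in a few lines. This is a single-process sketch of map, shuffle, and reduce using the canonical word-count example; Hadoop and Dryad add distribution, scheduling, and fault tolerance around the same pattern.

```python
# Minimal sketch of the MapReduce programming model that runtimes such as
# Hadoop and Dryad implement at scale; here everything runs in one process.
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; word count is the canonical example
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    return key, sum(values)

records = ["clouds hide complexity", "clouds scale data analysis"]

# Shuffle: group intermediate pairs by key
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print(dict(reduce_phase(k, v) for k, v in groups.items()))
```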
Some Commercial Clouds (bold-faced entries have open source equivalents)
Cloud/Service | Amazon | Microsoft Azure | Google (and Apache)
Data | S3, EBS, SimpleDB | Blob, Table, SQL Services | GFS, BigTable
Computing | EC2, Elastic MapReduce (runs Hadoop) | Compute Service | MapReduce (not public, but Hadoop)
Service Hosting | Amazon Load Balancing | Web Hosting Service | AppEngine/AppDrop
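As an example of the "Data" row, storing and retrieving an object in Amazon S3 takes only a few calls. This sketch uses the current boto3 SDK; the bucket name is hypothetical and credentials are assumed to be configured in the environment.

```python
# Sketch of the Amazon "Data" entry: put and get an object in S3.
# Uses the boto3 SDK; the bucket name is hypothetical and AWS credentials
# are assumed to be configured in the environment.
import boto3

s3 = boto3.client("s3")
bucket = "my-research-data"  # hypothetical bucket

s3.put_object(Bucket=bucket, Key="runs/run1.csv", Body=b"t,temp\n0,21.4\n")
obj = s3.get_object(Bucket=bucket, Key="runs/run1.csv")
print(obj["Body"].read().decode("utf-8"))
```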
Clouds as Cost-Effective Data Centers: exploit the Internet by building giant data centers with hundreds of thousands of computers, roughly 200-1000 to a shipping container. "Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date."
Clouds Hide Complexity: build portals around all computing capability. SaaS: Software as a Service. IaaS: Infrastructure as a Service (or HaaS: Hardware as a Service). PaaS: Platform as a Service, which delivers SaaS on IaaS. Cyberinfrastructure is "Research as a Service". Two Google warehouses of computers sit on the banks of the Columbia River in The Dalles, Oregon. Such centers use 20-200 MW (future) each, about 150 watts per core, and save money through large size, positioning near cheap power, and Internet access.
Open Architecture Clouds: Amazon, Google, Microsoft, et al. don't tell you how to build a cloud; that is proprietary knowledge. Indiana University and others want to document this publicly. What is the right way to build a cloud? It is more than just running software. What is the minimum-sized organization to run a cloud: a department, a university, a university consortium, or outsource it all? There are analogous issues in government, industry, and enterprise. Example issues: What hardware setups work best? What are you getting into? What is the best virtualization technology for different problems?
Data-File Parallelism and Clouds: Now that you have a cloud, you may want to do large-scale processing with it. The classic problems perform the same (sequential) algorithm on fragments of extremely large data sets. Cloud runtime engines manage these replicated algorithms in the cloud; they can be chained together in pipelines (Hadoop) or DAGs (Dryad), and the runtimes handle problems like failure control. We are exploring both scientific applications and classic parallel algorithms (clustering, matrix multiplication) using clouds and cloud runtimes.
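The core pattern, the same sequential function applied independently to many file fragments, can be sketched on a single machine with the standard library; cloud runtimes apply the same idea across a whole cluster and add scheduling and failure control. File names and the per-file analysis below are placeholders.

```python
# Sketch of data-file parallelism: one sequential function applied
# independently to many file fragments, here across local cores only.
import glob
from multiprocessing import Pool

def process_fragment(path):
    # Any sequential per-file analysis goes here; line counting is a stand-in
    with open(path) as f:
        return path, sum(1 for _ in f)

if __name__ == "__main__":
    fragments = glob.glob("data/fragment_*.txt")  # hypothetical input files
    with Pool() as pool:
        for path, count in pool.map(process_fragment, fragments):
            print(path, count)
```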
Data Intensive Research: research is advanced by observation, i.e. analyzing data from gene sequencers, accelerators, telescopes, environmental sensors, web crawlers, and ethnographic interviews. This data is "filtered", "analyzed", "data mined" (the term used in computer science) to produce conclusions. Weather forecasting and climate prediction are of this type.
Geospatial Examples: Image processing and mining, e.g. SAR images from the Polar Grid project (J. Wang), applied to 20 TB of data. Flood modeling I: chaining flood models over a geographic area. Flood modeling II: parameter fits and inversion problems. Real-time GPS processing: filtering a continuous stream of position fixes.
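As a toy sketch of the real-time GPS filtering step (not the project's actual filter, which would typically model dynamics and noise explicitly, e.g. a Kalman filter), simple exponential smoothing over a stream of position fixes looks like this; the sample coordinates are made up.

```python
# Toy sketch of a real-time GPS filter: exponential smoothing of a noisy
# stream of (lat, lon) fixes as they arrive. Not a production filter.
def smooth_stream(samples, alpha=0.3):
    estimate = None
    for lat, lon in samples:
        if estimate is None:
            estimate = (lat, lon)
        else:
            estimate = (alpha * lat + (1 - alpha) * estimate[0],
                        alpha * lon + (1 - alpha) * estimate[1])
        yield estimate

stream = [(39.17, -86.52), (39.18, -86.51), (39.16, -86.53)]  # made-up fixes
for est in smooth_stream(stream):
    print(est)
```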
Parallel Clustering and Parallel Multidimensional Scaling (MDS). Figure panels: 4500 points, Clustal MSA; 3000 points, Clustal MSA with Kimura2 distance; 4000 points, patient record data on obesity and environment; 4500 points, pairwise aligned. Applied to ~5000-dimensional gene sequences and ~20-dimensional patient record data, with very good parallel speedup.
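The MDS task itself, embedding points in 2-D from a pairwise distance matrix, can be sketched serially with classical (Torgerson) scaling in NumPy; the parallel version in the slide solves the same problem for thousands of sequences. The three-point distance matrix below is a made-up example.

```python
# Sketch of classical multidimensional scaling (MDS): embed points in 2-D
# from a pairwise distance matrix. Serial NumPy version for illustration.
import numpy as np

def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]    # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Tiny example: three points with pairwise distances 1, 1, and sqrt(2)
D = np.array([[0.0, 1.0, 2 ** 0.5],
              [1.0, 0.0, 1.0],
              [2 ** 0.5, 1.0, 0.0]])
print(classical_mds(D))
```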
Some Other File/Data Parallel Examples from the Indiana University Biology Department:
EST (Expressed Sequence Tag) assembly (Dong): 2 million mRNA sequences generate 540,000 files, taking 15 hours on 400 TeraGrid nodes (the CAP3 run dominates)
MultiParanoid/InParanoid gene sequence clustering (Dong): 476 core-years just for prokaryotes
Population genomics (Lynch): looking at all pairs separated by up to 1000 nucleotides
Sequence-based transcriptome profiling (Cherbas, Innes): MAQ, SOAP
Systems microbiology (Brun): BLAST, InterProScan
Metagenomics (Fortenberry, Nelson): pairwise alignment of 7243 16S sequences took 12 hours on TeraGrid
All can use Dryad or Hadoop, as sketched below.
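One way such runs map onto Hadoop is a streaming-style mapper that shells out to the sequential tool once per input file. The sketch below assumes each input line names one FASTA file, that cap3 is on the PATH, and that the files are visible on the worker nodes; it is an illustration, not the group's actual pipeline.

```python
# Sketch of a Hadoop-streaming-style mapper for a CAP3-like run:
# each stdin line names one sequence file; the mapper runs the sequential
# tool and emits a tab-separated (file, return-code) record for the reducer.
# Assumes cap3 is on the PATH and input files are accessible locally.
import subprocess
import sys

for line in sys.stdin:
    fasta_path = line.strip()
    if not fasta_path:
        continue
    result = subprocess.run(["cap3", fasta_path], capture_output=True, text=True)
    print(f"{fasta_path}\t{result.returncode}")
```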
Intel's Projection. Technology might support:
2010: 16-64 cores, 200 GF-1 TF
2013: 64-256 cores, 500 GF-4 TF
2016: 256-1024 cores, 2 TF-20 TF
Too Much Computing? Historically, both grids and parallel computing have tried to increase computing capabilities by: optimizing the performance of codes at the cost of re-usability; exploiting all possible CPUs, such as graphics co-processors and "idle cycles" (across administrative domains); and linking central computers together, such as NSF/DoE/DoD supercomputer networks, without clear user requirements. The next crisis in the technology area will be the opposite problem: commodity chips will be 32-128-way parallel in 5 years' time, and we currently have no idea how to use them on commodity systems, especially on clients. There will be only 2 releases of standard software (e.g. Office) in this time span, so we need solutions that can be implemented in the next 3-5 years. Intel's RMS analysis: gaming and generalized decision support (data mining) are ways of using these cycles.
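One flavor of "decision support" work that soaks up many-core cycles is the assignment step of a clustering or data-mining kernel. The sketch below, a nearest-centroid pass over synthetic data split across all available cores with the standard library, is an illustration of the kind of commodity-client workload the slide has in mind, not a specific Intel or RMS code.

```python
# Sketch of a many-core data-mining kernel: the nearest-centroid assignment
# step of k-means over synthetic data, split across all available cores.
import random
from multiprocessing import Pool, cpu_count

CENTROIDS = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # made-up cluster centers

def nearest_centroid(point):
    x, y = point
    return min(range(len(CENTROIDS)),
               key=lambda i: (x - CENTROIDS[i][0]) ** 2 + (y - CENTROIDS[i][1]) ** 2)

if __name__ == "__main__":
    points = [(random.uniform(0, 10), random.uniform(0, 5)) for _ in range(100000)]
    with Pool(cpu_count()) as pool:  # one worker per core
        labels = pool.map(nearest_centroid, points, chunksize=1000)
    print("points per cluster:", [labels.count(i) for i in range(len(CENTROIDS))])
```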