From the Heroic to the Logistical: Programming Model Implications of New Supercomputing Applications
Ian Foster, Computation Institute, Argonne National Laboratory & University of Chicago
Abstract
High-performance computers such as the petascale systems being installed at DOE and NSF centers in the US are conventionally focused on “heroic” computations in which many processors are applied to a single task. Yet a growing number of science applications are equally concerned with “logistical” issues: that is, with the high-performance and reliable execution of many tasks that operate on large shared data and/or are linked by communication-intensive producer-consumer relations. Such applications may require the extreme computational capacity and specialized communication fabrics of petascale computers, but are not easily expressed using conventional parallel programming models such as MPI. To enable the use of high-performance computers for these applications, we need new methods for the efficient dispatch, coupling, and management of large numbers of communication-intensive tasks. I discuss how work on scripting languages, high-throughput computing, and parallel I/O can be combined to build new tools that enable the efficient and reliable execution of applications involving from hundreds to millions of uniprocessor and multiprocessor tasks, with aggregate communication requirements of tens of gigabytes per second. I illustrate my presentation by referring to our experiences adapting the Swift parallel programming system (www.ci.uchicago.edu/swift) for efficient execution in both large-scale grid and petascale cluster environments.
What will we do with 1+ Exaflops and 1M+ cores?
Or, If You Prefer, A Worldwide Grid (or Cloud): EGEE
1) Tackle Bigger and Bigger Problems: Computational Scientist as Hero
2) Tackle More Complex Problems: Computational Scientist as Logistics Officer
“More Complex Problems”
- Ensemble runs to quantify climate model uncertainty
- Identify potential drug targets by screening a database of ligand structures against target proteins
- Study economic model sensitivity to parameters
- Analyze turbulence dataset from many perspectives
- Perform numerical optimization to determine optimal resource assignment in energy problems
- Mine collections of data from advanced light sources
- Construct databases of computed properties of chemical compounds
- Analyze data from the Large Hadron Collider
- Analyze log data from 100,000-node parallel computations
Programming Model Issues
- Massive task parallelism
- Massive data parallelism
- Integrating black box applications
- Complex task dependencies (task graphs)
- Failure, and other execution management issues
- Data management: input, intermediate, output
- Dynamic computations (task graphs)
- Dynamic data access to large, diverse datasets
- Long-running computations
- Documenting provenance of data products
(A small code sketch illustrating several of these issues follows.)
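To make the first few issues concrete, here is a minimal Python sketch (plain concurrent.futures, not Swift or Karajan) of many independent application tasks followed by a dependent analysis stage, with a simple retry to mask transient failures. The ./simulate and ./analyze commands are hypothetical placeholders, not tools from the talk.

    from concurrent.futures import ThreadPoolExecutor
    import subprocess

    def run_app(cmd, retries=2):
        # Run a black-box application as one task; retry to mask transient failures.
        for _ in range(retries + 1):
            if subprocess.run(cmd, shell=True).returncode == 0:
                return
        raise RuntimeError("task failed: " + cmd)

    params = range(1000)
    with ThreadPoolExecutor(max_workers=64) as pool:
        # Massive task parallelism: one independent simulation per parameter value.
        list(pool.map(lambda p: run_app(f"./simulate --param {p} --out sim_{p}.dat"), params))
        # Task dependency: the analysis stage consumes the simulation outputs.
        # (Swift-like systems track per-file dependencies instead of this coarse stage barrier.)
        list(pool.map(lambda p: run_app(f"./analyze sim_{p}.dat --out an_{p}.dat"), params))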
Problem Types, arranged by number of tasks (1 to 1M) and input data size (low to high):
- Heroic MPI tasks (single task)
- Data analysis, mining (much data)
- Many loosely coupled tasks (many tasks)
- Much data and complex tasks (many tasks, much data)
An Incomplete and Simplistic View of Programming Models and Tools
- Single task, modest data: MPI, etc., etc., etc.
- Many tasks: DAGMan+Pegasus, Karajan+Swift
- Much data: MapReduce/Hadoop, Dryad
- Complex tasks, much data: Dryad, Pig, Sawzall, Swift+Falkon
Many Tasks: Climate Ensemble Simulations (using FOAM, 2005)
Image courtesy Pat Behling and Yun Liu, UW Madison
- NCAR computer + grad student: 160 ensemble members in 75 days
- TeraGrid + “Virtual Data System”: 250 ensemble members in 4 days
Many Many Tasks: Identifying Potential Drug Targets
2M+ ligands x protein target(s)
(Mike Kubal, Benoit Roux, and others)
Drug-screening workflow, per protein target:
- Inputs: one PDB protein description (~1 MB), manually prepared into DOCK6 and FRED receptor files (one per protein; each defines the pocket to bind to), and ~2M ligand 3-D structures from ZINC (~6 GB).
- Docking: DOCK6 and FRED score every ligand against the receptor (~4M tasks x 60 s x 1 CPU, ~60K CPU-hours); each selects the best ~5K complexes.
- Amber scoring: BuildNABScript generates a NAB script from a template plus parameters (flexible residues, #MD steps); then AmberizeLigand, AmberizeReceptor, AmberizeComplex, and RunNABScript (~10K tasks x 20 min x 1 CPU, ~3K CPU-hours); select the best ~500.
- GCMC: ~500 tasks x 10 hr x 100 CPUs, ~500K CPU-hours.
- Totals for one target: ~4 million tasks, ~500,000 CPU-hours (~50 CPU-years).
(A rough code sketch of this funnel follows.)
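The funnel structure can be summarized in a few lines. Below is a hedged, sequential Python sketch of the pattern only; dock_score, amber_score, and gcmc_free_energy are hypothetical stand-ins for the DOCK6/FRED, Amber, and GCMC runs, and in the real campaign each list comprehension is fanned out as many parallel tasks (with both DOCK6 and FRED contributing survivors to the Amber stage).

    def screen_target(receptor, ligands):
        # Stage 1: fast docking of every ligand (~4M short tasks in the full campaign).
        docked = [(lig, dock_score(receptor, lig)) for lig in ligands]
        best_5k = [lig for lig, _ in sorted(docked, key=lambda x: x[1])[:5000]]
        # Stage 2: more expensive Amber rescoring of the survivors (~10K x 20 min tasks).
        rescored = [(lig, amber_score(receptor, lig)) for lig in best_5k]
        best_500 = [lig for lig, _ in sorted(rescored, key=lambda x: x[1])[:500]]
        # Stage 3: GCMC free-energy runs on the final candidates (~500 long, 100-CPU tasks).
        return [(lig, gcmc_free_energy(receptor, lig)) for lig in best_500]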
DOCK on SiCortex
- CPU cores: 5,760
- Tasks: 92,160
- Elapsed time: 12,821 s
- Compute time: 1.94 CPU-years
- Average task time: 660.3 s (does not include ~800 s to stage input data)
(Ioan Raicu, Zhao Zhang)
DOCK on BG/P: ~1M Tasks on 118,000 CPUs
- CPU cores: 118,784
- Tasks: 934,803
- Elapsed time: 7,257 s
- Compute time: 21.43 CPU-years
- Average task time: 667 s
- Relative efficiency: 99.7% (scaling from 16 to 32 racks)
- Utilization: 99.6% sustained, 78.3% overall
Per-task traffic to GPFS: 1 script (~5 KB), 2 file reads (~10 KB), 1 file write (~10 KB). Cached in RAM from GPFS on the first task per node: 1 binary (~7 MB) and static input data (~45 MB).
(Ioan Raicu, Zhao Zhang, Mike Wilde; a sketch of the per-node caching pattern follows.)
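The GPFS/RAM split above suggests a simple per-node caching pattern: the first task on each node copies the binary and static input from the shared file system to node-local memory, and later tasks read only their small per-task files from GPFS. The Python sketch below illustrates that pattern under those assumptions; the paths and file names are hypothetical, and a production version would add per-node locking.

    import os, shutil

    GPFS_DIR  = "/gpfs/dock_app"          # slow, shared file system (hypothetical path)
    LOCAL_DIR = "/dev/shm/dock_cache"     # fast, node-local RAM disk (hypothetical path)

    def ensure_local_copy():
        # First task on this node pays the copy cost; later tasks reuse the cache.
        # (No locking here: a real implementation must guard against concurrent copies.)
        if not os.path.isdir(LOCAL_DIR):
            os.makedirs(LOCAL_DIR, exist_ok=True)
            shutil.copy(os.path.join(GPFS_DIR, "dock6"), LOCAL_DIR)    # ~7 MB binary
            shutil.copytree(os.path.join(GPFS_DIR, "static_input"),
                            os.path.join(LOCAL_DIR, "static_input"))   # ~45 MB static data
        return LOCAL_DIR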
Managing 120K CPUs with Falkon: combining slower shared storage with high-speed local disk.
MARS Economic Model Parameter Study
- 2,048 BG/P CPU cores
- Tasks: 49,152
- Micro-tasks: 7,077,888
- Elapsed time: 1,601 s
- CPU hours: 894
(Zhao Zhang, Mike Wilde; a micro-task bundling sketch follows.)
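The task/micro-task split above (7,077,888 micro-tasks in 49,152 tasks, i.e. 144 per task) suggests that many very short model evaluations are bundled into each dispatched task to amortize dispatch overhead. The Python fragment below illustrates only that bundling idea; run_mars_case and all_params are hypothetical names, not the actual MARS driver.

    BUNDLE = 144   # micro-tasks per dispatched task, as implied by the counts above

    def bundled_task(micro_params):
        # Runs as a single task on one core, looping over its bundle of model cases.
        return [run_mars_case(p) for p in micro_params]

    bundles = [all_params[i:i + BUNDLE] for i in range(0, len(all_params), BUNDLE)]
    # Each bundle is then submitted as one task to the pool of 2,048 BG/P cores.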
AstroPortal Stacking Service
- Purpose: on-demand “stacks” of random locations within an ~10 TB dataset
- Challenge: rapid access to 10-10K “random” files, under time-varying load
- Sample workloads: Sloan data, accessed via a web page or web service
AstroPortal Stacking Service with Data Diffusion
- Aggregate throughput: 39 Gb/s, 10x higher than GPFS
- Reduced load on GPFS: 0.49 Gb/s, 1/10 of the original load
- Big performance gains as locality increases
(Ioan Raicu, 11:15am TOMORROW; a minimal sketch of the caching idea follows.)
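The gains reported above come from caching data on compute nodes and dispatching tasks where their inputs already reside. The following Python sketch is a minimal illustration of that data-diffusion idea, not Falkon's actual scheduler or API; fetch_from_gpfs and stack_on_node are hypothetical helpers.

    from collections import defaultdict

    node_cache = defaultdict(set)   # node id -> names of files cached on that node

    def pick_node(task_files, nodes):
        # Locality-aware dispatch: prefer the node that already caches the most inputs.
        return max(nodes, key=lambda n: len(node_cache[n] & set(task_files)))

    def run_stack(task_files, nodes):
        node = pick_node(task_files, nodes)
        for f in task_files:
            if f not in node_cache[node]:
                fetch_from_gpfs(f, node)       # pull a missing file from shared storage...
                node_cache[node].add(f)        # ...so it "diffuses" into the node's cache
        return stack_on_node(node, task_files) # co-add the cutouts on that node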
B. Berriman, J. Good (Caltech) J. Jacob, D. Katz (JPL)
Montage Benchmark (Yong Zhao, Ioan Raicu, U.Chicago)
- MPI: ~950 lines of C for one stage
- Pegasus: ~1200 lines of C + tools to generate a DAG for a specific dataset
- SwiftScript: ~92 lines for any dataset
(A rough analogue of the dataset-independent structure follows.)
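One reason the SwiftScript version stays short is that its stages are mapped over whatever input files exist, rather than being generated for a specific dataset. The fragment below is a rough Python analogue of that dataset-independent structure, not actual SwiftScript or the Montage benchmark code; run() is a hypothetical wrapper around the Montage reprojection and co-addition tools.

    import glob

    def mosaic(image_dir, header):
        raws = glob.glob(image_dir + "/*.fits")                   # works for any dataset size
        projected = [run("mProjectPP", r, header) for r in raws]  # fan-out: reproject each image
        return run("mAdd", projected, header)                     # gather: co-add into one mosaic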
Summary
- Peta- and exa-scale computers enable us to tackle new problems at greater scales: parameter studies, ensembles, interactive data analysis, “workflows” of various kinds
- Such apps frequently stress petascale hardware and software in interesting ways
- New programming models and tools are required: mixed task/data parallelism, task management, complex data management, failure handling, …
- Tools (DAGMan, Swift, Hadoop, …) exist but need refinement
- Interesting connections to distributed systems
More info: www.ci.uchicago.edu/swift
 

Editor's Notes

  • #2 Ken Wilson observed that computational science is a third mode of enquiry, in addition to experiment and theory. My theme is rather how, by taking a systems view of the knowledge generation process, we can identify ways in which computation can accelerate it.