Many Task Applications for Grids and Supercomputers

Invited talk at CLADE 2008 in Boston: http://www.cs.okstate.edu/clade2008/.

• Speaker note: Ken Wilson observed that computational science is a third mode of enquiry, in addition to experiment and theory. My theme is rather how, by taking a systems view of the knowledge-generation process, we can identify ways in which computation can accelerate it.

1. From the Heroic to the Logistical: Programming Model Implications of New Supercomputing Applications
   Ian Foster, Computation Institute, Argonne National Laboratory & University of Chicago
2. Abstract
   High-performance computers such as the petascale systems being installed at DOE and NSF centers in the US are conventionally focused on “heroic” computations in which many processors are applied to a single task. Yet a growing number of science applications are equally concerned with “logistical” issues: that is, with the high-performance and reliable execution of many tasks that operate on large shared data and/or are linked by communication-intensive producer-consumer relations. Such applications may require the extreme computational capacity and specialized communication fabrics of petascale computers, but are not easily expressed using conventional parallel programming models such as MPI. To enable the use of high-performance computers for these applications, we need new methods for the efficient dispatch, coupling, and management of large numbers of communication-intensive tasks. I discuss how work on scripting languages, high-throughput computing, and parallel I/O can be combined to build new tools that enable the efficient and reliable execution of applications involving from hundreds to millions of uniprocessor and multiprocessor tasks, with aggregate communication requirements of tens of gigabytes per second. I illustrate my presentation by referring to our experiences adapting the Swift parallel programming system (www.ci.uchicago.edu/swift) for efficient execution in both large-scale grid and petascale cluster environments.
3. What will we do with 1+ exaflops and 1M+ cores?
4. Or, If You Prefer, a Worldwide Grid (or Cloud): EGEE
5. 1) Tackle Bigger and Bigger Problems: Computational Scientist as Hero
6. 2) Tackle More Complex Problems: Computational Scientist as Logistics Officer
7. “More Complex Problems”
   - Ensemble runs to quantify climate model uncertainty
   - Identify potential drug targets by screening a database of ligand structures against target proteins
   - Study economic model sensitivity to parameters
   - Analyze turbulence dataset from many perspectives
   - Perform numerical optimization to determine optimal resource assignment in energy problems
   - Mine collection of data from advanced light sources
   - Construct databases of computed properties of chemical compounds
   - Analyze data from the Large Hadron Collider
   - Analyze log data from 100,000-node parallel computations
8. Programming Model Issues
   - Massive task parallelism
   - Massive data parallelism
   - Integrating black-box applications
   - Complex task dependencies (task graphs)
   - Failure, and other execution management issues
   - Data management: input, intermediate, output
   - Dynamic computations (task graphs)
   - Dynamic data access to large, diverse datasets
   - Long-running computations
   - Documenting provenance of data products
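The list above maps fairly directly onto a small amount of code. Here is a minimal, illustrative Python sketch (not Swift or Falkon) of a task-graph executor covering three of these issues: black-box task integration, dependency ordering, and retry on failure. The task names and echo commands are placeholders.

```python
# Minimal illustrative sketch: run a task graph of black-box commands with
# dependencies and simple retry (not the Swift/Falkon implementation).
import subprocess
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical task graph: task name -> (command, list of prerequisite tasks)
TASKS = {
    "prep":    (["echo", "prepare inputs"], []),
    "sim_a":   (["echo", "simulate A"], ["prep"]),
    "sim_b":   (["echo", "simulate B"], ["prep"]),
    "analyze": (["echo", "analyze A and B"], ["sim_a", "sim_b"]),
}

def run(cmd, retries=2):
    # Run one black-box task, retrying on transient failure.
    for _ in range(retries + 1):
        if subprocess.run(cmd).returncode == 0:
            return
    raise RuntimeError(f"task failed after {retries + 1} attempts: {cmd}")

def execute(tasks, workers=4):
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            # Submit every task whose prerequisites are all satisfied.
            for name, (cmd, deps) in tasks.items():
                if name not in done and name not in running and set(deps) <= done:
                    running[name] = pool.submit(run, cmd)
            # Block until at least one running task completes, then record it.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                running.pop(name).result()   # re-raises if the task failed
                done.add(name)

execute(TASKS)
```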
9. Problem Types
   (Chart: problems arranged by number of tasks, 1 to 1M, versus input data size, low to high; regions labeled “Heroic MPI tasks”, “Data analysis, mining”, “Many loosely coupled tasks”, and “Much data and complex tasks”.)
10. An Incomplete and Simplistic View of Programming Models and Tools
    - Many tasks: DAGMan+Pegasus, Karajan+Swift
    - Much data: MapReduce/Hadoop, Dryad
    - Complex tasks, much data: Dryad, Pig, Sawzall, Swift+Falkon
    - Single task, modest data: MPI, etc., etc., etc.
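For contrast between these rows, here is the core pattern of the “much data” tools in a few lines of plain Python: a map phase that emits partial results per record and an associative reduce phase that merges them. This is only an illustration of the programming model, not of Hadoop or Dryad themselves.

```python
# Illustrative only: the map/reduce pattern, as a word count with no framework.
from collections import Counter
from functools import reduce

documents = ["the heroic model", "the logistical model", "many tasks much data"]

def map_phase(doc):
    # Emit (word, count) pairs for one input record.
    return Counter(doc.split())

def reduce_phase(a, b):
    # Merge partial counts; the operation is associative, so it parallelizes.
    return a + b

word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())
print(word_counts.most_common(3))
```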
11. Many Tasks: Climate Ensemble Simulations (using FOAM, 2005)
    - NCAR computer + grad student: 160 ensemble members in 75 days
    - TeraGrid + “Virtual Data System”: 250 ensemble members in 4 days
    Image courtesy Pat Behling and Yun Liu, UW Madison
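Structurally, an ensemble study is just many independent runs of the same model under perturbed inputs. A minimal sketch, assuming a hypothetical foam_model command and a made-up perturbation parameter (the real runs used FOAM under the Virtual Data System):

```python
# Hedged sketch: launch 250 independent ensemble members concurrently.
# The "foam_model" command and "--perturb" flag are hypothetical.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_member(member_id, perturbation):
    # Each ensemble member is an independent black-box model run that
    # writes to its own output directory.
    cmd = ["foam_model", f"--perturb={perturbation}", f"--outdir=run_{member_id:03d}"]
    return member_id, subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(run_member, i, 0.001 * i) for i in range(250)]
    for fut in futures:
        member_id, rc = fut.result()
        print(f"member {member_id}: {'ok' if rc == 0 else 'failed'}")
```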
12. Many, Many Tasks: Identifying Potential Drug Targets
    Screen 2M+ ligands against protein target(s). (Mike Kubal, Benoit Roux, and others)
13. Drug-Target Screening Workflow
    Inputs: PDB protein descriptions (1 protein, ~1 MB) and ZINC 3-D ligand structures (2M structures, ~6 GB). DOCK6 and FRED receptor files (1 per protein, each defining the pocket to bind to) are prepared manually.
    - DOCK6 / FRED docking: ~4M tasks × 60 s × 1 CPU (~60K CPU-hrs); select best ~5K complexes
    - Amber scoring (AmberizeLigand, AmberizeReceptor, AmberizeComplex, build and run NAB script; parameters define flexible residues and number of MD steps): ~10K tasks × 20 min × 1 CPU (~3K CPU-hrs); select best ~500
    - GCMC: ~500 tasks × 10 hr × 100 CPUs (~500K CPU-hrs)
    For 1 target: ~4 million tasks, ~500,000 CPU-hrs (50 CPU-years).
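Structurally, the workflow is a funnel: a cheap scoring stage over millions of candidates, followed by progressively more expensive stages over progressively fewer survivors. The sketch below shows only that shape; the dock6, amber_score, and gcmc commands and the score-parsing convention are hypothetical, everything here runs sequentially, and in practice these stages are dispatched in parallel through Swift and Falkon.

```python
# Hedged sketch of the screening funnel (hypothetical commands; larger score
# treated as better purely for illustration).
import heapq
import subprocess

def run_scored(cmd, item):
    # Run a black-box scoring step and parse a single float score from stdout.
    out = subprocess.run(cmd + [item], capture_output=True, text=True)
    return float(out.stdout.strip()), item

def best(items, cmd, keep):
    # Score every item with the given tool and keep the top `keep`.
    scores = [run_scored(cmd, item) for item in items]
    return [item for _, item in heapq.nlargest(keep, scores)]

ligands = [f"ligand_{i}.mol2" for i in range(2_000_000)]              # ZINC structures
stage1 = best(ligands, ["dock6", "--receptor", "rec.pdb"], 5_000)     # cheap docking
stage2 = best(stage1, ["amber_score"], 500)                           # MD rescoring
final  = best(stage2, ["gcmc"], 50)                                   # most expensive stage
```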
14. DOCK on SiCortex
    - CPU cores: 5,760
    - Tasks: 92,160
    - Elapsed time: 12,821 s
    - Compute time: 1.94 CPU-years
    - Average task time: 660.3 s
    (Does not include ~800 s to stage input data.)
    Ioan Raicu, Zhao Zhang
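A quick consistency check on these numbers: 92,160 tasks × 660.3 s per task ≈ 6.09 × 10⁷ CPU-seconds ≈ 1.93 CPU-years (taking one CPU-year ≈ 3.16 × 10⁷ s), in line with the reported 1.94 CPU-years; spread across 5,760 cores for 12,821 s of elapsed time, that works out to roughly 82% utilization.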
15. DOCK on BG/P: ~1M Tasks on 118,000 CPUs
    - CPU cores: 118,784
    - Tasks: 934,803
    - Elapsed time: 7,257 s
    - Compute time: 21.43 CPU-years
    - Average task time: 667 s
    - Relative efficiency: 99.7% (scaling from 16 to 32 racks)
    - Utilization: 99.6% sustained, 78.3% overall
    - Per-task I/O from GPFS: 1 script (~5 KB), 2 file reads (~10 KB), 1 file write (~10 KB)
    - Cached in RAM from GPFS on first task per node: 1 binary (~7 MB), static input data (~45 MB)
    Ioan Raicu, Zhao Zhang, Mike Wilde
16. Managing 120K CPUs with Falkon
    (Diagram: Falkon dispatches tasks across the machine, using high-speed local disk on compute nodes in preference to slower shared storage.)
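The essential trick when feeding 100K+ cores from a shared filesystem is to pay the shared-storage cost once per node rather than once per task. A simplified per-node worker sketch (paths and file names are assumptions; this is not Falkon's actual implementation):

```python
# Illustrative sketch: copy the application binary and static input from slow
# shared storage to fast node-local disk once, then serve many tasks locally.
import os
import shutil
import subprocess

SHARED = "/gpfs/app"         # slow, shared filesystem (assumed path)
LOCAL = "/tmp/app_cache"     # fast node-local disk (assumed path)

def ensure_local_copy(names=("dock_binary", "static_input.dat")):
    # One-time staging per node; later tasks pay no shared-filesystem cost.
    os.makedirs(LOCAL, exist_ok=True)
    for name in names:
        dst = os.path.join(LOCAL, name)
        if not os.path.exists(dst):
            shutil.copy(os.path.join(SHARED, name), dst)

def run_task(task_input, task_output):
    # Small per-task files still come from shared storage; the heavy,
    # reused data is read from the local copy.
    ensure_local_copy()
    cmd = [os.path.join(LOCAL, "dock_binary"),
           "-data", os.path.join(LOCAL, "static_input.dat"),
           "-in", task_input, "-out", task_output]
    return subprocess.run(cmd).returncode
```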
17. MARS Economic Model Parameter Study
    - 2,048 BG/P CPU cores
    - Tasks: 49,152
    - Micro-tasks: 7,077,888
    - Elapsed time: 1,601 s
    - CPU-hours: 894
    Zhao Zhang, Mike Wilde
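The gap between 49,152 tasks and 7,077,888 micro-tasks implies bundling: 7,077,888 / 49,152 = 144 micro-tasks per dispatched task, so each task amortizes dispatch overhead over many tiny model evaluations. A sketch of that bundling, with a hypothetical mars_model command and a hypothetical 192 × 192 × 192 sweep chosen only because it matches the micro-task count:

```python
# Hedged sketch of micro-task bundling; only the 144-per-task bundle size is
# derived from the slide's numbers. The "mars_model" command is hypothetical.
import subprocess
from itertools import islice, product

def bundles(iterable, size):
    # Group tiny parameter points into batches that amortize dispatch overhead.
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def run_batch(batch):
    # One dispatched task evaluates many parameter points in a single process.
    points = [f"{a},{b},{c}" for a, b, c in batch]
    return subprocess.run(["mars_model", "--points", ";".join(points)]).returncode

# Hypothetical 3-D sweep: 192 * 192 * 192 = 7,077,888 parameter points.
micro_tasks = product(range(192), range(192), range(192))

for batch in bundles(micro_tasks, 144):   # 7,077,888 / 144 = 49,152 tasks
    run_batch(batch)
```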
18. (Figure-only slide.)
19. AstroPortal Stacking Service
    - Purpose: on-demand “stacks” of random locations within a ~10 TB dataset
    - Challenge: rapid access to 10-10K “random” files under time-varying load
    - Sample workloads
    (Diagram: many image cutouts from Sloan data, requested via a web page or web service, are summed into a single stacked image.)
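The stacking operation itself is simple: cut the same sky position out of many survey images and co-add the cutouts, which raises signal-to-noise for sources too faint to see in any single image. A toy NumPy sketch (the data here are random arrays standing in for FITS images; the real service reads Sloan data):

```python
# Toy stacking sketch; cutout size and positions are made up for illustration.
import numpy as np

def cutout(image, x, y, half=16):
    # Extract a small square region around (x, y) from one survey image.
    return image[y - half:y + half, x - half:x + half]

def stack(images, positions):
    # Co-add many cutouts of the same object taken from different images.
    acc = np.zeros((32, 32))
    for img, (x, y) in zip(images, positions):
        acc += cutout(img, x, y)
    return acc / len(positions)          # averaged stack

# Toy usage: 100 random "images" with the object at the same pixel position.
rng = np.random.default_rng(0)
images = [rng.normal(size=(256, 256)) for _ in range(100)]
positions = [(128, 128)] * 100
stacked = stack(images, positions)
```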
20. AstroPortal Stacking Service with Data Diffusion
    - Aggregate throughput: 39 Gb/s, 10X higher than GPFS
    - Reduced load on GPFS: 0.49 Gb/s, 1/10 of the original load
    - Big performance gains as locality increases
    Ioan Raicu, 11:15am tomorrow
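Data diffusion routes tasks toward workers that already cache the needed files on local disk, so the aggregate cache, and therefore throughput, grows as locality increases. A much-simplified scheduler sketch (my own simplification, not the actual Falkon policy):

```python
# Hedged sketch of cache-aware dispatch for data diffusion.
from collections import defaultdict
from itertools import cycle

class DiffusionScheduler:
    def __init__(self, workers):
        self.cache = defaultdict(set)      # worker -> set of files cached on its local disk
        self.fallback = cycle(workers)     # round-robin choice for cache misses

    def assign(self, task_file):
        # Prefer locality: any worker that already holds the file.
        for worker, files in self.cache.items():
            if task_file in files:
                return worker, "local-disk hit"
        # Miss: pick a worker, which will fetch the file from shared storage
        # and cache it, growing the aggregate cache as the workload runs.
        worker = next(self.fallback)
        self.cache[worker].add(task_file)
        return worker, "fetched from shared storage"

sched = DiffusionScheduler(["node-%d" % i for i in range(4)])
for f in ["a.fits", "b.fits", "a.fits", "a.fits", "c.fits"]:
    print(f, "->", sched.assign(f))
```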
21. B. Berriman, J. Good (Caltech); J. Jacob, D. Katz (JPL)
22. Montage Benchmark (Yong Zhao, Ioan Raicu, U. Chicago)
    - MPI: ~950 lines of C for one stage
    - Pegasus: ~1,200 lines of C + tools to generate a DAG for a specific dataset
    - SwiftScript: ~92 lines for any dataset
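The line-count gap is largely about what the script has to say: a scripting approach names a stage once and applies it to whatever inputs are present, instead of hard-coding the dataset or generating a per-dataset DAG. The sketch below makes that point in Python rather than SwiftScript, with a placeholder reproject_one executable standing in for a Montage stage.

```python
# Hedged sketch: apply one black-box stage to every input file, in parallel.
# "reproject_one" is a placeholder, not an actual Montage executable name.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def reproject(image):
    out = image.replace(".fits", "_proj.fits")
    subprocess.run(["reproject_one", image, out], check=True)
    return out

with ThreadPoolExecutor(max_workers=64) as pool:
    projected = list(pool.map(reproject, glob.glob("raw/*.fits")))
print(f"reprojected {len(projected)} images")
```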
23. Summary
    - Peta- and exa-scale computers enable us to tackle new problems at greater scales: parameter studies, ensembles, interactive data analysis, “workflows” of various kinds
    - Such apps frequently stress petascale hardware and software in interesting ways
    - New programming models and tools are required: mixed task/data parallelism, task management, complex data management, failure handling, ...
    - Tools (DAGMan, Swift, Hadoop, ...) exist but need refinement
    - Interesting connections to distributed systems
    More info: www.ci.uchicago.edu/swift
