A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the POSIX file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago.edu/swift.
Extreme Scripting July 2009
1. Extreme scripting and other adventures in data-intensive computing. Ian Foster, with Allan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang. Computation Institute, Argonne National Laboratory & University of Chicago.
3. How data analysis really happens in scientific laboratories:
% foo file1 > file2
% bar file2 > file3
% foo file1 | bar > file3
% foreach f (f1 f2 f3 f4 f5 f6 f7 … f100)
foreach? foo $f.in | bar > $f.out
foreach? end
%
% Now where on earth is f98.out, and how did I generate it again?
Now: command not found.
%
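The interactive foreach above can be captured in a small script that also answers the "how did I generate f98.out?" question by logging provenance as it runs. This is a minimal sketch, not Swift itself: foo and bar are stand-ins for the real analysis programs (here defined as trivial shell functions so the sketch runs), and the log format is an assumption.

```shell
#!/bin/sh
# Stand-ins for the analysis programs; in practice these would be
# real executables such as Matlab or R batch jobs.
foo() { tr 'a-z' 'A-Z'; }   # hypothetical analysis step
bar() { rev; }              # hypothetical analysis step

# Process every matching input file, recording in provenance.log
# how each output was produced.
process_all() {
  for f in f*.in; do
    [ -e "$f" ] || continue            # skip if no inputs match
    out="${f%.in}.out"
    foo < "$f" | bar > "$out"
    echo "$out <- foo < $f | bar" >> provenance.log
  done
}

process_all
```

Swift generalizes exactly this pattern: it records the data dependencies between files and program invocations, so the runs can be parallelized and replayed rather than retyped at the prompt.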
4. Extreme scripting: Swift takes users from simple scripts on small computers to complex scripts on big computers. Complex scripts mean many activities, numerous files, complex data, data dependencies, and many programs; big computers mean many processors, a storage hierarchy, failure, and heterogeneity. Throughout, Swift preserves file system semantics and the ability to call arbitrary executables.
8. Virtual screening pipeline for one protein target. Inputs: ZINC 3-D structures (2M structures, 6 GB) and PDB protein descriptions (1 protein, 1 MB). FRED: ~4M tasks × 60 s × 1 CPU ≈ 60K CPU-hours; select best ~5K. DOCK6: ~10K tasks × 20 min × 1 CPU ≈ 3K CPU-hours; select best ~5K, then best ~500. GCMC: ~500 tasks × 10 hr × 100 CPU ≈ 500K CPU-hours. Manually prepared per protein: a FRED receptor file and a DOCK6 receptor file (each defines the pocket to bind to), plus a NAB script template and parameters (define flexible residues and number of MD steps) used by BuildNABScript. Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript; Amber score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript. Total for one target: 4 million tasks, 500,000 CPU-hours (50 CPU-years), ending in a report on ligands and complexes.
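Each "select best ~5K" stage in the pipeline is, at its core, a filter over a table of docking scores. A minimal sketch of such a filter, assuming a tab-separated "ligand-id, score" file and treating lower scores as better (the real FRED/DOCK6 output formats differ):

```shell
# select_best N scores.tsv:
# print the IDs of the N ligands with the best (lowest) score.
# Assumed input format, one ligand per line: "<ligand-id>\t<score>"
select_best() {
  n="$1"
  scores="$2"
  sort -k2,2g "$scores" |   # general-numeric sort on the score column
    head -n "$n" |          # keep the N best
    cut -f1                 # emit only the ligand IDs
}
```

The selected IDs then drive the next, more expensive stage, which is what turns a 4M-task screen into a ~10K-task docking run.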
11. DOCK on BG/P: ~1M tasks on 119,000 CPUs. 118,784 cores; 934,803 tasks; elapsed time 7,257 s; compute time 21.43 CPU-years; average task 667 s. Relative efficiency 99.7% (from 16 to 32 racks); utilization 99.6% sustained, 78.3% overall. Ioan Raicu et al.
13. Scaling POSIX to petascale. [Architecture diagram: a global file system holding large datasets feeds, via staging over the torus and tree interconnects, a CN-striped intermediate file system (IFS compute nodes with local file systems and IFS segments, coordinated by ZOID on the I/O node), which in turn serves compute nodes holding local datasets. Related systems: Chirp (multicast), MosaStore (striping).]
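The pattern behind the diagram — stage input from the global file system to faster node-local or intermediate storage, compute against the local copy, stage results back — can be sketched as follows. The paths are illustrative and `wc -l` is a stand-in for the real analysis executable:

```shell
# Stage-in / compute / stage-out against node-local scratch space.
run_staged() {
  global_in="$1"
  global_out="$2"
  scratch="$(mktemp -d)"             # stand-in for node-local storage

  cp "$global_in" "$scratch/input"   # stage in from the global FS

  # Compute against the fast local copy; 'wc -l' stands in for the
  # real application, which reads and writes only local files here.
  wc -l < "$scratch/input" > "$scratch/output"

  cp "$scratch/output" "$global_out" # stage the result back
  rm -rf "$scratch"                  # free the local space
}
```

Keeping the application's reads and writes local is what lets the shared file system see only bulk staging traffic rather than many small POSIX operations.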
14. Efficiency for 4-second tasks and varying data size (1 KB to 1 MB) for CIO and GPFS, up to 32K processors.
15. Provisioning for data-intensive workloads. Example: on-demand "stacking" of arbitrary locations within a ~10 TB sky survey (Sloan data). Challenges: random data access, much computing, time-varying load. Solution: dynamic acquisition of compute and storage; data diffusion. Ioan Raicu.
18. Data diffusion sine-wave workload, summary: GPFS: 5.70 hrs, ~8 Gb/s, 1138 CPU-hrs; DD+SRP: 1.80 hrs, ~25 Gb/s, 361 CPU-hrs; DD+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU-hrs.
19. Data-intensive computing @ Computation Institute: Example applications. Astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics.
21. Sequencing outpaces Moore's law. [Chart: gigabases produced by next-generation sequencers (454, Solexa) over time, versus the cost in US$ of BLAST analysis on EC2.] Folker Meyer, Computation Institute.
22. Data-intensive computing @ Computation Institute: Hardware. PADS: Petascale Active Data Store (NSF MRI). 180 TB at 180 GB/s, 17 Top/s analysis; 500 TB reliable storage (data & metadata); 1000 TB tape backup. Dynamic provisioning, parallel analysis, data ingest from diverse data sources, remote access for diverse users, and offload to remote data centers.
23. Data-intensive computing @ Computation Institute: Software. HPC systems software (MPICH, PVFS, ZeptoOS); collaborative data tagging (GLOSS); data integration (XDTM); HPC data analytics and visualization; loosely coupled parallelism (Swift, Hadoop); dynamic provisioning (Falkon); service authoring (Introduce, caGrid, gRAVI); provenance recording and query (Swift); service composition and workflow (Taverna); virtualization management (Workspace Service); distributed data management (GridFTP, etc.).
24. Data-intensive computing is an end-to-end problem. [Diagram: Stacey's matrix, with agreement about outcomes on one axis and certainty about outcomes on the other; the high-agreement, high-certainty corner is "plan and control", the low/low corner is chaos, and between them lies the zone of complexity.] Ralph Stacey, Complexity and Creativity in Organizations, 1996.
25. We need to function in the zone of complexity. [Same agreement/certainty matrix as the previous slide.] Ralph Stacey, Complexity and Creativity in Organizations, 1996.
26. The Grid paradigm: principles and mechanisms for dynamic virtual organizations; leverage service-oriented architecture; loose coupling of data and services; open software, open architecture. [Timeline, 1995–2010: adoption spreading from computer science and physics to astronomy, engineering, biology, biomedicine, and healthcare.]
27. As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical).
29. Summary. Extreme scripting offers the potential for easy scaling of proven working practices; it raises interesting technical problems relating to programming and I/O models, and serves many wonderful applications. Data-intensive computing is an end-to-end problem: data generation, integration, analysis, etc., is a continuous, loosely coupled process.
Another perspective on the problem, with a few words of explanation. If we are deploying a hospital IT system, we are (hopefully) in the bottom left-hand corner. "You can't achieve success via central planning" (quoted in Crossing the Quality Chasm, p. 312). In our scenarios, we don't have that ability to control.
What is the alternative? We can put in place mechanisms that make it easy for groups with some common goal to form and function. Over time, things change and these groups evolve; if we are successful, they can expand and perhaps merge. The challenges: make this easy, and leverage scale effects.
These principles and mechanisms have been under development for some years: first in computer science, then the physical sciences, then biology, and most recently biomedicine –