Extreme Scripting July 2009

Extreme scriptingand other adventures in data-intensive computing Ian FosterAllan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang Computation Institute Argonne National Lab & University of Chicago

How data analysis happens at data-intensive computing workshops

How data analysis really happensin scientific laboratories %foo file1 > file2 % bar file2 > file3 %foo file1 | bar > file3 % foreachf (f1 f2 f3 f4 f5 f6 f7 … f100) foreach?foo $f.in | bar > $f.out foreach? end % % Now where on earth is f98.out, and how did I generate it again? Now: command not found. %

Extreme scripting Complex scripts Swift Many activities Numerous files Complex data Data dependencies Many programs Simple scripts Big computers Small computers Many processors Storage hierarchy Failure Heterogeneity Preserving file system semantics, ability to call arbitrary executables

Functional magnetic resonance imaging (fMRI) data analysis

AIRSN program definition (Run or) reorientRun (Run ir, string direction) { foreachVolume iv, i in ir.v { or.v[i] = reorient(iv, direction); } } (Run snr) functional ( Run r, NormAnat a, Air shrink ) { Run yroRun = reorientRun( r , "y" ); Run roRun = reorientRun( yroRun , "x" ); Volume std = roRun[0]; Run rndr = random_select( roRun, 0.1 ); AirVectorrndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" ); Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" ); Volume meanRand = softmean( reslicedRndr, "y", "null" ); Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" ); Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir ); Run nr = reslice_warp_run( boldNormWarp, roRun ); Volume meanAll = strictmean( nr, "y", "null" ) Volume boldMask = binarize( meanAll, "y" ); snr = gsmoothRun( nr, boldMask, "6 6 6" ); }

Many many tasks:Identifying potential drug targets 2M+ ligands Protein xtarget(s) Benoit Roux et al.

6 GB 2M structures (6 GB) ~4M x 60s x 1 cpu ~60K cpu-hrs FRED DOCK6 Select best ~5K Select best ~5K ~10K x 20m x 1 cpu ~3K cpu-hrs Amber Select best ~500 ~500 x 10hr x 100 cpu ~500K cpu-hrs GCMC ZINC 3-D structures Manually prep DOCK6 rec file Manually prep FRED rec file NAB scriptparameters (defines flexible residues, #MDsteps) NAB Script Template DOCK6 Receptor (1 per protein: defines pocket to bind to) FRED Receptor (1 per protein: defines pocket to bind to) PDB protein descriptions 1 protein (1MB) BuildNABScript Amber prep: 2. AmberizeReceptor 4. perl: gen nabscript NAB Script start Amber Score: 1. AmberizeLigand 3. AmberizeComplex 5. RunNABScript For 1 target: 4 million tasks500,000 cpu-hrs (50 cpu-years) end report ligands complexes

IBM BG/P 570 Teraflop/s, 164,000 cores, 80 TB

DOCK on BG/P: ~1M tasks on 119,000 CPUs 118784 cores 934803 tasks Elapsed time: 7257 sec Compute time: 21.43 CPU years Average task: 667 sec Time (sec) Relativeefficiency 99.7% (from 16 to 32 racks) Utilization: 99.6% sustained, 78.3% overall Ioan Raicu et al.

Managing 160,000 cores Falkon High-speed local “disk” Slower shared storage

Scaling Posix to petascale Global file system Chirp(multicast) Staging  Torus and tree interconnects  Intermediate CN-striped intermediate file system Largedataset MosaStore(striping) … IFScompute node IFScompute node LFS LFS IFSseg IFSseg ZOID on I/O node Computenode(local datasets) Computenode(local datasets) ZOID IFS Local . . .

Efficiency for 4 second tasks and varying data size(1KB to 1MB) for CIO and GPFS up to 32K processors

+ + + + + + + = Provisioning for data-intensive workloads Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey Challenges Random data access Much computing Time-varying load Solution Dynamic acquisition of compute & storage Data diffusion Sloan Data S IoanRaicu

IoanRaicu “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node

Same scenario, but with dynamic resource provisioning

Data diffusion sine-wave workload: Summary GPFS  5.70 hrs, ~8Gb/s, 1138 CPU hrs DD+SRP  1.80 hrs, ~25Gb/s, 361 CPU hrs DD+DRP  1.86 hrs, ~24Gb/s, 253 CPU hrs

Data-intensive computing @ Computation Institute: Example applications Astrophysics Cognitive science East Asian studies Economics Environmental science Epidemiology Genomic medicine Neuroscience Political science Sociology Solid state physics

Sequencing outpaces Moore’s law BLAST On EC2, US$ Next-gen Solexa 454 Solexa Gigabases Folker Meyer, Computation Institute

Data-intensive computing @ Computation Institute: Hardware 1000 TBtape backup Dynamic provisioning 500 TB reliable storage (data & metadata) Parallel analysis Diversedatasources Remote access P A D S 180 TB, 180 GB/s 17 Top/s analysis Diverseusers Data ingest Offload to remote data centers PADS: Petascale Active Data Store (NSF MRI)

Data-intensive computing @ Computation Institute: Software HPC systems software (MPICH, PVFS, ZeptOS) Collaborative data tagging (GLOSS) Data integration (XDTM) HPC data analytics and visualization Loosely coupled parallelism (Swift, Hadoop) Dynamic provisioning (Falkon) Service authoring (Introduce, caGrid, gRAVI) Provenance recording and query (Swift) Service composition and workflow (Taverna) Virtualization management (Workspace Service) Distributed data management (GridFTP, etc.)

Data-intensive computing is an end-to-end problem Low Chaos Zone ofcomplexity Agreement about outcomes Plan and control High Low High Certainty about outcomes Ralph Stacey, Complexity and Creativity in Organizations, 1996

We need to function in the zone of complexity Low Chaos Agreement about outcomes Plan and control High Low High Certainty about outcomes Ralph Stacey, Complexity and Creativity in Organizations, 1996

The Grid paradigm Principles and mechanisms for dynamic virtual organizations Leverage service oriented architecture Loose coupling of data and services Open software,architecture Engineering Biomedicine Computer science Physics Healthcare Astronomy Biology 1995 2000 2005 2010

As of Oct19, 2008: 122 participants 105 services 70 data 35 analytical

Multi-center clinical cancer trials image capture and review (Center for Health Informatics)

Summary Extreme scripting offers the potential for easy scaling of proven working practices Interesting technical problems relating to programming and I/O models Many wonderful applications Data-intensive computing is an end-to-end problem Data generation, integration, analysis, etc., is a continuous, loosely coupled process

Thank you! Computation Institutewww.ci.uchicago.edu

Extreme Scripting July 2009

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Extreme Scripting July 2009

Similar to Extreme Scripting July 2009 (20)

More from Ian Foster

More from Ian Foster (20)

Recently uploaded

Recently uploaded (20)

Extreme Scripting July 2009

Editor's Notes