Extreme scripting and other adventures in data-intensive computing. Ian Foster, Allan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang. Computation Institute, Argonne National Lab & University of Chicago
How data analysis happens at data-intensive computing workshops
How data analysis really happens in scientific laboratories
% foo file1 > file2
% bar file2 > file3
% foo file1 | bar > file3
% foreach f (f1 f2 f3 f4 f5 f6 f7 … f100)
foreach? foo $f.in | bar > $f.out
foreach? end
%
% Now where on earth is f98.out, and how did I generate it again?
Now: command not found.
%
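The lost-provenance problem in the session above can be sketched in Python. This is a minimal illustration, not how Swift records provenance: `run_logged`, `how_was_it_made`, and the `provenance.jsonl` log file are hypothetical names, and a Python one-liner that upper-cases its input stands in for the `foo | bar` pipeline, which is not a real program.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(argv, stdin_path, stdout_path, log_path):
    """Run argv with redirected stdin/stdout and append a provenance record."""
    with open(stdin_path, "rb") as src, open(stdout_path, "wb") as dst:
        subprocess.run(argv, stdin=src, stdout=dst, check=True)
    record = {"output": str(stdout_path), "input": str(stdin_path), "argv": argv}
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")


def how_was_it_made(output_path, log_path):
    """Return the logged records that produced output_path."""
    with open(log_path) as log:
        records = [json.loads(line) for line in log]
    return [r for r in records if r["output"] == str(output_path)]


# Stand-in for "foo | bar" (assumption): upper-case whatever arrives on stdin.
STAND_IN = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]

workdir = Path(tempfile.mkdtemp())
log_file = workdir / "provenance.jsonl"
for i in range(1, 4):
    (workdir / f"f{i}.in").write_text(f"data {i}\n")
    run_logged(STAND_IN, workdir / f"f{i}.in", workdir / f"f{i}.out", log_file)

# "Where is f3.out, and how did I generate it again?"
history = how_was_it_made(workdir / "f3.out", log_file)
```

Because every run goes through one wrapper, the question the slide asks about f98.out becomes a lookup in the log rather than a search through shell history.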
Extreme scripting. From simple scripts to complex scripts (Swift): many activities, numerous files, complex data, data dependencies, many programs. From small computers to big computers: many processors, storage hierarchy, failure, heterogeneity. Preserving file system semantics, ability to call arbitrary executables.
Functional magnetic resonance imaging (fMRI) data analysis
AIRSN program definition
(Run or) reorientRun (Run ir, string direction) {
    foreach Volume iv, i in ir.v {
        or.v[i] = reorient(iv, direction);
    }
}

(Run snr) functional (Run r, NormAnat a, Air shrink) {
    Run yroRun = reorientRun( r , "y" );
    Run roRun = reorientRun( yroRun , "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, 0.1 );
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", "null" );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    Run nr = reslice_warp_run( boldNormWarp, roRun );
    Volume meanAll = strictmean( nr, "y", "null" );
    Volume boldMask = binarize( meanAll, "y" );
    snr = gsmoothRun( nr, boldMask, "6 6 6" );
}
Many many tasks: Identifying potential drug targets. 2M+ ligands x protein target(s). Benoit Roux et al.
Docking workflow for one target (flattened diagram):
Inputs: ZINC 3-D structures, 2M structures (6 GB); PDB protein descriptions, 1 protein (1MB).
Manual prep: DOCK6 rec file and FRED rec file, giving a DOCK6 Receptor and a FRED Receptor (1 per protein: defines pocket to bind to).
NAB script parameters (defines flexible residues, #MDsteps) + NAB Script Template -> BuildNABScript -> NAB Script.
FRED and DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each.
Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500. Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript. Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript.
GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs.
Outputs: report, ligands, complexes.
For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years).
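The CPU-hour estimates on this slide follow from tasks x time per task x CPUs per task. A small Python sketch of that arithmetic, using only the figures quoted above (the helper name `cpu_hours` and the stage labels are illustrative), shows that the slide's totals are rounded:

```python
def cpu_hours(tasks, seconds_per_task, cpus_per_task=1):
    """CPU-hours consumed by a stage: tasks x duration x CPUs, converted from seconds."""
    return tasks * seconds_per_task * cpus_per_task / 3600.0


stages = {
    # ~4M docking tasks of ~60 s each on 1 CPU (FRED and DOCK6 together)
    "FRED + DOCK6": cpu_hours(4_000_000, 60),
    # ~10K Amber tasks of ~20 min each on 1 CPU
    "Amber": cpu_hours(10_000, 20 * 60),
    # ~500 GCMC tasks of ~10 hr each on 100 CPUs
    "GCMC": cpu_hours(500, 10 * 3600, cpus_per_task=100),
}
total = sum(stages.values())

for name, hours in stages.items():
    print(f"{name:>12}: {hours:>10,.0f} cpu-hrs")
print(f"{'total':>12}: {total:>10,.0f} cpu-hrs")
```

The GCMC stage dominates: at roughly 500K cpu-hrs it accounts for most of the per-target cost, which is why the slide's rounded total of 500,000 cpu-hrs is close to the GCMC figure alone.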
IBM BG/P 570 Teraflop/s, 164,000 cores, 80 TB
DOCK on BG/P: ~1M tasks on 119,000 CPUs. 118784 cores; 934803 tasks; elapsed time: 7257 sec; compute time: 21.43 CPU years; average task: 667 sec. Relative efficiency 99.7% (from 16 to 32 racks). Utilization: 99.6% sustained, 78.3% overall. [Plot over time (sec) omitted.] Ioan Raicu et al.
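The overall utilization figure can be checked from the other numbers on the slide: it is the CPU time spent computing divided by the CPU time available (cores x elapsed time). The Python sketch below assumes a 365-day year when converting CPU years to seconds; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

cores = 118_784
tasks = 934_803
elapsed_sec = 7_257
compute_cpu_years = 21.43

# CPU-seconds actually spent inside tasks
compute_cpu_sec = compute_cpu_years * SECONDS_PER_YEAR
# CPU-seconds available while the run held the machine
available_cpu_sec = cores * elapsed_sec

overall_utilization = compute_cpu_sec / available_cpu_sec
tasks_per_sec = tasks / elapsed_sec

print(f"overall utilization: {overall_utilization:.1%}")
print(f"mean throughput: {tasks_per_sec:.1f} tasks/sec")
```

Under that assumption the ratio comes out within a fraction of a percentage point of the 78.3% quoted on the slide.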
Managing 160,000 cores: Falkon; high-speed local “disk”; slower shared storage
Scaling POSIX to petascale. [Diagram labels: global file system; staging of a large dataset over the torus and tree interconnects; Chirp (multicast); MosaStore (striping); CN-striped intermediate file system (IFS) with IFS segments on IFS compute nodes; ZOID on I/O node; local file systems (LFS) on compute nodes holding local datasets.]
Efficiency for 4-second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors
Provisioning for data-intensive workloads. Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey (Sloan data). Challenges: random data access, much computing, time-varying load. Solution: dynamic acquisition of compute & storage; data diffusion. Ioan Raicu
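One way to picture data diffusion is a dispatcher that prefers a node already caching a task's input and falls back to the least-loaded node, which then fetches the file from shared storage and caches it. The Python sketch below is a toy model under that assumption, not Falkon's scheduler: the `Node` class, the `dispatch` function, the LRU eviction policy, and the cache sizes are all hypothetical.

```python
from collections import OrderedDict


class Node:
    """A compute node with a small LRU cache of file names (hypothetical model)."""

    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()
        self.tasks_run = 0

    def run(self, file_name):
        """Run one task on file_name; return True on a cache hit."""
        self.tasks_run += 1
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # mark as most recently used
            return True
        self.cache[file_name] = True  # fetched from shared storage, now cached
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)  # evict the least recently used file
        return False


def dispatch(file_name, nodes):
    """Prefer a node that already caches the file; otherwise the least-loaded node."""
    holders = [n for n in nodes if file_name in n.cache]
    candidates = holders if holders else nodes
    return min(candidates, key=lambda n: n.tasks_run)


nodes = [Node(f"node{i}", cache_slots=2) for i in range(2)]
workload = ["a", "b", "a", "b", "c", "a"]
hits = sum(dispatch(f, nodes).run(f) for f in workload)
print(f"{hits} of {len(workload)} tasks read their input from a local cache")
```

In this toy run, repeated accesses to the same file land on the node that already holds it, so only the first access to each file touches shared storage.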
“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node. Ioan Raicu
Same scenario, but with dynamic resource provisioning
Data diffusion sine-wave workload: Summary
GPFS     5.70 hrs,   ~8Gb/s,  1138 CPU hrs
DD+SRP   1.80 hrs,  ~25Gb/s,   361 CPU hrs
DD+DRP   1.86 hrs,  ~24Gb/s,   253 CPU hrs
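The trade-off in this summary is easier to see as ratios against the GPFS baseline. The Python sketch below uses only the three rows above; the dictionary layout and the names `speedup` and `cpu_saving` are illustrative, and it assumes DD+SRP and DD+DRP denote data diffusion with static and dynamic resource provisioning respectively.

```python
results = {
    "GPFS":   {"hours": 5.70, "gbps": 8,  "cpu_hours": 1138},
    "DD+SRP": {"hours": 1.80, "gbps": 25, "cpu_hours": 361},
    "DD+DRP": {"hours": 1.86, "gbps": 24, "cpu_hours": 253},
}

baseline = results["GPFS"]
summary = {}
for name, row in results.items():
    summary[name] = {
        # how many times faster than the GPFS run
        "speedup": baseline["hours"] / row["hours"],
        # fraction of the GPFS run's CPU hours that was saved
        "cpu_saving": 1 - row["cpu_hours"] / baseline["cpu_hours"],
    }

for name, s in summary.items():
    print(f"{name:>7}: {s['speedup']:.2f}x faster, "
          f"{s['cpu_saving']:.0%} fewer CPU hours than GPFS")
```

Both data diffusion runs finish roughly three times faster than GPFS; the DD+DRP run takes slightly longer than DD+SRP but consumes noticeably fewer CPU hours.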
Data-intensive computing @ Computation Institute: Example applications. Astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics.
Sequencing outpaces Moore’s law. [Plot labels: Gigabases (454, Solexa, next-gen Solexa); BLAST on EC2, US$.] Folker Meyer, Computation Institute
Data-intensive computing @ Computation Institute: Hardware. PADS: Petascale Active Data Store (NSF MRI). 180 TB, 180 GB/s; 17 Top/s analysis; 500 TB reliable storage (data & metadata); 1000 TB tape backup. Diverse data sources; diverse users; data ingest; parallel analysis; remote access; dynamic provisioning; offload to remote data centers.
Data-intensive computing @ Computation Institute: Software. HPC systems software (MPICH, PVFS, ZeptOS); collaborative data tagging (GLOSS); data integration (XDTM); HPC data analytics and visualization; loosely coupled parallelism (Swift, Hadoop); dynamic provisioning (Falkon); service authoring (Introduce, caGrid, gRAVI); provenance recording and query (Swift); service composition and workflow (Taverna); virtualization management (Workspace Service); distributed data management (GridFTP, etc.).
Data-intensive computing is an end-to-end problem. [Diagram: axes “Agreement about outcomes” (low to high) and “Certainty about outcomes” (low to high); regions: Plan and control, Zone of complexity, Chaos.] Ralph Stacey, Complexity and Creativity in Organizations, 1996
We need to function in the zone of complexity. [Diagram: axes “Agreement about outcomes” (low to high) and “Certainty about outcomes” (low to high); regions: Plan and control, Chaos.] Ralph Stacey, Complexity and Creativity in Organizations, 1996
The Grid paradigm. Principles and mechanisms for dynamic virtual organizations. Leverage service oriented architecture. Loose coupling of data and services. Open software, architecture. [Timeline 1995, 2000, 2005, 2010 with application areas: Engineering, Biomedicine, Computer science, Physics, Healthcare, Astronomy, Biology.]
As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical)
Multi-center clinical cancer trials image capture and review (Center for Health Informatics)
Summary. Extreme scripting offers the potential for easy scaling of proven working practices. Interesting technical problems relating to programming and I/O models. Many wonderful applications. Data-intensive computing is an end-to-end problem: data generation, integration, analysis, etc., is a continuous, loosely coupled process.
Thank you! Computation Institute, www.ci.uchicago.edu

More Related Content

What's hot

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Apache spark session
Apache spark sessionApache spark session
Apache spark sessionknowbigdata
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysisYi-Feng Chang
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
RasterFrames + STAC
RasterFrames + STACRasterFrames + STAC
RasterFrames + STACSimeon Fitch
 
MotiFinder: locating motifs in the human genome
MotiFinder:  locating motifs in the human genomeMotiFinder:  locating motifs in the human genome
MotiFinder: locating motifs in the human genomeJerome Nilmeier
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekingeProf. Wim Van Criekinge
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and researchBrianna McHorse
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud ExperiencesGuy Coates
 
ProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyJuan Antonio Vizcaino
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redactedRyan Breed
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020John Zedlewski
 
CrateDB 101: Geospatial data
CrateDB 101: Geospatial dataCrateDB 101: Geospatial data
CrateDB 101: Geospatial dataClaus Matzinger
 
Semantic Web Technologies in Health Care Analytics
Semantic Web Technologies in Health Care AnalyticsSemantic Web Technologies in Health Care Analytics
Semantic Web Technologies in Health Care AnalyticsRobert Piro
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisHenry Schreiner
 

What's hot (20)

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Apache spark session
Apache spark sessionApache spark session
Apache spark session
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis20141219 workshop methylation sequencing analysis
20141219 workshop methylation sequencing analysis
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
RasterFrames + STAC
RasterFrames + STACRasterFrames + STAC
RasterFrames + STAC
 
MotiFinder: locating motifs in the human genome
MotiFinder:  locating motifs in the human genomeMotiFinder:  locating motifs in the human genome
MotiFinder: locating motifs in the human genome
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and research
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
ProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easy
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redacted
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
 
CrateDB 101: Geospatial data
CrateDB 101: Geospatial dataCrateDB 101: Geospatial data
CrateDB 101: Geospatial data
 
Semantic Web Technologies in Health Care Analytics
Semantic Web Technologies in Health Care AnalyticsSemantic Web Technologies in Health Care Analytics
Semantic Web Technologies in Health Care Analytics
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for Analysis
 

Viewers also liked

AAPM Foster July 2009
AAPM Foster July 2009AAPM Foster July 2009
AAPM Foster July 2009Ian Foster
 
Services for Science
Services for ScienceServices for Science
Services for ScienceIan Foster
 
Sociology Of The Grid May 2009
Sociology Of The Grid May 2009Sociology Of The Grid May 2009
Sociology Of The Grid May 2009Ian Foster
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009Ian Foster
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Big Process for Big Data
Big Process for Big DataBig Process for Big Data
Big Process for Big DataIan Foster
 
GlobusWorld 2012 Foster Keynote
GlobusWorld 2012 Foster KeynoteGlobusWorld 2012 Foster Keynote
GlobusWorld 2012 Foster KeynoteIan Foster
 
building global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudbuilding global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudIan Foster
 
Delivering a Campus Research Data Service with Globus
Delivering a Campus Research Data Service with GlobusDelivering a Campus Research Data Service with Globus
Delivering a Campus Research Data Service with GlobusIan Foster
 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008Ian Foster
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridIan Foster
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Ian Foster
 
E science foster december 2010
E science foster december 2010E science foster december 2010
E science foster december 2010Ian Foster
 
Services for Science v2 (APAN26)
Services for Science v2 (APAN26)Services for Science v2 (APAN26)
Services for Science v2 (APAN26)Ian Foster
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersIan Foster
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceIan Foster
 

Viewers also liked (19)

AAPM Foster July 2009
AAPM Foster July 2009AAPM Foster July 2009
AAPM Foster July 2009
 
Services for Science
Services for ScienceServices for Science
Services for Science
 
Sociology Of The Grid May 2009
Sociology Of The Grid May 2009Sociology Of The Grid May 2009
Sociology Of The Grid May 2009
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Big Process for Big Data
Big Process for Big DataBig Process for Big Data
Big Process for Big Data
 
GlobusWorld 2012 Foster Keynote
GlobusWorld 2012 Foster KeynoteGlobusWorld 2012 Foster Keynote
GlobusWorld 2012 Foster Keynote
 
building global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudbuilding global software/earthcube->sciencecloud
building global software/earthcube->sciencecloud
 
Delivering a Campus Research Data Service with Globus
Delivering a Campus Research Data Service with GlobusDelivering a Campus Research Data Service with Globus
Delivering a Campus Research Data Service with Globus
 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
 
E science foster december 2010
E science foster december 2010E science foster december 2010
E science foster december 2010
 
Services for Science v2 (APAN26)
Services for Science v2 (APAN26)Services for Science v2 (APAN26)
Services for Science v2 (APAN26)
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy Science
 

Similar to Extreme Scripting July 2009

Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009Ian Foster
 
Plank
PlankPlank
PlankFNian
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shangSAIL_QU
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsHeiko Joerg Schick
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Ben Busby
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Node Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionNode Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionYunong Xiao
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataZhong Wang
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceSpeck&Tech
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconPeter Lawrey
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentreSteve Loughran
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the CloudDataMine Lab
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 

Similar to Extreme Scripting July 2009 (20)

Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Plank
PlankPlank
Plank
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Node Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In ProductionNode Interactive Debugging Node.js In Production
Node Interactive Debugging Node.js In Production
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentre
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the Cloud
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 

More from Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumIan Foster
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsIan Foster
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryIan Foster
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon SummaryIan Foster
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperabilityIan Foster
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 

Extreme Scripting July 2009

  • 1. Extreme scripting and other adventures in data-intensive computing. Ian Foster, Allan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang. Computation Institute, Argonne National Lab & University of Chicago
  • 2. How data analysis happens at data-intensive computing workshops
  • 3. How data analysis really happens in scientific laboratories
      % foo file1 > file2
      % bar file2 > file3
      % foo file1 | bar > file3
      % foreach f (f1 f2 f3 f4 f5 f6 f7 … f100)
      foreach? foo $f.in | bar > $f.out
      foreach? end
      %
      % Now where on earth is f98.out, and how did I generate it again?
      Now: command not found.
      %
  • 4. Extreme scripting: Complex scripts; Swift; Many activities; Numerous files; Complex data; Data dependencies; Many programs; Simple scripts; Big computers; Small computers; Many processors; Storage hierarchy; Failure; Heterogeneity. Preserving file system semantics, ability to call arbitrary executables
  • 5. Functional magnetic resonance imaging (fMRI) data analysis
  • 6. AIRSN program definition
      // Apply reorient to every volume of the input run.
      (Run or) reorientRun (Run ir, string direction) {
        foreach Volume iv, i in ir.v {
          or.v[i] = reorient(iv, direction);
        }
      }
      (Run snr) functional (Run r, NormAnat a, Air shrink) {
        Run yroRun = reorientRun( r, "y" );
        Run roRun = reorientRun( yroRun, "x" );
        Volume std = roRun[0];
        Run rndr = random_select( roRun, 0.1 );
        AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
        Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
        Volume meanRand = softmean( reslicedRndr, "y", "null" );
        Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
        Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
        Run nr = reslice_warp_run( boldNormWarp, roRun );
        Volume meanAll = strictmean( nr, "y", "null" );
        Volume boldMask = binarize( meanAll, "y" );
        snr = gsmoothRun( nr, boldMask, "6 6 6" );
      }
  • 7. Many many tasks: Identifying potential drug targets. 2M+ ligands x protein target(s). Benoit Roux et al.
  • 8. Docking workflow (flowchart labels: start, report, ligands, complexes, end)
      Inputs: ZINC 3-D structures, 2M structures (6 GB); PDB protein descriptions, 1 protein (1MB)
      Manually prep DOCK6 rec file; manually prep FRED rec file
      DOCK6 Receptor (1 per protein: defines pocket to bind to); FRED Receptor (1 per protein: defines pocket to bind to)
      FRED, DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
      Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
      GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
      BuildNABScript: NAB script parameters (defines flexible residues, #MDsteps); NAB Script Template; NAB Script
      Amber prep: 2. AmberizeReceptor; 4. perl: gen nabscript
      Amber Score: 1. AmberizeLigand; 3. AmberizeComplex; 5. RunNABScript
      For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years)
  • 9.
  • 10. IBM BG/P 570 Teraflop/s, 164,000 cores, 80 TB
  • 11. DOCK on BG/P: ~1M tasks on 119,000 CPUs. 118784 cores; 934803 tasks; elapsed time: 7257 sec; compute time: 21.43 CPU years; average task: 667 sec. Relative efficiency: 99.7% (from 16 to 32 racks). Utilization: 99.6% sustained, 78.3% overall. (Plot axis: Time (sec).) Ioan Raicu et al.
  • 12. Managing 160,000 cores: Falkon; high-speed local “disk”; slower shared storage
  • 13. Scaling Posix to petascale. Global file system; Chirp (multicast); MosaStore (striping); staging; torus and tree interconnects; intermediate: CN-striped intermediate file system; large dataset; IFS compute node; IFS seg; LFS; ZOID on I/O node; compute node (local datasets); local
  • 14. Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors
  • 15. Provisioning for data-intensive workloads. Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey. Challenges: random data access; much computing; time-varying load. Solution: dynamic acquisition of compute & storage; data diffusion. Sloan Data. Ioan Raicu
  • 16. “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node. Ioan Raicu
  • 17. Same scenario, but with dynamic resource provisioning
  • 18. Data diffusion sine-wave workload: Summary. GPFS: 5.70 hrs, ~8Gb/s, 1138 CPU hrs. DD+SRP: 1.80 hrs, ~25Gb/s, 361 CPU hrs. DD+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
  • 19. Data-intensive computing @ Computation Institute: Example applications. Astrophysics; Cognitive science; East Asian studies; Economics; Environmental science; Epidemiology; Genomic medicine; Neuroscience; Political science; Sociology; Solid state physics
  • 20.
  • 21. Sequencing outpaces Moore’s law. (Chart labels: Gigabases; 454; Solexa; Next-gen Solexa; BLAST on EC2, US$.) Folker Meyer, Computation Institute
  • 22. Data-intensive computing @ Computation Institute: Hardware. PADS: Petascale Active Data Store (NSF MRI). 1000 TB tape backup; 500 TB reliable storage (data & metadata); 180 TB, 180 GB/s; 17 Top/s analysis. Diverse data sources; diverse users; data ingest; dynamic provisioning; parallel analysis; remote access; offload to remote data centers
  • 23. Data-intensive computing @ Computation Institute: Software. HPC systems software (MPICH, PVFS, ZeptOS); Collaborative data tagging (GLOSS); Data integration (XDTM); HPC data analytics and visualization; Loosely coupled parallelism (Swift, Hadoop); Dynamic provisioning (Falkon); Service authoring (Introduce, caGrid, gRAVI); Provenance recording and query (Swift); Service composition and workflow (Taverna); Virtualization management (Workspace Service); Distributed data management (GridFTP, etc.)
  • 24. Data-intensive computing is an end-to-end problem. (Diagram axes, each labeled Low and High: agreement about outcomes; certainty about outcomes. Regions: Plan and control; Zone of complexity; Chaos.) Ralph Stacey, Complexity and Creativity in Organizations, 1996
  • 25. We need to function in the zone of complexity. (Diagram axes, each labeled Low and High: agreement about outcomes; certainty about outcomes. Regions: Plan and control; Chaos.) Ralph Stacey, Complexity and Creativity in Organizations, 1996
  • 26. The Grid paradigm. Principles and mechanisms for dynamic virtual organizations; leverage service oriented architecture; loose coupling of data and services; open software, architecture. (Timeline 1995, 2000, 2005, 2010 with fields: Engineering; Biomedicine; Computer science; Physics; Healthcare; Astronomy; Biology.)
  • 27. As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
  • 28. Multi-center clinical cancer trials image capture and review (Center for Health Informatics)
  • 29. Summary. Extreme scripting offers the potential for easy scaling of proven working practices; interesting technical problems relating to programming and I/O models; many wonderful applications. Data-intensive computing is an end-to-end problem: data generation, integration, analysis, etc., is a continuous, loosely coupled process
  • 30. Thank you! Computation Institute, www.ci.uchicago.edu
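
The figures quoted on slides 8, 11 and 18 can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check: every input is copied from those slides, while the variable names, the 365-day year, and the formula "overall utilization = compute time / (cores x elapsed time)" are assumptions made here, not something the deck states.

```python
# Back-of-envelope checks of numbers quoted on slides 8, 11 and 18.
# Inputs come from the slides; names and formulas are illustrative assumptions.

SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

# Slide 8: cost of each docking stage, in CPU-hours.
fred_dock6 = 4_000_000 * 60 / SECONDS_PER_HOUR  # ~4M tasks x 60 s x 1 cpu
amber = 10_000 * 20 / 60                        # ~10K tasks x 20 min x 1 cpu
gcmc = 500 * 10 * 100                           # ~500 tasks x 10 hr x 100 cpu
total_cpu_hours = fred_dock6 + amber + gcmc
total_cpu_years = total_cpu_hours / HOURS_PER_YEAR

# Slide 11: assumed definition of overall utilization.
cores, elapsed_s, compute_cpu_years = 118_784, 7_257, 21.43
compute_s = compute_cpu_years * HOURS_PER_YEAR * SECONDS_PER_HOUR
overall_utilization = compute_s / (cores * elapsed_s)

# Slide 18: (wall-clock hours, CPU hours) for each configuration.
runs = {"GPFS": (5.70, 1138), "DD+SRP": (1.80, 361), "DD+DRP": (1.86, 253)}
base_hours, base_cpu = runs["GPFS"]
speedup = {name: base_hours / hours for name, (hours, _) in runs.items()}
cpu_saving = {name: 1 - cpu / base_cpu for name, (_, cpu) in runs.items()}

if __name__ == "__main__":
    print(f"slide 8 total: {total_cpu_hours:,.0f} cpu-hrs "
          f"({total_cpu_years:.0f} cpu-years)")
    print(f"slide 11 overall utilization: {overall_utilization:.1%}")
    for name in ("DD+SRP", "DD+DRP"):
        print(f"slide 18 {name}: {speedup[name]:.2f}x faster than GPFS, "
              f"{cpu_saving[name]:.0%} fewer CPU hours")
```

Under these assumptions the three stage costs sum to about 570K cpu-hrs, so the slide's "500,000 cpu-hrs (50 cpu-years)" reads as a rounded figure dominated by the GCMC stage, and the utilization check lands near the quoted 78.3% overall.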

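The `foreach` loop in slide 6's `reorientRun` applies `reorient` to each volume independently, which is the kind of structure a parallel scripting runtime can exploit. As a rough illustration in Python (not the Swift implementation), the same pattern can be written as a parallel map. The `Volume` class, the `reorient` stand-in, and the `max_workers` parameter are all hypothetical; in the real workflow `reorient` would be an external executable.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """Hypothetical stand-in for one fMRI volume file."""
    name: str
    orientation: str = ""


def reorient(volume: Volume, direction: str) -> Volume:
    # Hypothetical placeholder: records the direction instead of
    # invoking an external reorient program.
    return Volume(volume.name, volume.orientation + direction)


def reorient_run(run: list[Volume], direction: str, max_workers: int = 4) -> list[Volume]:
    # Iterations share no state, so they can execute concurrently.
    # Executor.map returns results in input order, like or.v[i] on the slide.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda v: reorient(v, direction), run))


run = [Volume(f"vol{i}") for i in range(5)]
y_run = reorient_run(run, "y")     # mirrors: Run yroRun = reorientRun(r, "y")
ro_run = reorient_run(y_run, "x")  # mirrors: Run roRun = reorientRun(yroRun, "x")

if __name__ == "__main__":
    for v in ro_run:
        print(v.name, v.orientation)
```

The second call consumes the output of the first, so the data dependency between the two stages is expressed simply by passing `y_run` along; a thread pool is used here only because it is the shortest self-contained way to show independent iterations running concurrently.
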
Editor's Notes

  1. Another perspective on the problem. A few words of explanation. If we are deploying a hospital IT system, we are (hopefully) in the bottom left hand corner. “You can’t achieve success via central planning.” Quoted in Crossing the Quality Chasm, p. 312. In our scenarios, we don’t have that ability to control.
  2. What is the alternative? We can put in place mechanisms that facilitate groups with some common goal to form and function. Over time, things change, these groups evolve. If we are successful, they can expand, perhaps merge. Challenges: make this easy. Leverage scale effects.
  3. Principles and mechanisms that have been under development for some years. First CS, then physical sciences, then biology, most recently biomedicine –