Process automation for data-driven science
Talk given at the Materials Genome Initiative Workshop on Building the Materials Innovation Infrastructure: Data and Standards, held May 14-15, 2012 at the U.S. Department of Commerce (Herbert Hoover) building in Washington, DC. I made the case that to deal effectively with BIG DATA, you need BIG PROCESS. I described how Globus Online is addressing that need.

  • Given continued exponential growth along so many dimensions, process efficiencies must improve at a comparable rate just to maintain constant progress.

    1. Process automation for data-driven science
       Ian Foster
       Computation Institute
       Argonne National Laboratory & The University of Chicago
       Talk at Materials Genome Initiative Workshop, May 14-15, DC
       www.ci.anl.gov www.ci.uchicago.edu
    2. Where we want to get to
       Imagine if, when tackling a problem, we could easily, both alone and within a distributed team:
       • Assemble, integrate, and interpret all relevant data, organized within a knowledge network
       • Be informed of anomalies, patterns, and gaps
       • Formulate and evaluate computational models
       • Launch automated processes to test hypotheses and expand the knowledge network
       All within an environment in which productive strategies could be easily scaled and repeated
    3. The attractive vs. the pragmatic
       • Some attractive goals expressed yesterday
         – “Record the complete process used to generate data”
         – “Define standard formats and metadata”
         – “Make users rate data every time they use it”
         – “Eliminate incorrect data from databases”
       • My pragmatic take on how best to proceed
         – “Identify, automate, and streamline key processes to make desirable behaviors easy”
    4. (No text on this slide)
    5. Tripit exemplifies process automation
       • Me: Book flights; Book hotel
       • Other services: Record flights; Suggest hotel; Record hotel; Get weather; Prepare maps; Share info; Check prices; Monitor flight
    6. Process automation for science
       • Steps: Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data
       • Callouts: >5,000 registered users, >4 PB moved; >25,000 registered users, >1 PB access; >45,000 metagenomes, 12 Tbp
    7. A simple take on “big process for science”
       Research Data Management-as-a-Service
       • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
       • PaaS: Globus Integrate, …
    8. Globus Transfer: Data movement
       Research Data Management-as-a-Service
       • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
       • PaaS: Globus Integrate, …
    9. Globus Transfer details
       • Reliable file transfer
         – Easy “fire-and-forget” transfers
         – Automatic fault recovery
         – High performance
         – Across multiple security domains
       • No IT required
         – Software as a Service (SaaS)
           • No client software installation
           • New features automatically available
         – Consolidated support & troubleshooting
         – Works with existing GridFTP servers; Globus Connect for “last mile”
       • >5,000 users, >4 petabytes and 500,000,000 files moved
       • >99.9% uptime in 2012
       Adopted by Advanced Photon Source, NERSC, Blue Waters, campuses
    10. Globus Storage and Globus Collaborate
        Research Data Management-as-a-Service
        • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
        • PaaS: Globus Integrate, …
    11. Globus Storage: For when you want to …
        • Place your data where you want
        • Access it from anywhere via different protocols
        • Update it, version it, and take snapshots
        • Share versions with who you want
        • Synchronize among locations
        Diagram labels: Globus Transfer, HTTP/REST, Desktop sync; Globus Storage volume; Commercial storage service provider, National research center, Campus computing center
    12. Globus Collaborate: For when you want to
        Join with a few or many people to:
        • Share documents
        • Track tasks
        • Send email
        • Share data
        • Do whatever
        With:
        • Common groups
        • Delegated management
    13. Globus Storage & Collaborate in action
        Workflow diagram; recoverable labels:
        • People: Kyle, Bryce (Group: Kyle, Bryce)
        • Globus Storage: Create volume and share with TBI group
        • Globus Connect: Move MRI files to TBI shared volume
        • Globus Nexus: Add Bryce to TBI collaboration
        • Globus Transfer: Copy TBI data to compute cluster
        • Compute DTI (PADS cluster)
        • Globus Transfer: Move DTI results to shared volume
        • Globus Storage: Create snapshot to share with group
        • Globus Connect: Move DTI results to Bryce’s laptop
        • Globus Collaborate: Publish DTI data to TBI web site
        • Storage behind the “TBI” volume: Amazon S3, SDSC Cloud, UChicago Object Store, Cornell Red Cloud
        TBI = Traumatic Brain Injury; DTI = Diffusion Tensor Imaging; MRI = Magnetic Resonance Imaging
    14. Use case: Earth System Grid
        Outsource data transfer to Globus
        – Data download from search
        – Data transfer to another server
        – Replication between sites
        Next step is automated publication
        No ESGF client software needed
    15. Data acquisition, management, analysis
        • Sources: Experiments, Literature, Computations
        • Callout: “don’t forget!”
        Big Data (volume, velocity, variety, variability) demands Big Process in order for discovery to scale
    16. How to proceed
        • Top down:
          – Large-scale integration, standardized formats, common protocols, etc.
          – Good if achieved, but likely to be slow and painful
        • Bottom up:
          – Consider opportunities to encourage useful behaviors via outsourcing and automation
          – Making data accessible is the first (and easiest?) 90%
          – Facilitate sharing, annotation, emergence of (localized) structure, bridging among structures
    17. Acknowledgements
        • Thanks for vital and much appreciated support:
          – DOE Office of Advanced Scientific Computing Research (ASCR)
          – NSF Office of Cyberinfrastructure (OCI)
          – National Institutes of Health
          – The University of Chicago
        • Thanks to the Globus Online team at the University of Chicago and Argonne for their amazing work. See https://www.globusonline.org/about/goteam/
    18. Thank you!
        foster@anl.gov
        foster@uchicago.edu
        www.ci.anl.gov
        www.ci.uchicago.edu
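
The "Globus Transfer details" slide lists "fire-and-forget" transfers with automatic fault recovery. The following is a minimal sketch of that general idea only, not the Globus implementation or its API: it copies a local file, verifies a checksum, and retries with exponential backoff. The names `reliable_copy`, `make_flaky_copy`, and `sha256_of` are hypothetical and introduced purely for illustration, and the local file copy stands in for a real network transfer.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(
    source: Path,
    destination: Path,
    copy: Callable[[Path, Path], object] = shutil.copyfile,
    max_attempts: int = 5,
    base_delay: float = 0.0,
) -> int:
    """Copy a file, retrying on I/O faults or checksum mismatch.

    Hypothetical helper: returns the number of attempts used and raises
    RuntimeError once max_attempts is exhausted.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(source, destination)
            if sha256_of(destination) == expected:
                return attempt
        except OSError:
            pass  # treat as a transient fault and retry
        if attempt < max_attempts:
            # Exponential backoff between attempts: base, 2*base, 4*base, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"transfer failed after {max_attempts} attempts")


def make_flaky_copy(failures: int) -> Callable[[Path, Path], object]:
    """Build a copy function that fails `failures` times, then succeeds.

    Used only to simulate transient faults in the demo below.
    """
    remaining = {"count": failures}

    def flaky(source: Path, destination: Path) -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OSError("simulated network fault")
        shutil.copyfile(source, destination)

    return flaky


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / "scan.dat"
        dst = Path(workdir) / "scan_copy.dat"
        src.write_bytes(b"example payload" * 1000)
        attempts = reliable_copy(src, dst, copy=make_flaky_copy(2))
        print(f"transfer verified after {attempts} attempts")
```

With two simulated faults, the copy succeeds on the third attempt and the caller never handles the intermediate failures, which is the behavior the slide summarizes as "fire-and-forget".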
