Process automation for data-driven science
 

Talk given at the Materials Genome Initiative Workshop on Building the Materials Innovation Infrastructure: Data and Standards, held May 14-15, 2012 at the U.S. Department of Commerce (Herbert Hoover) building in Washington, DC. I made the case that to deal effectively with BIG DATA, you need BIG PROCESS. I described how Globus Online is addressing that need.

Usage Rights

© All Rights Reserved

  • Given continued exponential growth along so many dimensions, process efficiencies must improve at a comparable rate just to maintain constant progress.

Process automation for data-driven science: Presentation Transcript

  • Process automation for data-driven science
    Ian Foster
    Computation Institute, Argonne National Laboratory & The University of Chicago
    Talk at Materials Genome Initiative Workshop, May 14-15, DC
    www.ci.anl.gov www.ci.uchicago.edu
  • Where we want to get to
    Imagine if, when tackling a problem, we could easily, both alone and within a distributed team:
    • Assemble, integrate, and interpret all relevant data, organized within a knowledge network
    • Be informed of anomalies, patterns, and gaps
    • Formulate and evaluate computational models
    • Launch automated processes to test hypotheses & expand the knowledge network
    All within an environment in which productive strategies could be easily scaled and repeated
  • The attractive vs. the pragmatic
    • Some attractive goals expressed yesterday
      – “Record the complete process used to generate data”
      – “Define standard formats and metadata”
      – “Make users rate data every time they use it”
      – “Eliminate incorrect data from databases”
    • My pragmatic take on how best to proceed
      – “Identify, automate, and streamline key processes to make desirable behaviors easy”
  • Tripit exemplifies process automation
    Diagram columns: Me; Other services
    Steps: Book flights; Record flights; Suggest hotel; Book hotel; Record hotel; Get weather; Prepare maps; Share info; Check prices; Monitor flight
  • Process automation for science
    Steps: Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data
    Usage figures shown beside the steps: >5,000 registered users, >4 PB moved; >25,000 registered users, >1 PB access; >45,000 metagenomes, 12 Tbp
  • A simple take on “big process for science”
    Research Data Management-as-a-Service
    • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
    • PaaS: Globus Integrate, …
  • Globus Transfer: Data movement
    Research Data Management-as-a-Service
    • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
    • PaaS: Globus Integrate, …
  • Globus Transfer details
    • Reliable file transfer.
      – Easy “fire-and-forget” transfers
      – Automatic fault recovery
      – High performance
      – Across multiple security domains
    • No IT required.
      – Software as a Service (SaaS)
        • No client software installation
        • New features automatically available
      – Consolidated support & troubleshooting
      – Works with existing GridFTP servers; Globus Connect for “last mile”
    • >5000 users, >4 Petabytes and 500,000,000 files moved
    • >99.9% uptime in 2012
    Adopted by Advanced Photon Source, NERSC, Blue Waters, campuses
  • Globus Storage and Globus Collaborate
    Research Data Management-as-a-Service
    • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
    • PaaS: Globus Integrate, …
  • Globus Storage: For when you want to …
    • Place your data where you want
    • Access it from anywhere via different protocols
    • Update it, version it, and take snapshots
    • Share versions with who you want
    • Synchronize among locations
    Diagram labels: Globus Transfer, HTTP/REST, Desktop sync; Globus Storage volume; Commercial storage service provider; National research center; Campus computing center
  • Globus Collaborate: For when you want to
    Join with a few or many people to:
    • Share documents
    • Track tasks
    • Send email
    • Share data
    • Do whatever
    With:
    • Common groups
    • Delegated management
  • Globus Storage & Collaborate in action
    Diagram labels (service: action):
    • Globus Connect: Move DTI results to Bryce’s laptop
    • Globus Storage: Create snapshot to share with group
    • Globus Transfer: Copy TBI data to compute cluster
    • Globus Nexus: Add Bryce to TBI collaboration
    • Globus Transfer: Move DTI results to shared volume
    • Globus Collaborate: Publish DTI data to TBI web site
    • Globus Storage: Create volume and share with TBI group
    • Globus Connect: Move MRI files to TBI shared volume
    People and resources: Kyle; Bryce; DTI Group (Kyle, Bryce); PADS Compute Cluster; Amazon S3; SDSC Cloud Object Store; UChicago; Cornell Red Cloud; “TBI” volume
    TBI = Traumatic Brain Injury; DTI = Diffusion Tensor Imaging; MRI = Magnetic Resonance Imaging
  • Use case: Earth System Grid
    Outsource data transfer to Globus
      – Data download from search
      – Data transfer to another server
      – Replication between sites
    Next step is automated publication
    No ESGF client software needed
  • Data acquisition, management, analysis: don’t forget!
    Diagram labels: Experiments; Literature; Computations
    Big Data (volume, velocity, variety, variability) … demands Big Process in order for discovery to scale
  • How to proceed
    • Top down:
      – Large-scale integration, standardized formats, common protocols, etc.
      – Good if achieved, but likely to be slow and painful
    • Bottom up:
      – Consider opportunities to encourage useful behaviors via outsourcing and automation
      – Making data accessible is the first (and easiest?) 90%
      – Facilitate sharing, annotation, emergence of (localized) structure, bridging among structures
  • Acknowledgements
    • Thanks for vital and much appreciated support:
      – DOE Office of Advanced Scientific Computing Research (ASCR)
      – NSF Office of Cyberinfrastructure (OCI)
      – National Institutes of Health
      – The University of Chicago
    • Thanks to the Globus Online team at the University of Chicago and Argonne for their amazing work. See https://www.globusonline.org/about/goteam/
  • Thank you!
    foster@anl.gov
    foster@uchicago.edu
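
The Globus Transfer details slide lists easy “fire-and-forget” transfers and automatic fault recovery. As a rough illustration of that idea only, the Python sketch below retries a simulated multi-file transfer and skips files that have already arrived. It does not use or represent the Globus Online API: TransferTask, copy_file, run_transfer, and the simulated failure rate are hypothetical names and assumptions introduced here.

```python
import random
import time
from dataclasses import dataclass, field


class TransientTransferError(Exception):
    """Stand-in for a recoverable failure (dropped connection, timeout)."""


@dataclass
class TransferTask:
    """Hypothetical record of one submitted transfer and its progress."""
    files: list[str]
    done: set[str] = field(default_factory=set)
    attempts: int = 0


def copy_file(name: str, rng: random.Random, failure_rate: float) -> None:
    """Pretend to copy one file; fail at random to simulate a flaky network."""
    if rng.random() < failure_rate:
        raise TransientTransferError(name)


def run_transfer(task: TransferTask, rng: random.Random,
                 failure_rate: float = 0.3, max_attempts: int = 20,
                 base_delay: float = 0.0) -> bool:
    """Retry until every file is copied or max_attempts is reached.

    Files recorded in task.done are skipped, so each retry resumes
    from the last checkpoint instead of starting over.
    """
    while task.attempts < max_attempts:
        task.attempts += 1
        try:
            for name in task.files:
                if name in task.done:
                    continue  # already copied in an earlier attempt
                copy_file(name, rng, failure_rate)
                task.done.add(name)
            return True
        except TransientTransferError:
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (task.attempts - 1))
    return False


if __name__ == "__main__":
    task = TransferTask(files=["scan_001.dat", "scan_002.dat", "scan_003.dat"])
    ok = run_transfer(task, random.Random(0))
    print("completed" if ok else "gave up", "after", task.attempts, "attempt(s)")
```

The caller submits the task once and checks the result later; retrying, backing off, and resuming are handled inside run_transfer, which is the sense of “fire-and-forget” this sketch assumes.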