Talk given at the Materials Genome Initiative Workshop on Building the Materials Innovation Infrastructure: Data and Standards, held May 14-15, 2012 at the U.S. Department of Commerce (Herbert Hoover) building in Washington, DC. I made the case that to deal effectively with BIG DATA, you need BIG PROCESS. I described how Globus Online is addressing that need.
Process automation for data-driven science
1. Process automation for data-driven science
Ian Foster
Computation Institute
Argonne National Laboratory & The University of Chicago
Talk at Materials Genome Initiative Workshop, May 14-15, DC
www.ci.anl.gov
www.ci.uchicago.edu
2. Where we want to get to
Imagine if, when tackling a problem, we could
easily, both alone and within a distributed team:
• Assemble, integrate, and interpret all relevant
data—organized within a knowledge network
• Be informed of anomalies, patterns, and gaps
• Formulate and evaluate computational models
• Launch automated processes to test
hypotheses & expand the knowledge network
All within an environment in which productive
strategies could be easily scaled—and repeated
3. The attractive vs. the pragmatic
• Some attractive goals expressed yesterday
– “Record the complete process used to generate data”
– “Define standard formats and metadata”
– “Make users rate data every time they use it”
– “Eliminate incorrect data from databases”
• My pragmatic take on how best to proceed
– “Identify, automate, and streamline key
processes to make desirable behaviors
easy”
5. TripIt exemplifies process automation
• Me: Book flights; Book hotel
• Other services: Record flights; Suggest hotel; Record hotel; Get weather; Prepare maps; Share info; Check prices; Monitor flight
6. Process automation for science
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
Usage figures shown alongside the steps:
– >5,000 registered users, >4 PB moved
– >25,000 registered users, >1 PB access
– >45,000 metagenomes, 12 Tbp
7. A simple take on “big process for science”
Research Data Management-as-a-Service
• SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
• PaaS: Globus Integrate, …
8. Globus Transfer: Data movement
10. Globus Transfer details
• Reliable file transfer.
– Easy “fire-and-forget” transfers
– Automatic fault recovery
– High performance
– Across multiple security domains
• No IT required.
– Software as a Service (SaaS)
  • No client software installation
  • New features automatically available
– Consolidated support & troubleshooting
– Works with existing GridFTP servers; Globus Connect for “last mile”
• >5,000 users, >4 PB and 500,000,000 files moved
• >99.9% uptime in 2012
Adopted by Advanced Photon Source, NERSC, Blue Waters, campuses
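To make "fire-and-forget" with "automatic fault recovery" concrete, here is a minimal Python sketch of the general pattern: submit a list of files once, and let the service retry each transient failure with exponential backoff. This is an illustration only and does not use or describe the Globus Transfer API; `flaky_copy`, `transfer`, `TransientError`, and the file names are hypothetical.

```python
import random
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """A recoverable failure, such as a dropped connection."""


def flaky_copy(path: str, rng: random.Random, failure_rate: float = 0.5) -> None:
    """Hypothetical stand-in for copying one file; fails at random."""
    if rng.random() < failure_rate:
        raise TransientError(f"connection lost while copying {path}")


def transfer(
    paths: List[str],
    copy_one: Callable[[str], None],
    max_attempts: int = 5,
    base_delay: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Copy every path, retrying each one with exponential backoff.

    Returns (done, failed); a path lands in `failed` only after
    max_attempts consecutive transient errors.
    """
    done: List[str] = []
    failed: List[str] = []
    for path in paths:
        for attempt in range(max_attempts):
            try:
                copy_one(path)
                done.append(path)
                break
            except TransientError:
                # Wait longer after each failure before trying again.
                time.sleep(base_delay * (2 ** attempt))
        else:
            # The loop finished without a successful copy.
            failed.append(path)
    return done, failed


rng = random.Random(0)  # seeded so repeated runs behave the same way
files = [f"scan_{i:03d}.h5" for i in range(10)]
done, failed = transfer(files, lambda p: flaky_copy(p, rng))
print(f"{len(done)} transferred, {len(failed)} failed after retries")
```

The caller never sees individual dropped connections, only the final report; a real service would also persist task state so that recovery survives restarts, which this sketch omits.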
11. Globus Storage and Globus Collaborate
12. Globus Storage: For when you want to …
• Place your data where you want
• Access it from anywhere via different protocols
• Update it, version it, and take snapshots
• Share versions with who you want
• Synchronize among locations
Diagram labels:
– Access to the Globus Storage volume: Globus Transfer, HTTP/REST, Desktop sync
– Locations: Commercial storage service provider, National research center, Campus computing center
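The "update it, version it, and take snapshots" idea can be illustrated with a toy in-memory model. The Python sketch below is an assumption-laden illustration, not Globus Storage's actual design or interface: the `Volume` class, its methods, and the file and snapshot names are all hypothetical.

```python
from typing import Dict, List


class Volume:
    """Toy in-memory volume: every write adds a version, snapshots freeze state."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[bytes]] = {}
        self._snapshots: Dict[str, Dict[str, int]] = {}

    def write(self, path: str, data: bytes) -> int:
        """Store a new version of `path`; return its version number (from 1)."""
        self._versions.setdefault(path, []).append(data)
        return len(self._versions[path])

    def read(self, path: str, version: int = 0) -> bytes:
        """Read a specific version; version 0 means the latest."""
        history = self._versions[path]
        return history[version - 1] if version else history[-1]

    def snapshot(self, name: str) -> None:
        """Record the current version number of every file under `name`."""
        self._snapshots[name] = {p: len(h) for p, h in self._versions.items()}

    def read_snapshot(self, name: str, path: str) -> bytes:
        """Read `path` as it was when the snapshot `name` was taken."""
        return self.read(path, self._snapshots[name][path])


vol = Volume()
vol.write("results.csv", b"first draft")
vol.snapshot("for-group")            # freeze what collaborators will see
vol.write("results.csv", b"second draft")

print(vol.read("results.csv"))                     # latest version
print(vol.read_snapshot("for-group", "results.csv"))  # frozen version
```

Sharing a snapshot rather than the live volume lets the owner keep editing without changing what collaborators see.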
13. Globus Collaborate: For when you want to
Join with a few or many people to:
• Share documents
• Track tasks
• Send email
• Share data
• Do whatever
With:
• Common groups
• Delegated management
14. Globus Storage & Collaborate in action
People: Kyle and Bryce (DTI Group)
Actions shown in the diagram, by service:
• Globus Storage: Create volume and share with TBI group
• Globus Connect: Move MRI files to TBI shared volume
• Globus Nexus: Add Bryce to TBI collaboration
• Globus Transfer: Copy TBI data to compute cluster
• Globus Transfer: Move DTI results to shared volume
• Globus Storage: Create snapshot to share with group
• Globus Collaborate: Publish DTI data to TBI web site
• Globus Connect: Move DTI results to Bryce’s laptop
Resources shown: PADS Compute Cluster; “TBI” volume; Amazon S3; SDSC Cloud Object Store; UChicago; Cornell Red Cloud
TBI=Traumatic Brain Injury
DTI=Diffusion Tensor Imaging
MRI=Magnetic Resonance Imaging
15. Use case: Earth System Grid
Outsource data transfer to Globus
– Data download from search
– Data transfer to another server
– Replication between sites
Next step is automated publication
No ESGF client software needed
16. Data acquisition, management, analysis
Experiments, Literature, Computations (slide callout: “don’t forget!”)
Big Data (volume, velocity, variety, variability) demands Big Process in order for discovery to scale
17. How to proceed
• Top down:
– Large-scale integration, standardized formats,
common protocols, etc.
– Good if achieved, but likely to be slow and painful
• Bottom up:
– Consider opportunities to encourage useful
behaviors via outsourcing and automation
– Making data accessible is the first (and easiest?) 90%
– Facilitate sharing, annotation, emergence of
(localized) structure, bridging among structures
18. Acknowledgements
• Thanks for vital and much appreciated support:
– DOE Office of Advanced Scientific Computing
Research (ASCR)
– NSF Office of Cyberinfrastructure (OCI)
– National Institutes of Health
– The University of Chicago
• Thanks to the Globus Online team at the
University of Chicago and Argonne for their
amazing work. See
https://www.globusonline.org/about/goteam/
Given continued exponential growth along so many dimensions, process efficiencies must improve at a comparable rate just to maintain constant progress.
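The arithmetic behind this point can be sketched in a few lines of Python. All numbers here are hypothetical and chosen only to make the pattern visible: data volume doubles every year, and the question is how much effort is needed per year when process efficiency stays flat versus when it doubles as well. The function names are illustrative.

```python
from typing import List


def hours_needed(data_tb: float, tb_per_hour: float) -> float:
    """Effort to handle a given data volume at a given processing rate."""
    return data_tb / tb_per_hour


def project(
    years: int,
    data_growth: float,
    efficiency_growth: float,
    data_tb: float = 1.0,
    tb_per_hour: float = 1.0,
) -> List[float]:
    """Return the hours needed in each year from 0 to `years` inclusive."""
    out: List[float] = []
    for _ in range(years + 1):
        out.append(hours_needed(data_tb, tb_per_hour))
        data_tb *= data_growth             # data keeps growing
        tb_per_hour *= efficiency_growth   # process may or may not keep up
    return out


# Hypothetical scenario: data doubles yearly.
static = project(5, data_growth=2.0, efficiency_growth=1.0)
matched = project(5, data_growth=2.0, efficiency_growth=2.0)

print(static)   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(matched)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

With flat efficiency the effort grows as fast as the data; only when efficiency grows at the same rate does the effort stay constant.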