Accelerating Discovery via Science Services

[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.

Published in: Science

  1. Ian Foster: Accelerating discovery via science services
  2. We want to accelerate progress on the most pressing questions. Domains: Life Sciences and Biology; Advanced Materials; Condensed Matter Physics; Chemistry and Catalysis; Soft Materials; Environmental and Geo Sciences. Questions: Can we determine pathways that lead to novel states and nonequilibrium assemblies? Can we observe – and control – nanoscale chemical transformations in macroscopic systems? Can we create new materials with extraordinary properties – by engineering defects at the atomic scale? Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems? Can we unravel the secrets of biological function – across length scales? Can we understand physical and chemical processes in the most extreme environments?
  3. The discovery process is iterative and time-consuming. Steps shown: Publish results; Collect data; Design experiment; Test hypothesis; Hypothesize explanation; Identify patterns; Analyze data; Pose question.
  4. J.C.R. Licklider, 1960: About 85% of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know.
  5. Outsourcing: for economies of scale in the use of automated methods. Automation: to apply more sophisticated methods at larger scales.
  6. Outsourcing and automation: (1) The Grid. Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG). 10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks.
  7. Outsourcing and automation: (2) The Cloud
  8. The Software as a Service (SaaS) revolution. Customer relationship management (CRM): a knowledge-intensive process. Historically, handled manually or via expensive, inflexible on-premise software. SaaS has revolutionized how CRM is consumed: outsource to provider who runs software on cloud; access via simple interfaces. Comparison chart labels: Ease of use, Cost, Flexibility; SaaS vs. On-premise.
  9. Where can we automate and outsource in science broadly? Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data. Automate and outsource: Science services.
  10. Many services are used by science, but have limitations
  11. Science services exist, but do not address whole life cycle
  12. Accelerating discovery via science services: (1) Eliminate data friction
  13. The elimination of data friction is a key to faster discovery. “Civilization advances by extending the number of important operations which we can perform without thinking about them” (Whitehead, 1912). Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015).
  14. We have the highways but not the delivery service. Our highways encompass the Internet, ultra-high-speed networks, science DMZs, data transfer nodes, high-speed transport protocols. A good delivery service automates, schedules, accelerates, adapts. It provides APIs for experts and casual users. Cuts costs and saves
  15. Globus: Research data management as a service (globus.org). Essential research data management services: file transfer; data sharing; data publication; identity and groups. Builds on 15 years of DOE research. Outsourced and automated: high availability, reliability, performance, scalability. Convenient for casual users (Web interfaces), power users (APIs), and administrators (install, manage).
  16. “I need to easily, quickly, & reliably move data to other locations.” Locations shown: Research Computing HPC Cluster; Lab Server; Campus Home Filesystem; Desktop Workstation; Personal Laptop; DOE supercomputer; Public Cloud.
  17. One APS node connects to 125 locations
  18. “I need to get data from a scientific instrument to my analysis system.” Instruments shown: Next Gen Sequencer; Light Sheet Microscope; MRI; Advanced Light Source.
  19. “I need to easily and securely share my data with my colleagues.”
  20. Globus and the research data lifecycle (Transfer, Share, Publish, Discover). (1) Researcher initiates transfer request; or requested automatically by script, science gateway. (2) Globus transfers files reliably, securely. (3) Researcher selects files to share, selects user or group, and sets access permissions. (4) Globus controls access to shared files on existing storage; no need to move files to cloud storage! (5) Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus. (6) Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific). (7) Curator reviews and approves; data set published on campus or other system. (8) Peers, collaborators search and discover datasets; transfer and share using Globus. Systems shown: Instrument; Compute Facility; Publication Repository; Personal Computer. SaaS: only a web browser required; use storage system of your choice; access using your campus credentials.
  21. Globus and DOE: Terabytes per month
  22. Globus by the numbers: 5 major services; 130 federated campus IdPs; 115 petabytes transferred; 8,000 managed storage systems; 20 billion files processed; 99.95% uptime over past 2 years; 25,000 registered users; >30 institutional subscribers; 3 months longest transfer; 1 petabyte biggest transfer; 50M most files in one transfer; 13 national labs use services.
  23. Accelerating discovery via science services: (2) Create platform services
  24. Globus service APIs provide elements of a science platform: Identity, Group, and Profile Management; …; Globus Toolkit; Globus APIs; Globus Connect; Data Publication & Discovery; File Sharing; File Transfer & Replication.
  25. Publication as service for ACME climate modeling consortium
  26. kbase.us
  27. Accelerating discovery via science services: (3) Liberate scientific data
  28. Q: What is the biggest obstacle to data sharing in science? A: The vast majority of data is lost, or not online; if online, not described; if described, not indexed. Result: not accessible; not discoverable; not used. Contrast with common practice for consumer photos (iPhoto): automated capture; publish then curate; processing to add value; outsourced storage.
  29. We must automate the capture, linking, and indexing of all data. Globus publication service encodes and automates data publication pipelines. Example application: Materials Data Facility for materials simulation and experiment data. Proposed distributed virtual collections index, organize, tag, & manage distributed data. Think iPhoto on steroids, backed by domain knowledge and supercomputing power.
  30. We must automate the capture, linking, and indexing of all data. chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.). Plenario: Spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.).
  31. Flory-Huggins parameters liberated! R. Tchoua, J. De Pablo
  32. “I need to publish my data so that others can find it and use it.” Scholarly Publication; Reference Dataset; Research Community Collaboration.
  33. Publish dashboard
  34. Configuring a publication pipeline: Publication “facets”. Identifier: URL, Handle, DOI. Description: none, standard, custom, domain-specific. Curation: none, acceptance, machine-validated, human-validated. Access: anonymous, public, collaborators, embargoed. Preservation: transient, project lifetime, “forever”, archive.
  35. Accelerating discovery via science services: (4) Create discovery engines
  36. Data-driven science requires collaborative discovery engines. Diagram elements: informatics analysis; high-throughput experiments; problem specification; modeling and simulation; analysis & visualization; experimental design; analysis & visualization; integrated databases. (Rick Stevens)
  37. Example: A discovery engine for disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Diagram elements: Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60%, Sr 40%. Detect errors (secs to mins). Knowledge base: past experiments; simulations; literature; expert knowledge. Select experiments (mins to hours). Contribute to knowledge base. Simulations driven by experiments (mins to days). Knowledge-driven decision making. Evolutionary optimization.
  38. New architectures and methods create opportunities and challenges. Integrate data movement, management, workflow, and computation to accelerate data-driven applications, organize data for efficient use. Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data. New computer facilities enable on-demand computing and high-speed analysis of large quantities of data.
  39. Towards discovery engines for energy science (Argonne LDRD). Simulation: characterize, predict; assimilate; steer data acquisition. Data analysis: reconstruct, detect features, auto-correlate, particle distributions, … Science automation services: scripting, security, storage, cataloging, transfer. Data flows: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs). Integration: optimize, fit, …; configure; check; guide. Compute scale, batch to immediate: 0.001, 1, 100+ PFlops (precompute material database; reconstruct image; auto-correlation; feature detection). Scientific opportunities: probe material structure and function at unprecedented scales. Technical challenges: many experimental modalities; data rates and computation needs vary widely and are increasing; knowledge management, integration, synthesis.
  40. Linking experiment and computation (Advanced Photon Source). Single-crystal diffuse scattering: defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP). Near-field high-energy X-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime. X-ray nano/microtomography: bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response. Figure labels: 2-BM; 1-ID; 6-ID; Populate; Sim; Select; microstructure of a copper wire, 0.2 mm diameter; experimental and simulated scattering from manganite.
  41. Rapid assessment of alignment quality in high-energy diffraction microscopy (Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer). Systems: Blue Gene/Q; Orthros (all data in NFS). Dataset: 360 files, 4 GB total. 1: Median calc, 75s (90% I/O), MedianImage.c, uses Swift/K. 2: Peak Search, 15s per file, ImageProcessing.c, uses Swift/K. Reduced dataset: 360 files, 5 MB total. 3: Generate Parameters, FOP.c, 50 tasks, 25s/task, ¼ CPU hours, uses Swift/K. 3: Convert bin L to N, 2 min for all files, convert files to Network Endian format. 4: Analysis Pass, FitOrientation.c, 60s/task (PC), 1667 CPU hours; 60s/task (BG/Q), 1667 CPU hours; uses Swift/T. Feedback to experiment (Detector). Other labels: GO Transfer; ssh; Globus Catalog; Scientific Metadata; Workflow Progress; Workflow Control Script (Bash); Manual; Before; After. Up to 2.2 M CPU hours per week! This is a single workflow.
  42. Science services raise research and policy questions. What else can we automate and outsource? How do we choose opportunities? How do we measure success? How must our computer systems evolve? High-capacity discovery engines: where, how? What will science become in a services era? Will it be more democratic? Collaborative? Entrepreneurial? More or less creative? What are implications for trust and reproducibility?
  43. What would Beer say? “The question which asks how to use the computer in the enterprise, is, in short, the wrong question. A better formulation is to ask how the enterprise should be run given that computers exist. The best version of all is the question asking what, given computers, the enterprise now is.” (Stafford Beer, “Brain of the Firm”, 1972)
  44. Opportunities and challenges for discovery acceleration. Immediate opportunities: reduce data friction and accelerate discovery by applying Globus services across DOE facilities; develop new services to capture, link science data. Important research agenda: discovery engines to answer major scientific questions; new research modalities linking computation and data; organization and analysis of massive science data. Diagram elements: informatics analysis; high-throughput experiments; problem specification; modeling and simulation; analysis & visualization; experimental design; analysis & visualization; integrated databases.
  45. Thank you to our sponsors! U.S. DEPARTMENT OF ENERGY
  46. For more information: foster@anl.gov. Thanks to co-authors and Globus team. Globus services (globus.org): Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014. Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014. Publication (globus.org/data-publication): Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015. Discovery engines: Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
  47. Questions? foster@anl.gov
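
Slide 22 quotes 99.95% uptime over the past 2 years. As a back-of-envelope reading, that figure can be converted into an implied downtime budget of roughly 8.8 hours. The sketch below assumes 365-day years and that the percentage is a simple fraction of wall-clock time; neither assumption is stated on the slide, and the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap years


def downtime_hours(uptime_fraction, years):
    """Hours of downtime implied by an uptime fraction over a period."""
    return (1.0 - uptime_fraction) * years * HOURS_PER_YEAR


# Figures taken from slide 22: 99.95% uptime over 2 years.
implied = downtime_hours(0.9995, 2)
print(f"{implied:.2f} hours of downtime over 2 years")
```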
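
The publication "facets" on slide 34 can be read as a small configuration schema in which a pipeline picks one value per facet. The sketch below is a hypothetical model, not the Globus publication API: the class and method names are invented for illustration, and the grouping of values under each facet is one reading of the flattened slide text.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Identifier(Enum):
    URL = "URL"
    HANDLE = "Handle"
    DOI = "DOI"


class Description(Enum):
    NONE = "none"
    STANDARD = "standard"
    CUSTOM = "custom"
    DOMAIN_SPECIFIC = "domain-specific"


class Curation(Enum):
    NONE = "none"
    ACCEPTANCE = "acceptance"
    MACHINE_VALIDATED = "machine-validated"
    HUMAN_VALIDATED = "human-validated"


class Access(Enum):
    ANONYMOUS = "anonymous"
    PUBLIC = "public"
    COLLABORATORS = "collaborators"
    EMBARGOED = "embargoed"


class Preservation(Enum):
    TRANSIENT = "transient"
    PROJECT_LIFETIME = "project lifetime"
    FOREVER = "forever"
    ARCHIVE = "archive"


@dataclass(frozen=True)
class PublicationPipeline:
    """Hypothetical container holding one choice per facet."""
    identifier: Identifier
    description: Description
    curation: Curation
    access: Access
    preservation: Preservation

    def summary(self):
        """Render the chosen facet values as 'facet=value' pairs."""
        return ", ".join(f"{f.name}={getattr(self, f.name).value}" for f in fields(self))


# Example configuration (values chosen arbitrarily for illustration).
pipeline = PublicationPipeline(
    Identifier.DOI,
    Description.DOMAIN_SPECIFIC,
    Curation.HUMAN_VALIDATED,
    Access.EMBARGOED,
    Preservation.ARCHIVE,
)
print(pipeline.summary())
```

Using enumerations rather than free-form strings means an unsupported facet value fails at construction time, which fits the slide's idea of a pipeline that is configured rather than hand-built.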
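
Slide 39 lists ~200 TB/month alongside a ~2 GB/s total burst rate. A quick calculation suggests how bursty that workload is: spread evenly, 200 TB/month is under 0.1 GB/s. The sketch assumes decimal units (1 TB = 1000 GB) and a 30-day month, and it assumes the two figures describe the same facility workload, which the slide does not state explicitly.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def average_rate_gb_per_s(tb_per_month):
    """Average rate in GB/s for a monthly volume given in decimal TB."""
    return tb_per_month * 1000 / SECONDS_PER_MONTH


avg = average_rate_gb_per_s(200)  # ~200 TB/month, from slide 39
burst = 2.0                       # ~2 GB/s total burst, from slide 39
print(f"average ~{avg:.3f} GB/s; burst is ~{burst / avg:.0f}x the average")
```

A burst rate more than an order of magnitude above the average is consistent with the slide's point that data rates and computation needs vary widely.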
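
Slide 40 describes estimating structure by inverse modeling with many-simulation evolutionary optimization. The toy sketch below illustrates the general idea only: candidate parameters are mutated, each is scored by comparing a simulated pattern against an observed one, and the best candidates survive. The Gaussian-peak forward model, the single parameter, and every name and setting here are assumptions made for illustration; the real work ran Swift-driven simulations on Blue Gene/Q.

```python
import math
import random

Q_GRID = [i / 20 for i in range(21)]  # hypothetical detector coordinates in [0, 1]


def simulate(x):
    """Toy forward model: a Gaussian peak centred at x (stand-in for a real simulation)."""
    return [math.exp(-((q - x) ** 2) / 0.02) for q in Q_GRID]


def misfit(x, observed):
    """Sum of squared differences between simulated and observed patterns."""
    return sum((s - o) ** 2 for s, o in zip(simulate(x), observed))


def evolve(observed, generations=40, parents=5, children=20, sigma=0.05, seed=0):
    """Elitist (mu + lambda) evolution strategy over one parameter in [0, 1]."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(parents)]
    history = []
    for _ in range(generations):
        # Mutate randomly chosen parents with Gaussian noise, clamped to [0, 1].
        offspring = [
            min(1.0, max(0.0, rng.choice(population) + rng.gauss(0.0, sigma)))
            for _ in range(children)
        ]
        # Parents compete with offspring, so the best misfit never gets worse.
        ranked = sorted(population + offspring, key=lambda x: misfit(x, observed))
        population = ranked[:parents]
        history.append(misfit(population[0], observed))
    return population[0], history


observed = simulate(0.4)  # synthetic "experiment" with a known answer
best, history = evolve(observed)
print(f"estimated parameter: {best:.3f}, final misfit: {history[-1]:.2e}")
```

Each call to `misfit` is independent of the others, which is what makes this style of optimization a natural fit for running many simulations in parallel.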