Accelerating Data-driven Discovery in Energy Science

A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)

Published in: Technology

  1. Ian Foster, Distinguished Fellow. Accelerating data-driven discovery in energy science
  2. New tools are needed to answer the most pressing scientific questions. Domains: Life Sciences and Biology; Advanced Materials; Condensed Matter Physics; Chemistry and Catalysis; Soft Materials; Environmental and Geo Sciences. Can we determine pathways that lead to novel states and nonequilibrium assemblies? Can we observe – and control – nanoscale chemical transformations in macroscopic systems? Can we create new materials with extraordinary properties – by engineering defects at the atomic scale? Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems? Can we unravel the secrets of biological function – across length scales? Can we understand physical and chemical processes in the most extreme environments?
  3. The resulting data deluge spans biology, climate, cosmology, materials, physics, urban sciences, … Simulation data: petascale to exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc. Experimental data: light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc. New research methods that depend on coupling 1) of computation and experiment (inverse problems, computer control) and 2) across data sources and types (knowledge integration, analysis)
  4. Scientific progress requires collaborative discovery engines (Rick Stevens). Diagram labels: informatics analysis; high-throughput experiments; problem specification; modeling and simulation; analysis & visualization; experimental design; analysis & visualization; integrated databases
  5. Example: A discovery engine for disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Diagram labels: Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60%, Sr 40%; Detect errors (secs to mins); Knowledge base (past experiments; simulations; literature; expert knowledge); Select experiments (mins to hours); Contribute to knowledge base; Simulations driven by experiments (mins to days); Knowledge-driven decision making; Evolutionary optimization
  6. Accelerating data-driven discovery in energy science: (1) Eliminate data friction
  7. Eliminating data friction is essential to modern science. "Civilization advances by extending the number of important operations which we can perform without thinking about them" (Whitehead, 1912). Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)
  8. Software as a service (SaaS) as lubricant. Customer relationship management (CRM): a knowledge-intensive process. Historically, handled manually or via expensive, inflexible on-premise software. SaaS has revolutionized how CRM is consumed: • Outsource to provider who runs software on cloud • Access via simple interfaces. Chart labels: Ease of use; Cost; Flexibility; SaaS; On-premise
  9. Globus: Research data management as a service. Essential research data management services: • File transfer • Data sharing • Data publication • Identity and groups. Builds on 15 years of DOE research. Outsourced and automated: • High availability, reliability, performance, scalability • Convenient for casual users (Web interfaces), power users (APIs), and administrators (install, manage). globus.org
  10. “I need to easily, quickly, & reliably move data to other locations.” Locations shown: Research Computing HPC Cluster; Lab Server; Campus Home Filesystem; Desktop Workstation; Personal Laptop; DOE supercomputer; Public Cloud
  11. “I need to get data from a scientific instrument to my analysis system.” Instruments shown: Next Gen Sequencer; Light Sheet Microscope; MRI; Advanced Light Source
  12. “I need to easily and securely share my data with my colleagues.”
  13. Globus and the research data lifecycle. (1) Researcher initiates transfer request, or it is requested automatically by a script or science gateway. (2) Globus transfers files reliably and securely. (3) Researcher selects files to share, selects user or group, and sets access permissions. (4) Globus controls access to shared files on existing storage; no need to move files to cloud storage! (5) Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus. (6) Researcher assembles data set and describes it using metadata (Dublin Core and domain-specific). (7) Curator reviews and approves; data set published on campus or other system. (8) Peers and collaborators search and discover datasets; transfer and share using Globus. Diagram labels: Instrument; Compute Facility; Publication Repository; Personal Computer; Transfer; Share; Publish; Discover. • SaaS: only a web browser required • Use storage system of your choice • Access using your campus credentials
  14. Globus at a glance: 4 major services; 13 national labs use Globus services; 100 PB transferred; 8,000 active endpoints; 20 billion files processed; >300 users are active daily; 25,000 registered users; 99.95% uptime over the past two years; >30 subscribers. The biggest transfer to date is 1 petabyte. The longest-running transfer to date took 3 months. We’re eager to learn what you want to do with Globus services
  15. One APS node connects to 125 locations through mid-2014
  16. Same node (1 Gbps link)
  17. Globus and DOE: Terabytes per month
  18. Globus and DOE: Running total terabytes
  19. Globus and DOE: Active users per month
  20. Response has been gratifying. "Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory. "Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA. "…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing. "... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications. "We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory. "The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user. "I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Lab. "We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory. "Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user. "The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team, NCSA
  21. Globus service APIs serve as a science platform. Diagram labels: Identity, Group, and Profile Management; …; Globus Toolkit; Globus APIs; Globus Connect; Data Publication & Discovery; File Sharing; File Transfer & Replication
  22. Globus platform services enable new application capabilities
  23. Publication as service for ACME
  24. Globus platform accelerates development of new services
  25. Operating a sustainable service. Globus is a not-for-profit service for researchers. We adopt a subscription-supported freemium model. Subscribers get extra features, rapid support. We’re engaged in crossing the chasm. Support from DOE will contribute to long-term success
  26. Accelerating data-driven discovery in energy science: (2) Liberate scientific data
  27. Q: What is the biggest obstacle to data sharing in science? A: The vast majority of data is lost or not online; if online, not described; if described, not indexed. Result: not accessible, not discoverable, not used. Contrast with common practice for consumer photos (iPhoto): • Automated capture • Publish then curate • Processing to add value • Outsourced storage
  28. We must automate the capture, linking, and indexing of all data. Globus publication service encodes and automates data publication pipelines. Example application: Materials Data Facility for materials simulation and experiment data. Proposed distributed virtual collections index, organize, tag, & manage distributed data. Think iPhoto on steroids – backed by domain knowledge and supercomputing power
  29. We must automate the capture, linking, and indexing of all data. chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.). Plenario: Spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.)
  30. “I need to publish my data so that others can find it and use it.” Diagram labels: Scholarly Publication; Reference Dataset; Research Community Collaboration
  31. Publish dashboard
  32. Start a new submission
  33. Describe submission: 1) Dublin Core
  34. Describe submission: 2) Science metadata
  35. Assemble the dataset
  36. Transfer files to submission endpoint
  37. Check dataset is assembled correctly
  38. Submission now in curation workflow
  39. Search published datasets
  40. Search across collections
  41. Discover a published dataset
  42. Select a published dataset
  43. View downloaded dataset
  44. Configuring a publication pipeline: Publication “facets”. Identifier: URL, Handle, DOI. Description: none, standard, custom, domain-specific. Curation: none, acceptance, machine-validated, human-validated. Access: anonymous, public, collaborators, embargoed. Preservation: transient, project lifetime, “forever”, archive
  45. Accelerating data-driven discovery in energy science: (3) Create discovery engines at DOE facilities
  46. Recall: A discovery engine for disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Diagram labels: Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60%, Sr 40%; Detect errors (secs to mins); Knowledge base (past experiments; simulations; literature; expert knowledge); Select experiments (mins to hours); Contribute to knowledge base; Simulations driven by experiments (mins to days); Knowledge-driven decision making; Evolutionary optimization
  47. Towards discovery engines for energy science (Argonne LDRD). Scientific opportunities: • Probe material structure and function at unprecedented scales. Technical challenges: • Many experimental modalities • Data rates and computation needs vary widely; increasing • Knowledge management, integration, synthesis. Diagram labels: Simulation (characterize, predict); Assimilate; Steer data acquisition; Data analysis (reconstruct, detect features, auto-correlate, particle distributions, …); Science automation services (scripting, security, storage, cataloging, transfer); Integration (optimize, fit, …); Configure; Check; Guide; Batch; Immediate; 0.001, 1, 100+ PFlops; Precompute material database; Reconstruct image; Auto-correlation; Feature detection. Data rates: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs)
  48. Linking experiment and computation. Single-crystal diffuse scattering: Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP). Near-field high-energy X-ray diffraction microscopy: Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime. X-ray nano/microtomography: Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response. Figure labels: 2-BM; 1-ID; 6-ID; Populate; Sim; Sim; Select; Sim; Advanced Photon Source. Captions: Microstructure of a copper wire, 0.2mm diameter; Experimental and simulated scattering from manganite
  49. Tying it all together: An energy sciences infrastructure. 0: Develop or reuse script. 1: Run script (EL1.layer). 2: Lookup file (name=EL1.layer, user=Anton, type=reconstruction). 3: Transfer inputs. 4: Run app. 5: Transfer results. 6: Update catalogs. Diagram labels: Researchers; Storage locations; Compute facilities; External collaborators; Collaboration catalogs (Provenance; Files & Metadata; Script libraries)
  50. Summary: Big opportunities and challenges for energy data. Immediate opportunities: • Reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities • Develop new services to capture, link energy data. Important research agenda: • Discovery engines to answer major scientific questions • New research modalities linking computation and data • Organization and analysis of massive science data. Diagram labels: informatics analysis; high-throughput experiments; problem specification; modeling and simulation; analysis & visualization; experimental design; analysis & visualization; integrated databases
  51. Thank you to our sponsors! U.S. DEPARTMENT OF ENERGY
  52. For more information: foster@anl.gov. Thanks to co-authors and Globus team. Globus services (globus.org): • Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. • Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014. • Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014. Publication (globus.org/data-publication): • Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I., Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015. Discovery engines: • Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
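
To put the throughput figures quoted on slide 20 in perspective (153 TB at 20 Gb/s and 144 TB at 25 Gb/s), the following is a minimal sketch of the arithmetic. It is not part of the talk; the function name `transfer_time_hours` is hypothetical, and it assumes decimal units and a constant sustained rate with no protocol overhead or retries.

```python
def transfer_time_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours needed to move size_tb terabytes at rate_gbps gigabits per second.

    Assumptions (not from the slides): 1 TB = 10**12 bytes, 1 Gb = 10**9 bits,
    and the rate is sustained for the whole transfer.
    """
    if rate_gbps <= 0:
        raise ValueError("rate_gbps must be positive")
    total_bits = size_tb * 1e12 * 8           # terabytes -> bits
    seconds = total_bits / (rate_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600.0


if __name__ == "__main__":
    # The two transfers quoted on slide 20.
    for size_tb, rate_gbps in [(153, 20), (144, 25)]:
        hours = transfer_time_hours(size_tb, rate_gbps)
        print(f"{size_tb} TB at {rate_gbps} Gb/s: {hours:.1f} hours")
```

Under these assumptions the two quoted transfers take roughly 17.0 and 12.8 hours respectively; real transfers will differ because rates fluctuate.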
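
The numbered workflow on slide 49 (look up a file by metadata, transfer inputs, run an application, transfer results, update catalogs) can be illustrated with a small in-memory sketch. This is only a toy model under stated assumptions: `Catalog`, `CatalogEntry`, and `run_workflow` are hypothetical names, transfers are stubbed out as plain strings, and nothing here reflects the actual Globus or catalog APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    name: str
    user: str
    kind: str      # the slide calls this attribute "type"
    location: str  # hypothetical storage-location label


@dataclass
class Catalog:
    entries: List[CatalogEntry] = field(default_factory=list)
    provenance: List[Dict[str, object]] = field(default_factory=list)

    def lookup(self, **attrs: str) -> List[CatalogEntry]:
        """Return entries whose attributes all match the given values."""
        return [e for e in self.entries
                if all(getattr(e, k) == v for k, v in attrs.items())]


def run_workflow(catalog: Catalog, name: str, user: str, kind: str,
                 app: Callable[[List[str]], object]) -> object:
    """Walk through steps 2-6 of the slide with stubbed transfers."""
    matches = catalog.lookup(name=name, user=user, kind=kind)  # 2: lookup file
    if not matches:
        raise LookupError(f"no catalog entry for {name!r}")
    inputs = [e.location for e in matches]   # 3: transfer inputs (stub)
    result = app(inputs)                     # 4: run app
    output = f"results/{name}.out"           # 5: transfer results (stub)
    # 6: update catalogs with the new file and its provenance record
    catalog.entries.append(CatalogEntry(f"{name}.out", user, "result", output))
    catalog.provenance.append({"inputs": inputs, "output": output, "user": user})
    return result


if __name__ == "__main__":
    cat = Catalog([CatalogEntry("EL1.layer", "Anton", "reconstruction", "store-a")])
    print(run_workflow(cat, "EL1.layer", "Anton", "reconstruction", app=len))
    print(cat.provenance)
```

Keeping the provenance record next to the catalog update mirrors the slide's grouping of provenance, files and metadata under "collaboration catalogs"; a real deployment would replace the stubs with actual transfer and compute services.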
