A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... (Ian Foster)
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Streamlined data sharing and analysis to accelerate cancer research (Ian Foster)
Advances in genomics and data analytics create new opportunities for cancer research and personalized medical treatment via large-scale federation of genomic, clinical, imaging and other data from many thousands of patients across institutions around the world. Despite these opportunities and promising early results, cancer research is often stymied by information technology barriers. One major barrier is a lack of tools for the reliable, secure, rapid, and easy transfer, sharing, and management of large collections of human data. In the absence of such tools, security and performance concerns often prevent sharing altogether or force researchers to resort to slow and error-prone shipping of physical media. If data are received, timely analysis is further impeded by the difficulties inherent in verifying data integrity and managing who can access data and for what purpose. I will discuss how the mature Globus data management platform addresses these obstacles to discovery and explain how its intuitive, web-based interfaces enable use by researchers without specialized IT knowledge. I will also describe how Globus technologies can be extended to meet the security requirements of human data so as to enable use in data-intensive cancer research.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences (Ian Foster)
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high-energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, archive, etc., that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery.
WoSC19: Serverless Workflows for Indexing Large Scientific Data (University of Chicago)
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function as a service models to enable scalable metadata extraction by orchestrating the execution of many, short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near to the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
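To make the extractor-orchestration idea concrete, here is a minimal Python sketch of the pattern the abstract describes: small, stateless, per-file-type extractor functions that a function-as-a-service executor could invoke once per file. The function names and dispatch table are illustrative assumptions, not the actual Xtract API.

```python
# Illustrative sketch of per-file-type metadata extractors (not the real Xtract API).
# Each extractor is a small, stateless function suitable for function-as-a-service execution.
import json
import os


def extract_csv(path):
    """Guess column names and count rows in a CSV file."""
    with open(path, "r", errors="ignore") as f:
        header = f.readline().strip().split(",")
        rows = sum(1 for _ in f)
    return {"type": "tabular", "columns": header, "rows": rows}


def extract_json(path):
    """Record the top-level keys of a JSON document."""
    with open(path, "r", errors="ignore") as f:
        doc = json.load(f)
    keys = sorted(doc.keys()) if isinstance(doc, dict) else []
    return {"type": "json", "top_level_keys": keys}


# Dispatch table mapping file extensions to extractor functions.
EXTRACTORS = {".csv": extract_csv, ".json": extract_json}


def extract_metadata(path):
    """Entry point a FaaS executor could invoke once per file, centrally or at the edge."""
    record = {"path": path, "size_bytes": os.path.getsize(path)}
    extractor = EXTRACTORS.get(os.path.splitext(path)[1].lower())
    if extractor:
        record.update(extractor(path))
    return record
```

In the pattern the paper describes, many such short-running functions would be dispatched in parallel across a repository, with the resulting metadata records fed into a search index.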
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca... (Databricks)
The speaker will review case studies from real-world projects that built AI systems using Natural Language Processing (NLP) in healthcare. These case studies cover projects that deployed automated patient risk prediction, automated diagnosis, clinical guidelines, and revenue cycle optimization.
These slides were presented at AGU 2018 by Tanu Malik from DePaul University, in a session convened by Dr. Ian Foster, director of the Data Science and Learning division at Argonne National Laboratory.
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
As the volume and complexity of data from myriad Earth-observing platforms, both remote sensing and in situ, increase, so does the demand for access to both the data and the information products derived from them. The audience is no longer restricted to investigator teams with specialist science credentials: non-specialist users, from scientists in other disciplines to the science-literate public, teachers, the general public, and decision makers, want access. What prevents this access? The very complexity of specialist-developed data formats, data set organizations, and specialist terminology. What can be done in response? We must shift the burden from the user to the data provider. To achieve this, our data infrastructures will likely need greater internal code and data-structure complexity in order to present (relatively) greater simplicity to the end user. Evidence from numerous technical and consumer markets supports this scenario. We will cover the elements of modern data environments, what the new use cases are, and how we can respond to them.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
I describe the origins, current state and potential future directions for the Earth System Grid Federation, an international consortium that develops infrastructure for sharing of climate simulation and related datasets.
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster (Globus)
This poster was presented at the 2019 NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium by Brigitte Raumann and Ian Foster of Globus, University of Chicago and Argonne National Lab.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022 (Hosted by Confluent)
The Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy Office of Science user facility, provides world-class high-performance computing (HPC) resources for open science as well as world-class expertise in scientific computing. The OLCF operates 2 of the top 5 supercomputers in the world: Frontier and Summit. Our Kafka cluster was built in 2018 to stream data from Summit, a 200 Petaflop system with 4,000 compute nodes—but is the cluster ready for Exascale? The OLCF has recently delivered Frontier, the world's first exascale system, and we engineered a significant increase in streaming bandwidth and volume to serve its performance metrics, system events, utilization metrics, job metadata, and facilities monitoring. Data is indexed and served through an Elasticsearch cluster and provided in real time to Grafana dashboards.
In this talk we will discuss scaling and planning a system to meet the streaming demands of the world’s only exascale and most energy efficient supercomputer. Tune in to learn more about HPC and how streaming fits in to monitoring large-scale systems. We will discuss aggregating data from many clusters into a central streaming system, shedding technical debt by pivoting to Confluent Operator on Kubernetes, and how we use real-time data to optimize supercomputer performance.
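For readers unfamiliar with how system telemetry enters a Kafka cluster in the first place, the sketch below shows the general producer pattern using the kafka-python client: serialize per-node metrics as JSON and publish them to a topic. The broker address, topic name, and metric fields are placeholders of my own; the talk does not describe OLCF's actual pipeline at this level of detail.

```python
# Minimal sketch of streaming node metrics into Kafka (illustrative; not OLCF's code).
# Assumes the kafka-python package and a broker at the placeholder address below.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.org:9092",       # placeholder broker address
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for _ in range(10):  # in practice this loop would run continuously on each node
    metrics = {
        "node": "node0001",                             # placeholder node name
        "timestamp": time.time(),
        "gpu_power_watts": random.uniform(300, 500),    # fabricated sample values
        "cpu_utilization": random.uniform(0, 100),
    }
    producer.send("node-metrics", value=metrics)        # placeholder topic
    time.sleep(1)

producer.flush()
```

Downstream, such records would be consumed, indexed (e.g., into Elasticsearch), and visualized in dashboards, as the talk describes.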
Accelerating data-intensive science by outsourcing the mundane (Ian Foster)
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
Keynote presentation at GlobusWorld 2021. Highlights product updates and roadmap, as well as user success stories in research data management. Presented by Ian Foster, Rachana Ananthakrishnan, Kyle Chard and Vas Vasiliadis.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
The computational requirements of next-generation sequencing are placing huge demands on IT organisations.
Building compute clusters is now a well understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high I/O rates.
This talk details our approach for providing storage for next-gen sequencing applications.
Talk given at BIO-IT World, Europe, 2009.
Next-Generation Search Engines for Information Retrieval (Waqas Tariq)
In recent years there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. These data should be stored in a format that is locatable, retrievable, and understandable; more importantly, they should be in a form, such as XML, that will continue to be accessible as technology changes. New search technologies are being implemented around these protocols, which make searching easy, fast, and yet robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search for, share, and obtain spatiotemporal data used across a range of climate and ecological sciences.
Global Services for Global Science March 2023.pptx (Ian Foster)
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
Keynote talk at 2022-10-11 ESnet6 launch. A lovely event by a great team. It was a pleasure to talk about how ESnet6 will enable new "smart instruments"--and some of the work that we are doing to that end.
Linking Scientific Instruments and Computation (Ian Foster)
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
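The abstract speaks of flows as configurable pipelines that link instruments, computing, storage, and catalogs. The sketch below illustrates that pattern only in the abstract: a flow expressed as a declarative, ordered list of named steps, each handing its context to the next. All step names and functions are hypothetical stand-ins; this is not the authors' flow system, which in practice builds on managed services rather than a local runner.

```python
# Pedagogical sketch of the "flow" pattern: a declarative sequence of steps that
# link data capture, movement, analysis, and cataloging. All steps are stand-ins.
def capture(ctx):
    ctx["raw_file"] = "/detector/scan_0001.h5"             # placeholder instrument output
    return ctx

def transfer(ctx):
    ctx["staged_file"] = "/cluster/scratch/scan_0001.h5"   # pretend the data were moved
    return ctx

def analyze(ctx):
    ctx["result"] = {"peaks_found": 42}                    # placeholder analysis output
    return ctx

def catalog(ctx):
    print("registering", ctx["staged_file"], ctx["result"])  # stand-in for a metadata catalog
    return ctx

# The flow itself is just configuration: an ordered list of (name, action) pairs.
FLOW = [("capture", capture), ("transfer", transfer),
        ("analyze", analyze), ("catalog", catalog)]

def run_flow(flow, context=None):
    context = context or {}
    for name, action in flow:
        print(f"running step: {name}")
        context = action(context)
    return context

if __name__ == "__main__":
    run_flow(FLOW)
```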
A Global Research Data Platform: How Globus Services Enable Scientific Discovery (Ian Foster)
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
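As one concrete, hedged illustration of the platform APIs described above: the Globus Python SDK exposes the transfer service programmatically. The sketch below assumes you already hold a transfer access token from a Globus Auth flow; the token, endpoint UUIDs, and paths are placeholders to be replaced with your own.

```python
# Sketch of submitting a managed transfer via the Globus Python SDK (globus_sdk).
# TRANSFER_TOKEN, endpoint UUIDs, and paths are placeholders you would supply yourself.
import globus_sdk

TRANSFER_TOKEN = "..."                          # obtained via a Globus Auth flow
SRC = "SOURCE-ENDPOINT-UUID"                    # placeholder source endpoint/collection
DST = "DESTINATION-ENDPOINT-UUID"               # placeholder destination endpoint/collection

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: label it, request checksum verification, add a file.
tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="example transfer",
                                sync_level="checksum")
tdata.add_item("/project/raw/scan_0001.h5", "/archive/scan_0001.h5")

task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])
```

Once submitted, the service manages retries, integrity checking, and notification without further involvement from the client, which is the "managed access" point the abstract makes.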
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental Science (Ian Foster)
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud Automation (Ian Foster)
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven Discovery (Ian Foster)
Talk presented at Workshop on Maximizing the Scientific Return of NASA Data. Makes the case that automation and outsourcing of data management tasks to cloud services is essential for effective data-driven discovery. Describes how the Globus research data management platform addresses this need.
Scaling collaborative data science with Globus and Jupyter (Ian Foster)
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
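A hedged sketch of what "existing institutional credentials" plus "data residing on disparate storage systems" can look like inside a notebook cell: a native-app Globus Auth login followed by a directory listing on a remote collection, using the Globus Python SDK. The client ID and collection UUID are placeholders, and production notebook deployments typically automate the login step rather than pasting a code by hand.

```python
# Notebook-style sketch: log in with Globus Auth, then list a remote collection.
# CLIENT_ID and ENDPOINT are placeholders; the authorization code is pasted back manually.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
ENDPOINT = "YOUR-COLLECTION-UUID"

auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in here and paste the code below:", auth_client.oauth2_get_authorize_url())

code = input("Authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# List files on the remote collection as the logged-in user.
for entry in tc.operation_ls(ENDPOINT, path="/"):
    print(entry["type"], entry["name"])
```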
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Going Smart and Deep on Materials at ALCF (Ian Foster)
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
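To show the "simple linear models to complex artificial neural networks" range concretely, here is a small, self-contained sketch using scikit-learn. The features and targets below are synthetic placeholders standing in for stopping-power records, not MDF's actual TDDFT data, and the model choices are illustrative only.

```python
# Sketch: fit a linear model and a small neural network as surrogate models.
# The data are synthetic placeholders standing in for stopping-power records.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))        # fabricated features (e.g., velocity, density, charge)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + 0.3 * X[:, 2] + rng.normal(0, 0.05, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X_train, y_train)

print("linear model R^2:", linear.score(X_test, y_test))
print("neural network R^2:", net.score(X_test, y_test))
```

In the workflow the abstract describes, models like these would be trained on curated MDF collections and then consulted to decide which new simulations are worth running.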
Software Infrastructure for a National Research Platform (Ian Foster)
A presentation at the First National Research Platform workshop. "The purpose of this workshop is to bring together representatives from interested institutions to discuss implementation strategies for deployment of interoperable Science DMZs at a national scale." I present eight desirable properties for a software infrastructure for such a platform, and describe our experience realizing these properties in the Globus system.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
2. New tools are needed to answer the most pressing scientific questions
[Slide graphic: scientific domains paired with driving questions]
Domains: Life Sciences and Biology; Advanced Materials; Condensed Matter Physics; Chemistry and Catalysis; Soft Materials; Environmental and Geo Sciences.
Driving questions:
- Can we determine pathways that lead to novel states and nonequilibrium assemblies?
- Can we observe – and control – nanoscale chemical transformations in macroscopic systems?
- Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?
- Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?
- Can we unravel the secrets of biological function – across length scales?
- Can we understand physical and chemical processes in the most extreme environments?
3. The resulting data deluge
Spans biology, climate, cosmology, materials, physics, urban sciences, …
- Simulation data: petascale-to-exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc.
- Experimental data: light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc.
New research methods depend on coupling (1) computation and experiment (inverse problems, computer control) and (2) data sources and types (knowledge integration, analysis).
4. Scientific progress requires collaborative discovery engines
[Diagram (credit: Rick Stevens): a discovery loop linking problem specification, modeling and simulation, analysis & visualization, experimental design, high-throughput experiments, informatics, and analysis, all built around integrated databases.]
5. Example: A discovery engine for disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne.
[Diagram: experimental scattering from a sample (material composition, e.g., La 60% / Sr 40%) is compared with simulated scattering from a simulated structure to detect errors (seconds to minutes); knowledge-driven decision making and evolutionary optimization select new experiments (minutes to hours); simulations driven by experiments run for minutes to days; results contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge.]
7. Eliminating data friction is essential to modern science
"Civilization advances by extending the number of important operations which we can perform without thinking about them." (Whitehead, 1912)
Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time. (DOE reports, 2005-2015)
8. Software as a service (SaaS) as lubricant
Customer relationship management (CRM) is a knowledge-intensive process, historically handled manually or via expensive, inflexible on-premise software.
SaaS has revolutionized how CRM is consumed: outsource to a provider who runs the software on the cloud, and access it via simple interfaces.
[Chart: SaaS vs. on-premise compared on ease of use, cost, and flexibility.]
9. Globus: Research data management as a service (globus.org)
Essential research data management services: file transfer, data sharing, data publication, identity and groups.
- Builds on 15 years of DOE research
- Outsourced and automated: high availability, reliability, performance, scalability
- Convenient for casual users (web interfaces), power users (APIs), and administrators (install, manage)
10. “I need to easily, quickly, & reliably move data to other locations.”
[Diagram: data moving among a research computing HPC cluster, a lab server, a campus home filesystem, a desktop workstation, a personal laptop, a DOE supercomputer, and the public cloud.]
11. “I need to get data from a scientific instrument to my analysis system.”
[Diagram: instruments such as a next-gen sequencer, a light sheet microscope, an MRI scanner, and the Advanced Light Source feeding an analysis system.]
12. “I need to easily and securely share my data with my colleagues.”
13. Globus and the research data lifecycle
1. Researcher initiates a transfer request from an instrument or compute facility, or the request is made automatically by a script or science gateway.
2. Globus transfers the files reliably and securely.
3. Researcher selects files to share, selects a user or group, and sets access permissions.
4. Globus controls access to the shared files on existing storage; there is no need to move files to cloud storage.
5. Collaborator logs in to Globus and accesses the shared files (for example, from a personal computer); no local account is required, and downloads go via Globus.
6. Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
7. Curator reviews and approves; the data set is published to a publication repository on campus or another system.
8. Peers and collaborators search for and discover data sets, then transfer and share them using Globus.
Stages: Transfer, Share, Publish, Discover
• SaaS: only a web browser is required
• Use the storage system of your choice
• Access using your campus credentials
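For step 1 ("requested automatically by script, science gateway") and steps 3-5, here is a minimal sketch of what such automation might look like using the Globus Python SDK (globus_sdk). The SDK shown is the current one, not necessarily what existed at the time of this talk, and every endpoint, token, and identity value is a placeholder:

    import globus_sdk

    # Placeholders: substitute real values
    TRANSFER_TOKEN = "..."       # OAuth2 access token for the Globus transfer service
    INSTRUMENT_ENDPOINT = "..."  # UUID of the instrument's Globus endpoint
    COMPUTE_ENDPOINT = "..."     # UUID of the compute facility's endpoint
    SHARED_ENDPOINT = "..."      # UUID of a Globus shared endpoint
    COLLABORATOR_ID = "..."      # Globus identity UUID of a collaborator

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    # Steps 1-2: a script- or gateway-initiated, managed transfer from
    # instrument to compute facility, with checksum-based verification
    tdata = globus_sdk.TransferData(tc, INSTRUMENT_ENDPOINT, COMPUTE_ENDPOINT,
                                    label="beamline run", sync_level="checksum")
    tdata.add_item("/detector/run/", "/scratch/run/", recursive=True)
    task = tc.submit_transfer(tdata)
    print("transfer task:", task["task_id"])

    # Steps 3-5: grant a collaborator read access to data left in place on a shared endpoint
    tc.add_endpoint_acl_rule(SHARED_ENDPOINT, {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": COLLABORATOR_ID,
        "path": "/run/",
        "permissions": "r",
    })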
14. Globus at a glance
4 major services
13 national labs use Globus services
100 PB transferred
8,000 active endpoints
20 billion files processed
>300 users active daily
25,000 registered users
99.95% uptime over the past two years
>30 subscribers
The biggest transfer to date is 1 petabyte
The longest-running transfer to date took 3 months
We’re eager to learn what you want to do with Globus services
20. Response has been gratifying
"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory
"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA
“…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing
"... we just had a 153 TB transfer that got 20 Gb/s and another with 144 TB at 25 Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead, National Center for Supercomputing Applications
"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory
"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user
"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Laboratory
"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory
"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user
"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team
21. Globus service APIs serve as a science platform
[Figure: the Globus platform stack. Globus APIs and Globus Connect, built atop the Globus Toolkit, expose identity, group, and profile management; file transfer and replication; file sharing; and data publication and discovery to science applications.]
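To illustrate what "APIs as a platform" means in practice, a portal or gateway can call the transfer service directly over HTTPS. The sketch below uses Python's requests library against the current transfer API; the bearer token and endpoint UUID are placeholders, and the two calls shown (directory listing and task listing) are only a small sample of what the platform exposes:

    import requests

    TOKEN = "..."     # bearer token obtained via Globus Auth (placeholder)
    ENDPOINT = "..."  # UUID of a Globus endpoint (placeholder)
    BASE = "https://transfer.api.globus.org/v0.10"
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # List a directory on a Globus endpoint, e.g., to render a file browser in a portal
    listing = requests.get(f"{BASE}/operation/endpoint/{ENDPOINT}/ls",
                           headers=HEADERS, params={"path": "/data/"})
    for entry in listing.json()["DATA"]:
        print(entry["type"], entry["name"])

    # Enumerate the user's recent transfer tasks, e.g., to show status in a dashboard
    tasks = requests.get(f"{BASE}/task_list", headers=HEADERS,
                         params={"limit": 10, "orderby": "request_time DESC"})
    for task in tasks.json()["DATA"]:
        print(task["task_id"], task["status"], task["label"])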
25. Operating a sustainable service
Globus is a not-for-profit service for researchers.
We adopt a subscription-supported freemium model: subscribers get extra features and rapid support.
We are engaged in crossing the chasm.
Support from DOE will contribute to long-term success.
27. Q: What is the biggest obstacle to data sharing in science?
A: The vast majority of data is lost or never put online; if online, it is not described; if described, it is not indexed. It is therefore not accessible, not discoverable, and not used.
Contrast this with common practice for consumer photos (iPhoto): automated capture, publish then curate, processing to add value, outsourced storage.
28. We must automate the capture, linking, and indexing of all data
The Globus publication service encodes and automates data publication pipelines.
Example application: the Materials Data Facility for materials simulation and experiment data.
Proposed distributed virtual collections index, organize, tag, and manage distributed data.
Think iPhoto on steroids, backed by domain knowledge and supercomputing power.
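To make the notion of an automated publication pipeline concrete, here is a hypothetical metadata record of the kind a collection such as the Materials Data Facility might require. The field names and values are illustrative only; each collection defines its own required Dublin Core and domain-specific metadata:

    # Hypothetical submission record for a materials-data collection.
    # Every field name and value below is illustrative, not an actual schema.
    dataset_record = {
        "dublin_core": {
            "title": "Diffuse scattering from a manganite sample",
            "creator": ["A. Researcher (ORCID: 0000-0000-0000-0000)"],
            "date": "YYYY-MM-DD",
            "relation": "doi:10.xxxx/accompanying-paper",  # related publication
        },
        "domain_metadata": {
            "keywords": ["diffuse scattering", "disordered structures"],
            "experiment": "APS single-crystal diffuse scattering",
            "sponsors": ["DOE Office of Science"],
            "proposal_number": "GUP-00000",  # APS General User Proposal number
        },
    }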
29. We must automate the capture, linking, and indexing of all data
chiDB: human-computer collaboration to extract Flory-Huggins (𝞆) parameters from the polymers literature. (R. Tchoua et al.)
Plenario: a spatially and temporally integrated, linked, and searchable database of urban data. (C. Catlett, B. Goldstein, T. Malik et al.)
30. “I need to publish my data so that others can find it and use it.”
[Figure: publication destinations ranging from scholarly publications and reference datasets to the research community and collaborations.]
46. Recall: A discovery engine for disordered structures
[Figure: the closed-loop discovery engine shown earlier. A sample's experimental scattering and material composition (La 60% / Sr 40%) are compared against a simulated structure and its simulated scattering; errors are detected in seconds to minutes, new experiments are selected in minutes to hours, and experiment-driven simulations run in minutes to days. A knowledge base of past experiments, simulations, literature, and expert knowledge supports knowledge-driven decision making and evolutionary optimization, and each cycle contributes back to the knowledge base. Diffuse scattering images from Ray Osborn et al., Argonne.]
47. Towards discovery engines for energy science (Argonne LDRD)
[Figure: a coupled experiment-computation pipeline. Data analysis (reconstruction, feature detection, auto-correlation, particle distributions) feeds simulation (characterize, predict, assimilate, steer data acquisition) through integration steps (optimize, fit, configure, check, guide). Workloads run in batch or immediate mode across 0.001 to 100+ PFlops, with representative tasks including precomputing material databases, image reconstruction, auto-correlation, and feature detection. Science automation services provide scripting, security, storage, cataloging, and transfer. Today's flows are roughly 0.001-0.5 GB/s each, with ~10 concurrent flows, ~2 GB/s total burst, and ~200 TB/month; all are expected to grow tenfold within five years.]
Scientific opportunities: probe material structure and function at unprecedented scales.
Technical challenges: many experimental modalities; data rates and computation needs that vary widely and are increasing; knowledge management, integration, and synthesis.
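A quick back-of-the-envelope check of these figures (illustrative arithmetic only, using the approximate numbers from the slide) shows why the infrastructure must be sized for bursts rather than for the sustained average:

    GB = 1e9
    TB = 1e12

    # Quoted figures: ~0.001-0.5 GB/s per flow, ~10 concurrent flows,
    # ~2 GB/s total burst, ~200 TB/month
    peak_if_all_flows_max = 10 * 0.5 * GB   # 5 GB/s if every flow hit its maximum
    burst = 2 * GB                          # quoted aggregate burst

    monthly_volume = 200 * TB
    seconds_per_month = 30 * 24 * 3600
    sustained = monthly_volume / seconds_per_month   # average rate implied by 200 TB/month

    print(f"sustained average: {sustained / GB:.3f} GB/s")            # ~0.077 GB/s
    print(f"fraction of burst rate: {sustained / burst:.1%}")         # ~4% duty cycle
    print(f"peak if all flows max out: {peak_if_all_flows_max / GB:.1f} GB/s")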
48. Linking experiment and computation
Single-crystal diffuse scattering (APS 6-ID): defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Structure is estimated via inverse modeling: many-simulation evolutionary optimization (populate, simulate, select) on 100K+ BG/Q cores (Swift + OpenMP). [Image: experimental and simulated scattering from manganite]
Near-field high-energy X-ray diffraction microscopy (APS 1-ID): microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on the APS cluster or months if the data are taken home. In one run it detected errors that would otherwise have wasted the entire beamtime. [Image: microstructure of a copper wire, 0.2 mm diameter]
X-ray nano/microtomography (APS 2-BM): bio, geo, and materials science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). An innovative in-slice parallelization method reconstructs a 360x2048x1024 dataset in ~1 minute on 32K BG/Q cores, vs. many days on a cluster, enabling quasi-instant response.
49. Tying it all together: An energy sciences infrastructure
0. Researchers develop a new script or reuse one from shared script libraries.
1. Run the script (here, on EL1.layer).
2. Look up the file in the collaboration catalogs (name=EL1.layer, user=Anton, type=reconstruction) to resolve its storage locations.
3. Transfer the inputs to a compute facility.
4. Run the application.
5. Transfer the results back to storage.
6. Update the collaboration catalogs (files and metadata, provenance), making the results discoverable by external collaborators.
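To make the control flow of steps 0-6 concrete, here is a hypothetical sketch in plain Python. In the actual infrastructure these roles are played by the Swift parallel scripting language, the collaboration catalogs, and Globus transfer; every function below is a stand-in, not a real API:

    # Hypothetical stand-ins for the catalog, transfer, and execution services on the slide.

    def catalog_lookup(name):
        """Step 2: resolve a logical dataset name to its physical storage location."""
        return {"name": name, "location": "storage-site:/datasets/" + name}

    def submit_transfer(src, dst):
        """Steps 3 and 5: managed data movement (in practice, via Globus)."""
        print(f"transfer {src} -> {dst}")

    def run_app(app, inputs):
        """Step 4: run the application at a compute facility."""
        print(f"run {app} on {inputs}")
        return "compute-site:/scratch/EL1.layer.out"

    def catalog_update(record):
        """Step 6: record outputs, metadata, and provenance in the collaboration catalogs."""
        print("catalog record:", record)

    # Steps 1-6 for the invocation shown on the slide (EL1.layer)
    dataset = catalog_lookup("EL1.layer")
    submit_transfer(dataset["location"], "compute-site:/scratch/EL1.layer")
    result = run_app("reconstruction", "compute-site:/scratch/EL1.layer")
    submit_transfer(result, "storage-site:/results/EL1.layer.out")
    catalog_update({"input": dataset["name"], "output": result,
                    "type": "reconstruction", "user": "Anton"})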
50. Summary: Big opportunities and challenges for energy data
[Figure: the discovery-engine cycle of problem specification, modeling and simulation, analysis and visualization, experimental design, high-throughput experiments, informatics and analysis, and integrated databases.]
Immediate opportunities: reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities; develop new services to capture and link energy data.
Important research agenda: discovery engines to answer major scientific questions; new research modalities linking computation and data; organization and analysis of massive science data.
51. Thank you to our sponsors!
U.S. Department of Energy
52. For more information: foster@anl.gov
Thanks to co-authors and the Globus team
Globus services (globus.org)
Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. IEEE Cloud Computing 1(3):46-55, 2014.
Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency and Computation: Practice and Experience 27(2):290-305, 2014.
Publication (globus.org/data-publication)
Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
Editor's Notes
One useful thing
One exciting initiative
New tools are needed to answer the most pressing scientific questions
Accelerate “knowledge turns.”
Unleash the 99% of not-easily-accessible data.
Integrate data and computation.
Fix IMAGE
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
Challenge: takes months to do a single loop through cycle.
Just as important, it is an incredibly labor intensive and expensive process.
Change to OLCF, NERSC
And sys admin response?
Other potential quotes:
"Thanks to Globus it's easy for our users to move big files. ...Globus is an awesome tool that is really helping our user community." - John Hanks, Senior HPC Analyst, University of Colorado
"I am very impressed - Globus is the most beneficial grid technology I have ever seen." - Steven Gottlieb, Indiana University
"With Globus, I’m averaging 40 Mb/s and can even reach 400 Mb/s on occasion – that’s insanely fast!" - Luke Van Roekel, University of Colorado
"I have been very impressed with Globus - both the speed and ease of use." - William Daughton, Los Alamos National Laboratory – User of the Month, May 2012
"Without a service such as Globus it would have been basically impossible to move this large amount of data." - Katrin Heitmann, LANL and Argonne National Laboratory
"The biggest benefit to Globus by far is the auto performance tuning. … Globus is an invaluable tool to me." - Luke Van Roekel, University of Colorado
Highlight XSEDE’s planned adoption of user, group and profile management
RDA: outsource data sharing and transfer
kBase: Outsource identity and group management
The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission.
"The Scientist" will now start a new submission.
The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research.
Note: "The Scientist" can only see collections he is allowed to publish to.
"The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined.
Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication.
Note: "The Scientist" has missed an ORCID for one of his co-authors.
The second type of metadata required by the CNM relates to the materials science research at the Advanced Photon Source.
Here, "The Scientist" enters information such as keywords describing the dataset, information about the sponsors who funded this research, a description of the dataset, the experiment name, the materials analyzed in this dataset, the energy density of the materials (this is important for research into battery development) and the Argonne General User Proposal (GUP) number. The GUP number is a unique identifier for all beam time allocations at the APS and is used by administrators to associate researchers, experiments, and allocations.
All of this entered information can be subsequently used by other researchers with appropriate access to discover this dataset.
Having described the dataset, "The Scientist" must now assemble the dataset. To do so, he first chooses to select the files to be published.
Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11).
This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist"
The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy.
At the completion of the submission the permissions on the endpoint will be changed such that the dataset is immutable. "The Scientist” will be given read access to the dataset, collection curators will also be given read access to the data so that they can view the contents.
When "The Scientist" is happy with his assembled dataset, he can return to the publication workflow. Here, he sees a summary of the dataset and may confirm the correct file sizes and names are associated. The system attempts to determine the file types for each of the dataset’s files.
"The Scientist" can choose to edit, remove or add files if necessary.
When submitted, the dataset now enters a pre-determined curation workflow. "The Scientist” can check the progress of the submission through his dashboard. If any further attention is required, it will be displayed through his dashboard.
“The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags.
Each of these fields can be used to search for a particular dataset.
Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from two collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
Having found the desired published dataset, “The Researcher” can navigate to the summary page.
Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
Description: another aspect - general metadata (Dublin Core) and scientific metadata
Curation: another aspect – self, project owner, librarian
Fix IMAGE
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
Challenge: takes months to do a single loop through cycle.
Just as important, it is an incredibly labor intensive and expensive process.
Add CNM
- Innovative in-slice parallelization method permits reconstruction of 720x2160x2560 dataset (7-BM) in less than 3 minutes (for each iteration), using 34K BG/Q cores, vs. many days on typical cluster.
Innovative in-slice parallelization for iterative algorithms permits large-scale image reconstruction. Execution times are reduced to minutes for many large datasets and algorithms using 32K BG/Q cores, vs. many days on typical cluster.
This diagram shows the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity.
It shows the major components of the CMTS cyberinfrastructure we are integrating.
Here’s how CMTS will use it and how it will help them.
0. Develop script
A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services.
The Swift parallel scripting language is central to our approach, because it imparts a uniform, high-level interface to script components, and it is implicitly parallel.
Run script: When Swift runs a script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) it manages parallel execution (dataflow, throttling, etc.); 2) it abstracts the interfaces to diverse and distributed clusters; 3) it automates data transfer; 4) it records provenance; 5) it retries failing application runs. All of these would otherwise have to be programmed manually, if they were done at all.
Locate input file locations via dataset catalog: CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated.
Transfer inputs: Swift will automatically transport input datasets to the selected computational resource for an application run (if needed).
Run app: Swift will then run the application, retrying failures (if requested) and recording a provenance log that traces where the app ran, with what runtime and memory usage, and with what arguments and environment settings.
Transfer results: Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed).
Update catalogs: Swift will also update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs.
Collaborate! (2 clicks): All of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
Talk about Globus as being part of UChicago + ANL, as well as other context setting about how this work came about and is funded.