Ian Foster
Accelerating discovery
via science services
Life Sciences and Biology
Advanced Materials
Condensed Matter Physics
Chemistry and Catalysis
Soft Materials
Environmental and Geosciences
Can we determine pathways that lead to novel states and nonequilibrium assemblies?
Can we observe – and control – nanoscale chemical transformations in macroscopic systems?
Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?
Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?
Can we unravel the secrets of biological function – across length scales?
Can we understand physical and chemical processes in the most extreme environments?
We want to accelerate progress
on the most pressing questions
The discovery process is iterative and time-consuming:
Pose question → Design experiment → Collect data → Analyze data →
Identify patterns → Hypothesize explanation → Test hypothesis → Publish results
J. C. R. Licklider, 1960: “About 85% of my ‘thinking’ time was spent getting into a position to think, to make a decision, to learn something I needed to know.”
Outsourcing: for economies of scale in the use of automated methods
Automation: to apply more sophisticated methods at larger scales
Outsourcing and automation: (1) The Grid
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks
Outsourcing and automation: (2) The Cloud
The Software as a Service (SaaS) revolution
Customer relationship management (CRM): a knowledge-intensive process, historically handled manually or via expensive, inflexible on-premise software
SaaS has revolutionized how CRM is consumed:
 Outsource to a provider who runs the software on the cloud
 Access via simple interfaces
SaaS beats on-premise software on ease of use, cost, and flexibility
Where can we automate and
outsource in science broadly?
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data
Automate and outsource → science services
Many services are used by
science, but have limitations
Science services exist, but do not address the whole life cycle
Accelerating
discovery
via science services
(1) Eliminate data friction
The elimination of data friction is key to faster discovery
“Civilization advances by extending the number of important operations which we can perform without thinking about them” (Whitehead, 1912)
Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005–2015)
We have the highways but not the delivery service
Our highways encompass the Internet, ultra-high-speed networks, science DMZs, data transfer nodes, and high-speed transport protocols
A good delivery service automates, schedules, accelerates, and adapts. It provides APIs for experts and casual users. It cuts costs and saves time.
Globus: Research data
management as a service
Essential research data
management services
 File transfer
 Data sharing
 Data publication
 Identity and groups
Builds on 15 years of DOE
research
Outsourced and automated
 High availability, reliability,
performance, scalability
 Convenient for
 Casual users: Web interfaces
 Power users: APIs
 Administrators: Install, manage
globus.org
“I need to easily, quickly, & reliably
move data to other locations.”
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
DOE supercomputer
Public Cloud
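The outsourced, automated transfer described above can be sketched as a toy client. Every name here (Endpoint, TransferClient, the fault model, the result dict) is an illustrative stand-in, not the real Globus SDK; the point is that the service, not the researcher, rides out transient failures.

```python
class Endpoint:
    """Illustrative storage endpoint: just a name and a path -> bytes map."""
    def __init__(self, name):
        self.name = name
        self.files = {}

class TransferClient:
    """Toy managed-transfer client that retries transient faults automatically."""
    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    def submit_transfer(self, src, dst, paths, faults=None):
        # `faults` maps a path to how many of its first sends should fail,
        # simulating the flaky network a real service must absorb.
        faults = dict(faults or {})
        moved = 0
        for path in paths:
            for attempt in range(self.max_retries + 1):
                if faults.get(path, 0) > 0:
                    faults[path] -= 1          # transient fault: retry
                    continue
                dst.files[path] = src.files[path]
                moved += 1
                break
            else:
                raise RuntimeError(f"gave up on {path}")
        return {"status": "SUCCEEDED", "files_moved": moved}

laptop = Endpoint("Personal Laptop")
hpc = Endpoint("Research Computing HPC Cluster")
laptop.files["/data/run1.h5"] = b"\x00" * 1024

client = TransferClient()
result = client.submit_transfer(laptop, hpc, ["/data/run1.h5"],
                                faults={"/data/run1.h5": 2})
```

The user sees only the final "SUCCEEDED"; the two simulated faults are handled inside the service, which is the whole argument for outsourcing this step.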
One APS node
connects to
125 locations
“I need to get data from a scientific
instrument to my analysis system.”
Next Gen
Sequencer
Light Sheet Microscope
MRI
Advanced Light Source
“I need to easily and securely
share my data with my colleagues.”
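In-place sharing can be sketched as a toy access-control list attached to existing storage: the data never moves, and each read is checked against the rules. The class and rule format are invented for illustration; they are not the real service's API.

```python
class SharedEndpoint:
    """Data stays on the existing storage; an ACL grants access in place."""
    def __init__(self, files):
        self.files = dict(files)     # path -> content, never copied elsewhere
        self.acl = []                # (principal, path_prefix, permissions)

    def add_access_rule(self, principal, prefix, perms="r"):
        self.acl.append((principal, prefix, perms))

    def read(self, principal, path):
        # Grant the read only if some rule covers this principal and path.
        for who, prefix, perms in self.acl:
            if who == principal and path.startswith(prefix) and "r" in perms:
                return self.files[path]
        raise PermissionError(f"{principal} may not read {path}")

lab = SharedEndpoint({"/project/results/run7.csv": "t,signal\n0,1.2\n"})
lab.add_access_rule("collaborator@other.edu", "/project/results/")

# Collaborator reads in place: no local account, no copy to cloud storage.
data = lab.read("collaborator@other.edu", "/project/results/run7.csv")
```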
Globus and the research data lifecycle
1. Researcher initiates a transfer request, or one is requested automatically by a script or science gateway
2. Globus transfers files reliably and securely (Instrument → Compute Facility)
3. Researcher selects files to share, selects a user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus (Personal Computer)
6. Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system (Publication Repository)
8. Peers and collaborators search and discover datasets; transfer and share using Globus
Transfer, Share, Publish, Discover
• SaaS: only a web browser required
• Use the storage system of your choice
• Access using your campus credentials
Globus and DOE:
Terabytes per month
Globus by the numbers
 5 major services
 130 federated campus IdPs
 115 petabytes transferred
 8,000 managed storage systems
 20 billion files processed
 99.95% uptime over the past 2 years
 25,000 registered users
 >30 institutional subscribers
 13 national labs use the services
 Longest transfer: 3 months
 Biggest transfer: 1 petabyte
 Most files in one transfer: 50M
Accelerating
discovery
via science services
(2) Create platform services
Globus service APIs provide
elements of a science platform
Identity, Group, and
Profile Management
…
Globus Toolkit
Globus APIs
Globus Connect
Data Publication & Discovery
File Sharing
File Transfer & Replication
Publication as a service for the ACME climate modeling consortium
kbase.us
Accelerating
discovery
via science services
(3) Liberate scientific data
Q: What is the biggest obstacle
to data sharing in science?
A: The vast majority of data is lost or not online;
if online, not described;
if described, not indexed.
Not accessible → not discoverable → not used
Contrast with common practice
for consumer photos (iPhoto)
 Automated capture
 Publish then curate
 Processing to add value
 Outsourced storage
We must automate the capture,
linking, and indexing of all data
Globus publication service
encodes and automates data
publication pipelines
Example application: Materials
Data Facility for materials
simulation and experiment data
Proposed distributed virtual
collections index, organize,
tag, & manage distributed data
Think iPhoto on steroids –
backed by domain knowledge
and supercomputing power
We must automate the capture,
linking, and indexing of all data
chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from the polymers literature
R. Tchoua et al.
Plenario: Spatially and
temporally integrated, linked,
and searchable database of
urban data
C. Catlett, B. Goldstein, T. Malik et al.
Flory-Huggins parameters liberated!
R. Tchoua, J. De Pablo
“I need to publish my data so that
others can find it and use it.”
Scholarly
Publication
Reference
Dataset
Research
Community
Collaboration
Publish dashboard
Configuring a publication pipeline: publication “facets”
 identifier: URL | Handle | DOI
 description: none | standard | custom | domain-specific
 curation: none | acceptance | machine-validated | human-validated
 access: anonymous | public | collaborators | embargoed
 preservation: transient | project lifetime | “forever” | archive
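The facets amount to a small configuration space: a pipeline is one choice per facet. A sketch of validating such a configuration follows; the dict layout and function name are illustrative assumptions, not the Globus publication API.

```python
# Facet names and options come from the slide; the validator is invented.
FACETS = {
    "identifier":   {"URL", "Handle", "DOI"},
    "description":  {"none", "standard", "custom", "domain-specific"},
    "curation":     {"none", "acceptance", "machine-validated", "human-validated"},
    "access":       {"anonymous", "public", "collaborators", "embargoed"},
    "preservation": {"transient", "project lifetime", "forever", "archive"},
}

def validate_pipeline(config):
    """Check that a publication-pipeline config picks one valid option per facet."""
    for facet, options in FACETS.items():
        if config.get(facet) not in options:
            raise ValueError(f"{facet}: expected one of {sorted(options)}")
    return True

# e.g. a curated, DOI-issuing, collaborator-only, archived pipeline:
ok = validate_pipeline({
    "identifier": "DOI",
    "description": "domain-specific",
    "curation": "human-validated",
    "access": "collaborators",
    "preservation": "archive",
})
```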
Accelerating
discovery
via science services
(4) Create discovery engines
Data-driven science requires collaborative discovery engines
[Cycle diagram, Rick Stevens: problem specification; experimental design; high-throughput experiments; informatics analysis; integrated databases; modeling and simulation; analysis & visualization]
Example: a discovery engine for disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample → experimental scattering; material composition (La 60%, Sr 40%) → simulated structure → simulated scattering
Detect errors (secs–mins); select experiments (mins–hours); simulations driven by experiments (mins–days); contribute to knowledge base
Knowledge base: past experiments, simulations, literature, expert knowledge
Knowledge-driven decision making; evolutionary optimization
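The evolutionary-optimization loop at the heart of such an engine can be sketched in a few lines: propose candidate structures, simulate their scattering, score against the experimental pattern, and keep improvements. The forward model below is a deliberately trivial one-parameter toy (an amplitude times cos q), not a physics code, and the true parameter value is planted so the loop has something to recover.

```python
import math, random

def simulate_scattering(amplitude, q_values):
    # Toy forward model standing in for a structure simulation.
    return [amplitude * math.cos(q) for q in q_values]

def mismatch(sim, obs):
    # Sum-of-squares disagreement between simulated and observed patterns.
    return sum((s - o) ** 2 for s, o in zip(sim, obs))

def evolve(observed, q_values, generations=200, pop=20, sigma=0.1, seed=0):
    """Simple (1+pop) evolution strategy: mutate the best candidate, keep improvements."""
    rng = random.Random(seed)
    best = rng.uniform(0.0, 5.0)
    best_err = mismatch(simulate_scattering(best, q_values), observed)
    for _ in range(generations):
        for _ in range(pop):
            cand = best + rng.gauss(0.0, sigma)
            err = mismatch(simulate_scattering(cand, q_values), observed)
            if err < best_err:
                best, best_err = cand, err
    return best, best_err

q = [0.1 * i for i in range(1, 64)]
observed = simulate_scattering(2.5, q)   # stand-in "experimental" pattern
param, err = evolve(observed, q)
```

In the real engine the inner call is a supercomputer-scale simulation and the "observed" pattern comes off the beamline, which is why each loop iteration takes minutes to days rather than microseconds.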
Integrate data movement, management, workflow,
and computation to accelerate data-driven
applications, organize data for efficient use
New architectures and methods create
opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data
New computer facilities enable on-demand
computing and high-speed analysis of large
quantities of data
Towards discovery engines for energy science (Argonne LDRD)
Simulation: characterize, predict, assimilate; steer data acquisition
Data analysis: reconstruct, detect features, auto-correlate, compute particle distributions, …
Science automation services: scripting, security, storage, cataloging, transfer
Data rates: ~0.001–0.5 GB/s per flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (today; ×10 in 5 years)
Integration: optimize, fit, …; configure, check, guide; batch and immediate computing from 0.001 to 100+ PFlops
Representative tasks: precompute material database; reconstruct image; auto-correlation; feature detection
Scientific opportunities
 Probe material structure and function at unprecedented scales
Technical challenges
 Many experimental modalities
 Data rates and computation needs vary widely and are increasing
 Knowledge management, integration, synthesis
Linking experiment and computation (Advanced Photon Source beamlines 2-BM, 1-ID, 6-ID)
Single-crystal diffuse scattering: defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP). [Experimental and simulated scattering from manganite]
Near-field high-energy X-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on the APS cluster or months if data are taken home. Used to detect errors in one run that would otherwise have wasted the entire beamtime. [Microstructure of a copper wire, 0.2 mm diameter]
X-ray nano/microtomography: bio, geo, and materials science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). An innovative in-slice parallelization method reconstructs a 360×2048×1024 dataset in ~1 minute on 32K BG/Q cores, vs. many days on a cluster, enabling quasi-instant response.
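The in-slice parallelization idea rests on the fact that slices are independent, so reconstruction maps cleanly onto many cores. A minimal sketch follows, with a placeholder averaging kernel standing in for a real reconstruction algorithm (filtered back-projection or an iterative solver) and a thread pool standing in for BG/Q cores.

```python
from concurrent.futures import ThreadPoolExecutor

def reconstruct_slice(projection_rows):
    """Placeholder per-slice kernel: average the projection rows column-wise.
    A real code would run filtered back-projection or an iterative solver."""
    n = len(projection_rows)
    return [sum(col) / n for col in zip(*projection_rows)]

def reconstruct_volume(per_slice_projections, workers=4):
    # Each slice reconstructs independently, so the whole volume is an
    # embarrassingly parallel map over slices.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reconstruct_slice, per_slice_projections))

volume = reconstruct_volume([
    [[1.0, 3.0], [3.0, 5.0]],   # projections for slice 0
    [[0.0, 0.0], [2.0, 4.0]],   # projections for slice 1
])
```

At scale the same decomposition is expressed with MPI or Swift across nodes rather than threads, but the speedup argument is identical: slice count times per-slice cost divided by available cores.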
Rapid assessment of alignment quality in high-energy diffraction microscopy
A single workflow spanning the detector, the Orthros cluster (all data in NFS), and a Blue Gene/Q, with Globus Transfer for data movement, ssh for control, the Globus Catalog for scientific metadata and workflow progress, and workflow control via a Bash script plus manual steps:
 Dataset: 360 files, 4 GB total; reduced dataset: 360 files, 5 MB total; feedback to experiment
 1. Median calculation: MedianImage.c, 75 s (90% I/O); uses Swift/K
 2. Peak search: ImageProcessing.c, 15 s per file; uses Swift/K
 3. Generate parameters: FOP.c, 50 tasks at 25 s/task, ¼ CPU hour; uses Swift/K
 3. Convert bin L to N: 2 minutes for all files; converts files to network-endian format
 4. Analysis pass: FitOrientation.c, 60 s/task, 1,667 CPU hours (PC or BG/Q); uses Swift/T
Up to 2.2 M CPU hours per week!
(Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer)
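A quick consistency check on the stage costs above. Note that the ~100,000-task count for the analysis pass is inferred from 1,667 CPU hours at 60 s/task; the slide does not state it directly.

```python
def cpu_hours(tasks, seconds_per_task):
    """Total CPU hours for `tasks` independent tasks."""
    return tasks * seconds_per_task / 3600

# Parameter-generation stage: 50 tasks at 25 s/task comes to ~0.35 CPU hours,
# on the order of the quarter CPU hour quoted on the slide.
stage3 = cpu_hours(50, 25)

# Analysis pass: 1,667 CPU hours at 60 s/task implies roughly 100,000 tasks
# per pass (an inference, not a stated figure).
tasks_stage4 = 1667 * 3600 / 60
```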
Science services raise research
and policy questions
 What else can we automate and outsource?
 How do we choose opportunities?
 How do we measure success?
 How must our computer systems evolve?
 High-capacity discovery engines: where, how?
 What will science become in a services era?
 Will it be more democratic? Collaborative?
Entrepreneurial? More or less creative?
 What are implications for trust and reproducibility?
What would Beer say?
“The question which asks how to use the computer in the enterprise, is, in short, the wrong question. A better formulation is to ask how the enterprise should be run given that computers exist. The best version of all is the question asking what, given computers, the enterprise now is.” – Stafford Beer, Brain of the Firm, 1972
Opportunities and challenges
for discovery acceleration
Immediate opportunities
 Reduce data friction and
accelerate discovery by
applying Globus services
across DOE facilities
 Develop new services to
capture, link science data
Important research agenda
 Discovery engines to answer
major scientific questions
 New research modalities
linking computation and data
 Organization and analysis of
massive science data
Thank you to our sponsors!
U.S. DEPARTMENT OF
ENERGY
For more information: foster@anl.gov
Thanks to co-authors and Globus team
Globus services (globus.org)
 Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
 Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer,
Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55,
2014.
 Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for
Collaborative Science Applications. Concurrency - Practice and Experience,
27(2):290-305, 2014.
Publication (globus.org/data-publication)
 Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
 Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde,
M. and Wozniak, J. Networking materials data: Accelerating discovery at an
experimental facility. Big Data and High Performance Computing, 2015.
Questions?
foster@anl.gov

Editor's Notes

  • #2 One useful thing One exciting initiative
  • #3 New tools are needed to answer the most pressing scientific questions
  • #4 The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th Century. Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data—a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Academic research is a cottage industry—albeit one that is increasingly interconnected—and is likely to stay that way.
  • #5 Thanks.
  • #7 Need for entirely new instruments, computing infrastructure, organizational structures 173 TB/day
  • #17 Change to OLCF, NERSC
  • #26 Highlight XSEDE’s planned adoption of user, group and profile management
  • #27 RDA: outsource data sharing and transfer
  • #36 The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission. "The Scientist" will now start a new submission.
  • #37 Description: another aspect - general metadata (Dublin Core) and scientific metadata Curation: another aspect – self, project owner, librarian
  • #39 Accelerate “knowledge turns.” Unleash the 99% of not-easily-accessible data. Integrate data and computation.
  • #40 Fix IMAGE “Most of materials science is bottlenecked by disordered structures”—Littlewood. Solve inverse problem. How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. Challenge: takes months to do a single loop through cycle. Just as important, it is an incredibly labor intensive and expensive process.
  • #41 This picture shows the big picture.
  • #43 Add CNM - Innovative in-slice parallelization method permits reconstruction of 720x2160x2560 dataset (7-BM) in less than 3 minutes (for each iteration), using 34K BG/Q cores, vs. many days on typical cluster. Innovative in-slice parallelization for iterative algorithms permits large-scale image reconstruction. Execution times are reduced to minutes for many large datasets and algorithms using 32K BG/Q cores, vs. many days on typical cluster.
  • #44 DS, NF-HEDM, FF-HEDM, PD workflows operational Catalog integrated into workflow, supports rich user interface Workflows use large-scale compute resources outside of APS Data publication service demonstrated Parallel algs for 3-D image reconstruction, structure determination, etc. Globus Galaxies platform integrated with Swift for scalability
  • #48 Talk about the Globus as being part of UChicago + ANL, as well as other context setting about how this work came about and is funded
  • #50 Thanks.