2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Trust threads: Provenance for Data Reuse in Long Tail Science (Beth Plale)
Invited colloquium talk, Apr 23, 2015, Dept. of Information and Library Science, School of Informatics and Computing, Indiana University. Abstract: The world contains a vast amount of digital information, which grows ever vaster ever more rapidly. This makes it possible to do many things on an unprecedented scale: spot social trends, prevent diseases, increase fresh water supplies, accelerate innovation, and so on. As science and technology innovation is essential to improved public health and welfare, the growing sources of data can unlock more secrets. But the rapid growth of data makes accountability and transparency of research increasingly difficult. Data that are not adequately described are not usable except within the research lab that produced them. Data that are intentionally or unintentionally inaccessible, or difficult to access and verify, are not available to contribute to new forms of research. In this talk I show that data can carry with it thin threads of information that connect it to both its past and its future, forming its lineage, particularly as it transitions into a shareable dataset residing in a public repository. In carrying this minimal provenance, the data becomes more trustworthy. This thread of trust is a critical element to the successful sharing, use, and reuse of big data in science and technology research in the future.
Trust Threads: Active Curation and Publishing in SEAD (Beth Plale)
Describes Trust Threads, a minimalist approach to provenance capture that enhances the trustworthiness of published data, implemented as part of SEAD's Active Curation and Publishing Services. Presented at the National Data Integrity Conference, Ft. Collins, Colorado, May 2015.
Presentation of Science 2.0 at the European Astronomical Society (osimod)
The document discusses Science 2.0 and the emerging open science ecosystem. It provides three examples of open science projects: Galaxy Zoo, which had volunteers classify galaxies; Synaptic Leap, which published all data and experiments online to identify a new drug; and a paper on debt and growth that was found to have errors after its data and methods were shared. It then outlines various aspects of open science like open data, citizen science, and mass collaboration.
This document discusses licensing research data for reuse. It begins with a scenario in which a user has downloaded a dataset but is unsure what they can do with the data because its licensing is unclear. It then explains that licensing is critical to enabling data reuse and citation. It provides information on AusGOAL, the Australian open access and licensing framework, and notes that it is recommended for data publishing by ANDS partners. It also includes links to licensing guides and FAQs. In summary, the document emphasizes the importance of data licensing for enabling reuse and outlines Australia's recommended licensing system.
The document discusses open data and data sharing, including defining open data, the benefits of open data, overcoming barriers to opening data such as concerns about scooping and sensitive data, best practices for making data open through formats, licensing and description, and the role of research databases and data citation in promoting open data.
Data, Data Everywhere: What's a Publisher to Do? (Anita de Waard)
The document discusses publishers' roles in data sharing and challenges in open science. It notes that while most scientists agree access to others' data would benefit research, fewer are willing to share their own data due to lack of training and incentives. Publishers are working to establish data sharing guidelines and integrate platforms to store, share, and analyze research data and tools. However, many questions remain around publishing data science given distributed and interconnected data, tools, and knowledge networks. Publishers will need to transition from pipelines to platforms and enable these new network effects.
Massive-Scale Analytics Applied to Real-World Problems (inside-BigData.com)
In this deck from PASC18, David Bader from Georgia Tech presents: Massive-Scale Analytics Applied to Real-World Problems.
"Emerging real-world graph problems include: detecting and preventing disease in human populations; revealing community structure in large social networks; and improving the resilience of the electric power grid. Unlike traditional applications in computational science and engineering, solving these social problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for research on scalable algorithms and development of frameworks for solving these real-world problems on high performance computers, and for improved models that capture the noise and bias inherent in the torrential data streams. In this talk, Bader will discuss the opportunities and challenges in massive data-intensive computing for applications in social sciences, physical sciences, and engineering."
Watch the video: https://wp.me/p3RLHQ-iPk
Learn more: https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This presentation was provided by Joe Zucca of the University of Pennsylvania, during Session Five of the NISO event "Assessment Practices and Metrics for the 21st Century," held on November 22, 2019.
A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
This document provides an overview of advanced research computing resources and services available to researchers at the University of York. It describes the research computing facilities including research0, the York Advanced Research Computing Cluster (YARCC), the regional N8 HPC facility, and the national ARCHER HPC service. It also covers storage, virtual machines, databases, software, support and training resources, research data management, and includes case studies of researchers using the facilities. The resources aim to support researchers by providing computing power for complex analysis and large datasets that is faster and more productive than standard desktop computers.
This document provides an overview of the research conducted by the NGSP Group at Swinburne University of Technology on cloud computing and workflow technologies. The group conducts research on data management in cloud computing, performance management in scientific workflows, security and privacy protection in the cloud, and their SwinDeW-C cloud workflow system. Specific topics studied include data storage, placement and replication strategies, temporal quality of service in workflows, and verifying temporal constraints in scientific workflows. The goal is to develop cost-effective and high performance techniques for complex software systems and services in cloud computing environments.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
Big Data as a Catalyst for Collaboration & Innovation (Philip Bourne)
Big data is disrupting biomedical research through digitization of data sources. The National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative to support this disruption. BD2K funds various programs including data sharing policies, data science training, and the development of shared infrastructure and standards. This infrastructure includes the "Commons" which would provide discoverable, accessible, interoperable and reusable research objects to catalyze collaboration using open APIs and computing platforms. SRP could interact with BD2K through initiatives like open science competitions, data standards development, and leadership in trans-NIH big data efforts.
The document provides an overview of the development of the NIH Data Commons. It discusses factors driving the need for a data commons, including large amounts of data being generated and increased support for data sharing. It outlines the goals of making data findable, accessible, interoperable and reusable. Several pilots are exploring the feasibility of the commons framework, including placing large datasets in the cloud and developing indexing methods. Considerations in fully realizing the commons are also discussed, such as standards, discoverability, policies and incentives.
This document discusses big data, including its characteristics of volume, velocity, and variety. It outlines challenges of big data such as privacy and security issues, analytical challenges, and the technical challenges of storing, transferring, and processing large datasets. Advantages such as understanding customers and optimizing processes are also presented. The conclusion emphasizes that addressing these challenges is key to realizing value from big data through talent, teams, and analytics-based decisions.
This document discusses Science 2.0 and the shift towards more open and collaborative ways of conducting science. It provides three examples of Science 2.0 projects: Galaxyzoo, which had over 150,000 volunteers classify galaxies; Synaptic Leap, which published all data and experiments online to collaborate on finding new drug treatments; and a study on government debt that was found to have coding errors after others accessed the original data. The document argues that Science 2.0 involves more than just open access, and includes data-intensive science, citizen science, open code, and open lab books/workflows. It discusses how different Science 2.0 practices are growing at different rates and the implications this shift has for scientific outputs, methods,
Australia's Environmental Predictive Capability (TERN Australia)
Federating world-leading research, data and technical capabilities to create Australia's National Environmental Prediction System (NEPS).
Community consultation presentation.
3-12 February 2020
Dr Michelle Barker (Facilitator)
(Presentation v5)
Facilitating good research data management practice as part of scholarly publ... (Varsha Khodiyar)
Presentation given to the SciDataCon #IDW2018 session: Democratising Data Publishing: A Global Perspective, on Tuesday 6th November 2018, Gaborone, Botswana
Presentation from the 2013 Bio-IT World conference. It describes the design and implementation of data and compute infrastructure for the New York Genome Center.
A look back at how the practice of data science has evolved over the years, modern trends, and where it might be headed in the future. Starting from before anyone had the title "data scientist" on their resume, to the dawn of the cloud and big data, and the new tools and companies trying to push the state of the art forward. Finally, some wild speculation on where data science might be headed.
Presentation given to Seattle Data Science Meetup on Friday July 24th 2015.
Real-World Data Challenges: Moving Towards Richer Data Ecosystems (Anita de Waard)
The document discusses trends in scientific data repositories and ecosystems. It notes that repositories are becoming more like virtual laboratories where scientists can conduct research. It also discusses how artificial intelligence and machine learning are being used to complement human discovery and analysis of large and complex datasets. The document raises several challenges around issues such as data ownership, rewards for data sharing and software development, and the roles of various stakeholders in research data management.
Data Repositories: Recommendation, Certification and Models for Cost Recovery (Anita de Waard)
Talk at the NITRD Workshop "Measuring the Impact of Digital Repositories", February 28 - March 1, 2017. https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
A 25-minute talk from a panel on big data curricula at JSM 2013.
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Revolutionising the Journal through Big Data Computational Research (Amye Kenall)
BioMed Central is an open access publisher that publishes over 260 journals annually covering fields like genomics, computational biology, and public health. The document discusses BioMed Central's efforts to revolutionize journals through encouraging data reuse and reproducibility in computational research. This includes providing datasets used in articles, applying DOIs to additional files to improve searchability and citation, and exploring options like interactive tabular data and virtual machines to facilitate replicating analyses. Challenges discussed include balancing included versus external data sizes, dataset versioning, and encouraging author data sharing.
A brief overview of the development and current workflows for Research Data Management at Imperial College London, presented to colleagues at the University of Copenhagen and Roskilde University in Denmark.
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a... (SEAD)
This document discusses the Sustainable Environment Actionable Data (SEAD) project, which aims to lower the costs and increase the value of data curation through a data lifecycle approach. SEAD provides lightweight data services to support sustainability research, including secure project workspaces, active and social curation tools, and integrated lifecycle support for data from ingest to long-term preservation. By leveraging technologies like Web 2.0 and standards, SEAD simplifies and automates curation processes using metadata captured from data producers and users. This allows curation activities to begin earlier in the data lifecycle and be distributed across researchers and curators.
If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That's not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than of diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
Big Data HPC Convergence and a bunch of other things (Geoffrey Fox)
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
Birgit Plietzsch, "RDM within research computing support", SALCTG June 2013 (SALCTG)
An overview of Research Data Management: the research process from developing ideas to preservation of data; funder perspectives, the impact on the wider service, Data Asset Frameworks, preservation and access, and cost implications.
The document provides an overview of the data analytics process (lifecycle). It discusses the key phases in the lifecycle including discovery, data preparation, model planning, model building, communicating results, and operationalizing. In the discovery phase, stakeholders analyze business trends and domains to build hypotheses. In data preparation, data is explored, preprocessed, and conditioned to create an analytics sandbox. This involves extract, transform, load processes to prepare the data for analysis.
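As a toy illustration of the data preparation phase described above, here is a minimal Python sketch of an extract-transform-load step using pandas; the file names and column names are hypothetical, and a real analytics sandbox would of course involve far more conditioning.

# Toy illustration of the ETL step in the data preparation phase.
# The file names and column names below are hypothetical.
import pandas as pd

# Extract: pull raw records into the analytics sandbox.
raw = pd.read_csv("raw_events.csv")  # hypothetical input file

# Transform: condition the data for analysis.
clean = (
    raw.dropna(subset=["user_id", "amount"])  # drop incomplete rows
       .assign(amount=lambda df: df["amount"].clip(lower=0))  # no negative amounts
)
clean["event_date"] = pd.to_datetime(clean["event_date"])

# Load: persist the conditioned table for the model-planning phase.
clean.to_parquet("sandbox/events_clean.parquet")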
PAARL's 1st Marina G. Dayrit Lecture Series held at UP's Melchor Hall, 5F, Proctor & Gamble Audiovisual Hall, College of Engineering, on 3 March 2017, with Albert Anthony D. Gavino of Smart Communications Inc. as resource speaker on the topic "Using Big Data to Enhance Library Services"
Unlock Your Data for ML & AI using Data Virtualization (Denodo)
How Denodo complements a logical data lake in the cloud (a sketch of a client query follows the list):
- Denodo does not substitute for data warehouses, data lakes, ETLs...
- Denodo enables the use of all of them together, plus other data sources:
  - In a logical data warehouse
  - In a logical data lake
  - They are very similar; the only difference is in the main objective
- There are also use cases where Denodo can be used as a data source in an ETL flow
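To make the idea concrete, here is a minimal Python sketch of what a client query against such a virtualization layer might look like, assuming the layer exposes a standard ODBC endpoint; the DSN and the view and column names are hypothetical, and a real Denodo deployment would have its own connection details.

# Minimal sketch: querying a logical data lake through a data
# virtualization layer, assuming it exposes an ODBC endpoint.
# The DSN and the view/column names are hypothetical.
import pyodbc

# One connection to the virtualization layer instead of one per backend.
conn = pyodbc.connect("DSN=virtual_data_lake")  # hypothetical DSN
cursor = conn.cursor()

# A single logical query can join rows that physically live in a
# warehouse and in a data lake; the virtualization layer, not the
# client, resolves where each table actually resides.
cursor.execute("""
    SELECT c.region, SUM(s.amount) AS total_sales
    FROM customers_warehouse AS c
    JOIN sales_events_lake AS s ON s.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, total_sales in cursor.fetchall():
    print(region, total_sales)
conn.close()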
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... (Geoffrey Fox)
Keynote at the Sixth International Workshop on Cloud Data Management (CloudDB 2014), Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC-Apache integration is particularly important: file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, and workflow and monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... (Sarah Anna Stewart)
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
What is eScience, and where does it go from here? (Daniel S. Katz)
eScience has evolved from focusing on global scientific collaborations enabled by distributed computing infrastructure to emphasizing joint advances in digital infrastructure and how that infrastructure enables new research. This symbiotic relationship between research and infrastructure development could be called Research and Infrastructure Development Symbiosis (RaIDS). Going forward, RaIDS conferences should focus on improving communication between infrastructure developers and researchers to facilitate new collaborations, ensure research publications appropriately attribute enabling infrastructure advances, and standardize catalogs of available infrastructure and research challenges.
This presentation was provided by Karen Baker, University of Illinois - Urbana-Champaign, during a NISO Virtual Conference on the topic of data curation, held on Wednesday, August 31, 2016
This document provides an overview of big data and how to get started with it. It introduces key concepts like what big data is, the different technology choices available and how to make an impact with data science. Specific topics covered include Hadoop and NoSQL databases, challenges of big data, sample use cases like customer churn analysis and the Expedia case study. The presentation emphasizes that big data is an evolving field and recommends taking a scientific approach to data analysis to drive business insights and impact.
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
The document discusses the Materials Genome Initiative (MGI) and the High-Throughput Experimental Materials Collaboratory (HTE-MC). It describes NIST's role in supporting MGI through developing a materials innovation infrastructure. It outlines the vision for HTE-MC, which would integrate high-throughput synthesis and characterization tools across multiple institutions through a shared network and data management platform. This would provide broader access to experimental facilities and materials data to support accelerated materials discovery. A workshop was held in 2018 to discuss establishing the HTE-MC concept and defining its technical, operational and business models.
Similar to 2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Ridge National Laboratory) (20)
ODIN Final Event - Publishing and citing, and the role of persistent identifiers (datacite)
Sünje Dallmeier-Tiessen, CERN
Presentation delivered at the ODIN Final Event in Amsterdam (Netherlands) on Wednesday, September 24, 2014: ORCID and DataCite: Towards Holistic Open Research.
More info: www.odin-project.eu
ODIN Final Event - Submission to datacentres (datacite)
Sergio Ruiz, DataCite
ODIN Final Event - Supporting the research lifecycle: Discovery and Analysis (datacite)
Rachael Kotarski, The British Library
ODIN Final Event - The Care and Feeding of Scientific Data (datacite)
Mercè Crosas (@mercecrosas), Director of Data Science, IQSS, Harvard University
2013 DataCite Summer Meeting - Thomson Reuters Data citation index cooperatio... (datacite)
2013 DataCite Summer Meeting - Closing Keynote: Building Community Engagement... (datacite)
2013 DataCite Summer Meeting - Elsevier's program to support research data (H... (datacite)
2013 DataCite Summer Meeting - Out of Cite, Out of Mind: Report of the CODATA... (datacite)
2013 DataCite Summer Meeting - Update on Force 11 and the Amsterdam manifesto... (datacite)
This document summarizes the process undertaken by the Data Citation Synthesis Group to develop a consensus set of principles for data citation. The group was formed in response to multiple organizations developing similar sets of principles. It brought together 36 members from around 20 organizations to review 4 existing sets of data citation principles over 3 months of weekly meetings. They merged the principles into a single synthesis set of 8 high-level, simple principles for data citation. The principles address the importance of data citation; credit and attribution for data contributors; use of data citations as evidence; use of persistent and unique identifiers; access to data and metadata; ensuring identifier and metadata persistence beyond the data lifespan; accommodating versioning and granularity of data; and ensuring interoperability and flexibility.
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (... (datacite)
Michael Witt presented on the Purdue University Research Repository (PURR) at the DataCite summer meeting. PURR is a collaborative effort between Purdue University Libraries, Office of the Vice President for Research, and Information Technology. It provides researchers a space to store, share, and publish research data, with librarian support for data management plans and curation. PURR aims to encourage citation of datasets by assigning identifiers, displaying licenses, providing citation examples, and exposing structured citations. It is built on open source HUBzero software and has over 1,000 registered researchers sharing data across 200 projects.
2013 DataCite Summer Meeting - California Digital Library (Joan Starr - Calif... (datacite)
2013 DataCite Summer Meeting - Opening Keynote: A short history of the Higgs ... (datacite)
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
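To illustrate the "programming, not prompting" idea, here is a minimal DSPy-style sketch in Python. It assumes the dspy package with an OpenAI-compatible model; constructor names vary somewhat across DSPy releases (older versions use dspy.OpenAI and dspy.settings.configure), so treat this as illustrative rather than definitive.

# Minimal sketch of "programming, not prompting" with DSPy.
# Assumes the dspy package and an OpenAI-compatible model.
import dspy

# Configure the underlying language model once, globally.
lm = dspy.LM("openai/gpt-4o-mini")  # assumed model identifier
dspy.configure(lm=lm)

# Declare WHAT the module should do via a signature; DSPy handles
# the HOW (prompt text, demonstrations) and can optimize it later.
qa = dspy.ChainOfThought("question -> answer")

result = qa(question="Why are DOIs useful for supercomputing datasets?")
print(result.answer)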
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
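Tying several of the topics above together, here is a minimal Python sketch of the monitoring path: each anomaly score is published to Kafka and surfaced as a Prometheus metric. It assumes the kafka-python and prometheus_client packages; the broker address, topic name, and threshold are hypothetical and not taken from the presentation.

# Minimal sketch of the monitoring path described above: publish each
# anomaly score to Kafka and expose a counter for Prometheus to scrape.
# Assumes kafka-python and prometheus_client; the broker address,
# topic name, and threshold below are hypothetical.
import json
from kafka import KafkaProducer
from prometheus_client import Counter, start_http_server

ANOMALIES = Counter("anomalies_total", "Number of detected anomalies")

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def report(device_id: str, score: float, threshold: float = 0.9) -> None:
    """Send the score to Kafka; count it if it crosses the threshold."""
    producer.send("edge-anomaly-scores", {"device": device_id, "score": score})
    if score >= threshold:
        ANOMALIES.inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    report("edge-device-01", 0.97)
    producer.flush()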
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to production.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution.
The narrative then shifts to a captivating exploration of prominent desktop OSs: Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape.
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin... (Tatiana Kojar)
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf (flufftailshop)
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English translation of a presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
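As a rough illustration of the embedding-storage side of such a setup, here is a minimal pymilvus sketch using the lightweight file-backed Milvus Lite mode; the collection name, dimension, and vectors are hypothetical, and this is not Secludy's actual pipeline.

# Minimal sketch of storing and searching embeddings with Milvus,
# using pymilvus's file-backed mode (Milvus Lite). The collection
# name, dimension, and vectors below are hypothetical.
from pymilvus import MilvusClient

client = MilvusClient("demo_embeddings.db")  # local, file-backed instance
client.create_collection(collection_name="synthetic_docs", dimension=4)

# In a real pipeline these vectors would come from an embedding model.
client.insert(
    collection_name="synthetic_docs",
    data=[
        {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4]},
        {"id": 2, "vector": [0.4, 0.3, 0.2, 0.1]},
    ],
)

# Nearest-neighbor search over the stored embeddings.
hits = client.search(
    collection_name="synthetic_docs",
    data=[[0.1, 0.2, 0.3, 0.4]],  # query vector
    limit=1,
)
print(hits)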
Letter and Document Automation for Bonterra Impact Management (fka Social Sol... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Ridge National Laboratory)
1. DOIs and Supercomputing
DataCite Summer 2013 Meeting
Terry Jones, Sudharshan Vazhkudai, Doug Fuller
Oak Ridge National Laboratory
2. Why Supercomputers!?
Because Innovation Drives The Economy...
• Over the last 5 years, 38% of the international innovation "R&D 100" awards went to US National Labs
[Bar chart: R&D 100 awards to US National Labs by year, 2009-2013]
• This was done with YOUR tax money
• Ideas shape the course of history - John Maynard Keynes
• The central goal of economic policy should be to spur higher productivity through greater innovation - Joseph Schumpeter's Innovation Economics
3. Why Supercomputers!? (part 2)
...And in 2013, Supercomputers Drive Innovation
Computers have changed the way we conduct experiments. Given enough computer power, we can perform accurate experiments more quickly, more cheaply, and often with greater control.
4. DataCite Summer 2013 / Washington DC
The New Laboratory:
High-Performance Computing yields breakthroughs
$$ H = -\sum_{i=1}^{n} \frac{\hbar^2}{2 m_i} \nabla_i^2 \;-\; \sum_{i \neq j}^{n} \frac{e_i e_j}{r_{ij}} $$
5. DataCite Summer 2013 / Washington DC
Big Problems Require Big Solutions
Energy
Healthcare
Competitiveness
OLCF resources are available to academia and industry through open, peer-reviewed allocation mechanisms.
6. DataCite Summer 2013 / Washington DC
DOE Office of Science HPC User Facilities
• High Performance Production Computing for the Office of Science
  – Characterized by a large number of projects (over 400) and users (over 4800)
• Leadership Computing for Open Science
  – Characterized by a small number of projects (about 50) and users (about 800) with computationally intensive projects
• Linking it together – ESnet
• Investing in the future – R&E Prototypes
[Diagram: DOE Office of Science HPC user facilities connected by ESnet – Titan at ORNL (#2), Mira at ANL (#5), Hopper at LBNL (#24); rankings as of June 2013]
8. DataCite Summer 2013 / Washington DC
With Big Computations Comes Big Data
• DOE HPC User Facilities produce enormous volumes of data
• Each User Facility has tertiary (archival) storage, often HPSS
  – statistics for one such computer center pictured here
• In addition, each center provides secondary storage
  – for example: a 10PB Lustre parallel file system
9. DataCite Summer 2013 / Washington DC
Oak Ridge Leadership Computing Facility (OLCF) – A Leading DOE User Facility
• Part of a collaborative DOE Office of Science program at ORNL and ANL
• Mission: provide the computational and data resources required to solve the most challenging problems.
• Access to the most powerful computer in the world for open access computing (Titan)
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive 10x to 100x more resource than at other generally available centers.
• OLCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).
10. DataCite Summer 2013 / Washington DC
We have increased our system capability by 10,000 times since 2004
• Strong partnerships with supercomputer vendors.
• LCF users employ large portions of the machine for large fractions of time.
• Strong partnerships with our users to scale codes and algorithms.
11. DataCite Summer 2013 / Washington DC
OLCF Future (Based On Extrapolation)
[Roadmap: Jaguar, 2.3 PF leadership system for science (2009) → Titan (OLCF-3), 10–20 PF leadership system (2012) → OLCF-4, 100–250 PF (2016) → OLCF-5, 1 EF (2019)]
• Computer system performance increases through parallelism
  – Clock speed trend flat to slower over coming years
  – In the last 28 years, systems have scaled from 64 cores to ~300,000
  – Applications must utilize all inherent parallelism
• Our compute and data resources have grown 10,000x over the decade, are in high demand, and are effectively used.
12. DataCite Summer 2013 / Washington DC
The Data Deluge
2013 4PB disk & 34PB tape [Titan]
2017 64PB disk & 600PB tape [Coral]
2021 1EB disk & 10EB tape (?)
• Key Challenge: Make Sense of So Much Data
• We'll Need Better Tools
• If "many hands make light work," how can we enable more people to make sense of the data? (A quick check of the growth implied by the figures above follows.)
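A quick, purely illustrative check of what those storage projections imply. The numbers are the slide's own; the decimal unit convention (1 EB = 1000 PB) is an assumption:

```python
# Illustrative arithmetic only: compound growth implied by the slide's
# own disk and tape projections (decimal units assumed: 1 EB = 1000 PB).
PB, EB = 1, 1000

projections = {
    "disk": {2013: 4 * PB, 2017: 64 * PB, 2021: 1 * EB},
    "tape": {2013: 34 * PB, 2017: 600 * PB, 2021: 10 * EB},
}

for tier, sizes in projections.items():
    years = sorted(sizes)
    for a, b in zip(years, years[1:]):
        factor = sizes[b] / sizes[a]
        annual = factor ** (1 / (b - a))  # compound annual growth rate
        print(f"{tier} {a}->{b}: {factor:.0f}x total, ~{annual:.1f}x/year")
```

Every tier roughly doubles each year under these projections, which is the backdrop for the "better tools" challenge above.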
13. DataCite Summer 2013 / Washington DC
What Breakthroughs Are We Missing?
• HPC will remain important to scientific discovery
  – Important for climate, material science, energy security
• Today, the state of the art is (still!) bibliographic publications
• But the gains from bibliographic sharing are limited
  – Constraints on paper length
  – Limited focus of paper
  – Limited ability to convey with graphs, figures, tables
• Urgently needed: a quick way to "enable" data
14. DataCite Summer 2013 / Washington DC
New External Drivers for Supercomputing Centers
• The push is on to squeeze more results from High-Performance Computing
  – Scientists have difficulty replicating (or even understanding) others' results
  – Taxpayers want more openness
  – The Holdren memo
15. DataCite Summer 2013 / Washington DC
Our Response: Make Supercomputer-Produced Data As Widely Available As Possible
• DOIs provide the necessary mechanism & implementation
• Makes sense for OLCF (uniquely qualified for 100TB datasets)
• Will benefit from DataCite's integration with Thomson Reuters' Data Citation Index and other services.
• Already successful for sensor-driven research, as at NASA
• As research goes forward, the project Principal Investigator stores "appropriate data"
  – Presumably, if data can support a bibliographic result (graph, figure, table), the data is worth curating.
• After curation, the data is available to the entire scientific community
  – Helps OLCF with "research tracking"
  – Helps OLCF with "reporting to sponsors"
  – Helps OLCF resolve data disposition questions
  – All the traditional benefits to researchers
16. DataCite Summer 2013 / Washington DC
DOI Benefits
From the User's Perspective, DOIs can:
• Identify & cite key data products of interest and value, and annotate them.
• Safely share data with collaborators even before publishing the result in a scientific communication.
• Let future data analyses easily feed off of the data products, fostering a highly dynamic and collaborative environment.
• Preserve data products for the longer term, well beyond the expiration of their projects at the centers.
• Satisfy requirements from funding agencies on data management plans in terms of long-term preservation, sharing, and dissemination of research results.
From the Center's Perspective, DOIs can:
• Help with research tracking and identifying the major results coming out of a project allocation on the center's resources.
• Aid in reporting to sponsors.
• Since DOIs also capture some basic metadata along with the index, help the center answer questions on the disposition of the data, and search and discover it.
• Provide a tool to cull the data holdings, and tangible policies to offer users for long-term data preservation.
• Evolve to support "data-only" users through data science tools such as DOIs.
• Provide an opportunity for our center to distinguish itself from other centers (they have the best data tools).
From the Sponsor's Perspective, DOIs can:
• Enable more value for the dollar spent: in addition to software tools, research artifacts, and papers, there is now a new entity, the citable data product.
• Bring the added benefit of seeing data sharing flourish within the community, and more data analyses spawned from the data products.
• Give both the users and the centers that the sponsor funds rich tools for data management.
• Support better utilization of HPC center resources.
(A minimal sketch of the metadata such a DOI carries follows below.)
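As a concrete illustration of the "basic metadata" point above: the record behind a dataset DOI is small and structured. The field names below follow DataCite's mandatory metadata properties, but the values, the helper function, and the dataset itself are all hypothetical — a sketch, not OLCF's actual implementation:

```python
# A minimal, hypothetical DataCite-style metadata record for a dataset DOI.
# Field names mirror the DataCite schema's mandatory properties; every
# value below is invented for illustration.

dataset_record = {
    "identifier": {"identifierType": "DOI", "value": "10.xxxx/example-dataset"},
    "creators": [{"creatorName": "Example, Researcher"}],
    "titles": [{"title": "Simulation snapshot data (illustrative)"}],
    "publisher": "Oak Ridge Leadership Computing Facility",
    "publicationYear": "2013",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

def landing_page_url(record):
    """A DOI resolves to a landing page describing the data — the
    landing-page philosophy noted in the editor's notes below."""
    return "https://doi.org/" + record["identifier"]["value"]

print(landing_page_url(dataset_record))
```

This is what lets the center search, discover, and answer disposition questions without touching the (possibly 100TB) dataset itself.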
17. DataCite Summer 2013 / Washington DC
Workflow for DOI Creation
1. User creates data
2. User requests DOI
3. ORNL requests DOI
4. OSTI provides DOI
5. DOI stored at data portal
6. Request permanent data copy
7. Data migrated to archive
8. Archive success response
9. DOI success response
(This flow is sketched as code below.)
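Read as a protocol, steps 1–5 mint and register the DOI while steps 6–9 secure the permanent archival copy. A minimal sketch of that sequencing in Python; every class and method name here is invented for illustration and does not correspond to any real ORNL or OSTI interface:

```python
# Hypothetical sketch of the nine-step DOI-creation workflow above.
# All names are invented; the stubs only make the ordering explicit.
import itertools

class Osti:
    """Stub for the DOI-granting authority (steps 3-4)."""
    _counter = itertools.count(1)

    def provide_doi(self, metadata):
        return f"10.xxxx/olcf.{next(self._counter)}"  # placeholder DOI

class Archive:
    """Stub for tertiary (HPSS-style) archival storage (steps 7-8)."""
    def __init__(self):
        self.holdings = {}

    def migrate(self, doi, dataset):
        self.holdings[doi] = dataset  # step 7: data migrated to archive
        return True                   # step 8: archive success response

class DataPortal:
    """Stub for the center's data portal (steps 3, 5, 6, 9)."""
    def __init__(self, osti, archive):
        self.osti, self.archive, self.dois = osti, archive, {}

    def mint_doi(self, dataset, metadata):
        doi = self.osti.provide_doi(metadata)       # steps 3-4
        self.dois[doi] = metadata                   # step 5: DOI stored at portal
        if not self.archive.migrate(doi, dataset):  # steps 6-7
            raise RuntimeError("archival copy failed; DOI not confirmed")
        return doi                                  # step 9: success response

portal = DataPortal(Osti(), Archive())
# Step 1: the user creates data; step 2: the user requests a DOI for it.
doi = portal.mint_doi(dataset=b"...simulation output...",
                      metadata={"title": "hypothetical dataset"})
print("minted:", doi)
```

The design point the workflow encodes is that the DOI success response (step 9) comes back only after the archive has confirmed the permanent copy (step 8).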
18. DataCite Summer 2013 / Washington DC
Workflow for DOI Data Retrieval
1. User provides search criteria
2. Matches found via metadata
3. User identifies needed data
4. Request data subset
5. Data migrated for upload
6. User retrieves data
(This flow is sketched as code below.)
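The retrieval side reduces to a metadata search followed by staging a (possibly subset) copy out of the archive. Again a hypothetical sketch under the same caveats, with an in-memory dict standing in for the center's tape and disk tiers:

```python
# Hypothetical sketch of the six-step DOI data-retrieval workflow above.
# The in-memory "archive" stands in for real storage tiers; names invented.
archive = {
    "10.xxxx/olcf.1": {"metadata": {"field": "fusion", "year": 2013},
                       "data": b"...simulation output..."},
    "10.xxxx/olcf.2": {"metadata": {"field": "climate", "year": 2013},
                       "data": b"...model run..."},
}

def search(criteria):
    """Steps 1-2: match user-supplied criteria against stored metadata."""
    return [doi for doi, rec in archive.items()
            if all(rec["metadata"].get(k) == v for k, v in criteria.items())]

def retrieve(doi, subset=slice(None)):
    """Steps 4-6: a subset is requested, staged (here: sliced and copied)
    for upload, then handed back to the user."""
    return archive[doi]["data"][subset]

matches = search({"field": "climate"})            # steps 1-2
print("matches:", matches)                        # step 3: user picks one
print(retrieve(matches[0], subset=slice(0, 12)))  # steps 4-6
```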
19. DataCite Summer 2013 / Washington DC
Some Challenges Are Expected
• How will permanent data storage be funded?
  – Projects last 3 years.
• Researchers are affiliated with institutions that have their own data policies.
  – For example, the Princeton Plasma Physics Lab may have policies affecting how we can support its fusion projects.
• Some fields will require effort to make their data "portable" for a wide audience.
  – Astrophysics has a standard file format; Fusion does not.
• Developing good metadata is a human-intensive effort
  – Getting PIs to provide the metadata
  – Looking to OSTI & DataCite for some help with DOI Q&A
20. DataCite Summer 2013 / Washington DC
…More Challenges
• What about authenticated access to data? Or malicious users in general...
• What about the long-term QA aspects of maintaining data?
• What about the logistics of very large data?
  – Staging
  – Retrieving huge files (can't all be kept on disk)
[Graphic: "Where's The Data?"]
21. DataCite Summer 2013 / Washington DC
Current Project Status
• Provided a DOI recommendation for the Center
  – Pros and cons
  – Long-term implications
• Designed the workflow
• Created infrastructure to support the workflow
  – Frontend infrastructure for storing & DOI association
  – Backend infrastructure for search & retrieval
• Having conversations with a few selected HPC user communities:
1. Astrophysics
2. Groundwater Simulation
3. Climate
4. Turbulence
5. Fusion
22. DataCite Summer 2013 / Washington DC
Summary
• High Performance Computing & data are integral to scientific discovery
• Bibliographic publications cannot contain the wealth of insight available in the raw data
• ORNL is leading an effort to make HPC data available to all with DOIs
• In the future, "publish" to a scientist will probably refer to obtaining a DOI for a supercomputer dataset
23. DataCite Summer 2013 / Washington DC
Acknowledgements
ā¢ OLCF DOI Team
ā Sudharshan Vazhkudai
ā Doug Fuller
ā Terry Jones
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
ā¢ OSTI Support
ā Mark Martin
ā Jannean Elliott
ā¢ ORNL Support
ā Jack Wells
ā Giri Palanisamy
ā John Cobb
ā Stan White
26. DataCite Summer 2013 / Washington DC
How Does The OLCF Compare With Other Centers?
Four of six SC13 Gordon Bell finalists used Titan:
• High-Temperature Superconductivity: "Taking a Quantum Leap in Time to Solution for Simulations of High-Tc Superconductors" – Peter Staar, ETH Zurich (Titan)
• Biofluidic Systems: "19 Petaflops Simulation of Protein Suspensions in Crowding Conditions" – Massimo Bernaschi, ICNR-IAC Rome (Titan)
• Plasma Physics: "Radiative Signatures of the Relativistic Kelvin-Helmholtz Instability" – Michael Bussmann, HZDR Dresden (Titan)
• Cosmology: "HACC: Extreme Scaling and Performance Across Diverse Architectures" – Salman Habib, ANL (Sequoia, Mira, Titan)
27. DataCite Summer 2013 / Washington DC
The New Laboratory (continued):
High-Performance Computing is widely applicable
Editor's Notes
Scientific breakthroughs change our lives:
• Explained photosynthesis. Ever wonder how plants turn sunlight into energy? A National Lab scientist determined the path of carbon through photosynthesis, a scientific milestone that illuminated one of life's most important processes. Today, this work allows scientists to explore how to derive sustainable energy sources from the sun.
• Made refrigerators cool. Next-generation refrigerators will likely put the freeze on harmful chemical coolants in favor of an environmentally friendly alloy, thanks to National Lab scientists.
• Brought safe water to millions. Removing arsenic from drinking water is a global priority. A long-lasting particle engineered at a National Lab can now do exactly that, making contaminated water safe to drink. Another technology developed at a National Lab uses ultraviolet light to kill microbes that cause water-borne diseases such as dysentery. This process has reduced child mortality in the developing world.
• Put the digital in DVDs. The optical digital recording technology behind music, video, and data storage originated at a National Lab nearly 40 years ago.
• Tamed hydrogen with nanoparticles. To replace gasoline, hydrogen must be safely stored and easy to use, but this has proved elusive. National Lab researchers have now designed a new pliable material using nanoparticles that can rapidly absorb and release hydrogen without ill effects, a major step in making fuel-cell powered cars a commercial reality.
Exabyte comes after petabyte; then zettabyte, then yottabyte.
In May, an OMB memo and an Executive Order were released in support of the Holdren memo
Opens the door to other vast communities (as evidenced by the wide-ranging audience at this meeting)
Previously, users did not have a tool to identify what is important to them, which resulted in indiscriminately storing all intermediate snapshot data from scratch storage into archival storage. With DOIs, however, there is now a means to identify datasets of value, which may change this user behavior and result in manageable data sizes. This has ramifications for the provisioning of center storage resources.
Tie-in to DataCite attendees: one thing we liked about the DataCite philosophy that will help us is the landing-page approach (anyone can go to the landing page). Some data could be embargoed (but available to others later).