Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support
Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert
McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4
1 School of Information, University of Michigan, Ann Arbor, MI, United States.
2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States.
3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States.
4 Data To Insight Center, Indiana University, Bloomington, IN, United States.
5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
• Technological advances are making it ever easier to
move computation, data, and metadata around
• With decreasing costs and increasing recognition of the
value of data re-use, many organizations are exploring
their role in data curation/preservation
• If we look at the nature of the problem
– How should data be curated to scalably support research?
• Lifecycle approaches to manage value-defined research objects
– Can we do it?
• SEAD as an end-to-end demonstration…
– What organization(s) are best positioned/the most capable
of leading/providing such services long-term?
• Primary research organizations have a combination of capability,
motivation, and long-term commitment.
Technology – the world is flat
• Today’s researchers can employ computing
and data resources from anywhere, using
scalable search technologies …
Data as a key resource, Big Data
• Data is increasingly recognized as valuable
beyond its initial use:
– Data mining/machine learning/…
– NSF Data plan requirement
– Paper publication with data requirements
– Community and institutional collections growing
Data Publication today
• Data cited in papers (to limited depth)
• Project file archives (large, limited description,
• Reference/analytical data (standardized content,
• Historical collections (temporal breadth, limited
– Do any of these solve the problem?
Researchers think, and work, like this:
Raw and derived data
~5 levels of quality,
statistical ensembles, …
Also organized by
location, time, variables,
project, provenance, …
Large amount of
from external sources
Evidence for ‘nonorthogonal’ subcollections
What’s Really Needed?
Scalable Research Productivity Requires:
• A way to
– Store what you want
– Reference what you want
– Organize how you want (search, filter, tag, collect)
• At the scale, and level of detail/richness, you want
• When you figure that out
• In a way that is self-describing/high-fidelity across
applications and owners
• In the vocabularies and formats you find efficient
• Beyond the lifetime of individual/project interest
• For active use and external credit
• With minimal training/IT support required.
How can we approach magic?
• Global identifiers – data, terms, metadata
• Content management abstractions (blob + type +
• Service architectures and automated processing
(conversion, preview, extraction, derivation, cataloging,
• Applications that share these abstractions – write what
you know, display/ignore what you don’t
• Research Object management (structured, interrelated collections)
Web 2.0, Web 3.0, + explicit context management …
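The abstractions on this slide can be pictured with a minimal Python sketch. This is not SEAD's implementation: the `ContentItem` class, the `urn:uuid:` identifier scheme, the `annotate` and `display` helpers, and the choice of vocabulary URIs are all illustrative assumptions. The third component of the content abstraction is cut off in the slide text, so open-ended metadata is assumed here. Each item carries a global identifier, an opaque blob, a type, and metadata keyed by globally identified terms; one application writes the terms it knows, and another displays the terms it understands while ignoring the rest.

```python
import uuid
from dataclasses import dataclass, field

# Vocabulary terms identified by global URIs (chosen only for illustration).
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
GEO_LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"


@dataclass
class ContentItem:
    """A blob plus a type plus open-ended metadata (assumed structure)."""
    blob: bytes
    mime_type: str
    metadata: dict = field(default_factory=dict)
    # Global identifier assigned at creation; the urn:uuid scheme is an assumption.
    item_id: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")


def annotate(item, known_terms):
    """An application writes only the terms it knows about."""
    item.metadata.update(known_terms)


def display(item, understood):
    """Return the terms an application understands; unknown terms are
    ignored for display but stay untouched on the item."""
    return {k: v for k, v in item.metadata.items() if k in understood}


item = ContentItem(blob=b"depth,flow\n1.2,3.4\n", mime_type="text/csv")
annotate(item, {DC_TITLE: "Stream gauge readings", DC_CREATOR: "Field team"})
annotate(item, {GEO_LAT: 40.11})  # written by a different, map-aware app

# A catalog app that only understands title and creator.
catalog_view = display(item, {DC_TITLE, DC_CREATOR})
```

Because unknown terms are preserved rather than dropped, the map-aware application's latitude survives a round trip through the catalog application, which is the property the "write what you know, display/ignore what you don't" bullet depends on.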
SEAD: Sustainable Environment Actionable Data
• An NSF DataNet project started in
• An international resource for
• A provider of light-weight Data Services
based on novel technical and business
– Supporting the long-tail of research
– Enabling active and social curation
– Providing integrated lifecycle support for data
Margaret Hedstrom, PI
Praveen Kumar, co-PI
Jim Myers, co-PI
Beth Plale, co-PI
• Data discovery
• Project workspaces
• A data-aware
• Curation and
that link to multiple archives and discovery
• An active repository that creates data pages with
• A tool for community exploration:
– Personal and
– Publications and
– Temporal analysis
• Curation and Preservation Services:
– Research Object
– ID assignment
– Matchmaking to
– Catalog Registration
– Discovery services
SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit
data from SEAD project repositories for long-term storage in SEAD-managed storage or
external institutional repositories and cloud
Apps read what they need and write what they know
Curation snapshots meaningful Research Objects
Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
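The relationship between Research Objects (ROs) and living content described above can be sketched in a few lines of Python. The sketch rests on stated assumptions and is not SEAD's design: `LivingStore`, `ResearchObject`, the per-item version lists, and the item names are all hypothetical. It shows two overlapping ROs defined over the same underlying items, with each RO pinned to the versions that existed when it was snapshotted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchObject:
    """An immutable snapshot: a name plus the (item id, version) pairs it contains."""
    name: str
    members: frozenset


class LivingStore:
    """Hypothetical store of 'living' content; each item keeps its version history."""

    def __init__(self):
        self._versions = {}  # item id -> list of content versions

    def put(self, item_id, content):
        self._versions.setdefault(item_id, []).append(content)

    def snapshot(self, name, item_ids):
        # Pin each member to its current version so later edits do not alter the RO.
        pinned = frozenset((i, len(self._versions[i]) - 1) for i in item_ids)
        return ResearchObject(name, pinned)

    def resolve(self, ro):
        # Look up the pinned version of every member of the RO.
        return {i: self._versions[i][v] for i, v in ro.members}


store = LivingStore()
store.put("raw-2012", "raw sensor readings")
store.put("model-run", "ensemble output v1")
store.put("figure-3", "plot of run v1")

# Two ROs that overlap on "model-run": they are not orthogonal.
paper_ro = store.snapshot("paper-dataset", ["model-run", "figure-3"])
project_ro = store.snapshot("project-archive", ["raw-2012", "model-run"])

store.put("model-run", "ensemble output v2")  # the living content moves on

shared = {i for i, _ in paper_ro.members} & {i for i, _ in project_ro.members}
```

Each RO stores only references and version pins, so defining another RO over the same items does not copy any content, and the snapshots keep resolving to the pinned versions after the living items change.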
Flickr-style web management of data
Semantic Content Middleware
over Scalable File System and
workflows and services
Curation Services to harvest
and package specific data sets
Federation of OAI
• Research Objects have meaning/value but data comes in
• Research Objects are not orthogonal, but individual data
• Lifecycle approaches for datasets are becoming possible
• Managing intermixed ROs is the problem that needs to be
tackled to meet the research community’s needs
• Research Data Alliance (RDA) can help drive
What will drive research data preservation?
• The most valuable data service(s) are
active/actionable research service(s)…
– The ability to define Research Objects is more
important than any given RO
• Led by research organizations as part of their
– The only organizations with the focus, scope, and
scale to solve the whole problem (end-to-end
• SEAD Team @ UM, UI, IU
• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other
• and Thank You!
… stop by the SEAD booth and share your thoughts!