Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support

A presentation given at AGU2013

Published in: Technology

  1. Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support
     Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4
     myersjd@umich.edu
     1 School of Information, University of Michigan, Ann Arbor, MI, United States.
     2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States.
     3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States.
     4 Data To Insight Center, Indiana University, Bloomington, IN, United States.
     5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
  2. Overview
     • Technological advances are making it ever easier to move computation, data, and metadata around
     • With decreasing costs and increasing recognition of the value of data re-use, many organizations are exploring their role in data curation/preservation
     • If we look at the nature of the problem
       – How should data be curated to scalably support research?
         • Lifecycle approaches to manage value-defined research objects
       – Can we do it?
         • SEAD as an end-to-end demonstration…
       – What organization(s) are best positioned/the most capable of leading/providing such services long-term?
         • Primary research organizations have a combination of capability, motivation, and long-term commitment.
  3. Technology – the world is flat
     • Today’s researchers can employ computing and data resources from anywhere, using scalable search technologies … Enough said.
  4. Data as a key resource, Big Data
     • Data is increasingly recognized as valuable beyond its initial use:
       – Data reproducibility
       – Re-analysis
       – Reference Data
       – Data mining/machine learning/…
       – NSF Data plan requirement
       – Paper publication with data requirements
       – Community and institutional collections growing
  5. Data Publication today
     • Data cited in papers (to limited depth)
     • Project file archives (large, limited description, gray/dark)
     • Reference/analytical data (standardized content, limited breadth)
     • Historical collections (temporal breadth, limited numbers)
     Do any of these solve the problem?
  6. Researchers think, and work, like this:
     • Multi
       – Disciplinary
       – Format
       – Model
       – Semantics
       – Location
  7. and this
       – Raw and derived data
       – ~5 levels of quality, processing, maturity
       – Observations, calibrations, experiments, models, statistical ensembles, …
       – Also organized by location, time, variables, technique, creator, project, provenance, …
       – Large amount of reference information from external sources (e.g. NASA)
       – Evidence for ‘nonorthogonal’ subcollections
  8. What’s Really Needed?
     Scalable Research Productivity Requires:
     • A way to
       – Store what you want
       – Reference what you want
       – Organize how you want (search, filter, tag, collect)
     • At the scale, and level of detail/richness, you want
     • When you figure that out
     • In a way that is self-describing/high-fidelity across applications and owners
     • In the vocabularies and formats you find efficient
     • Beyond the lifetime of individual/project interest
     • For active use and external credit
     • With minimal training/IT support required.
  9. How can we approach magic?
     • Global identifiers – data, terms, metadata
     • Content management abstractions (blob + type + metadata)
     • Service architectures and automated processing (conversion, preview, extraction, derivation, cataloging, …)
     • Applications that share these abstractions – write what you know, display/ignore what you don’t
     • Research Object management (structured, interrelated collections)
     Web 2.0, Web 3.0, + explicit context management …
  10. SEAD: Sustainable Environment Actionable Data
     • An NSF DataNet project started in October 2011
     • An international resource for sustainability science
     • A provider of light-weight Data Services based on novel technical and business approaches:
       – Supporting the long tail of research
       – Enabling active and social curation
       – Providing integrated lifecycle support for data
     http://sead-data.net/
     Margaret Hedstrom, PI; Praveen Kumar, co-PI; Jim Myers, co-PI; Beth Plale, co-PI
  11. SEAD is:
     • Data discovery
     • Project workspaces
     • A data-aware community network
     • Curation and preservation services that link to multiple archives and discovery services
  12. SEAD is:
     • An active repository that creates data pages with
       – Previews
       – Extracted Metadata
       – Overlays
       – Tags
       – Comments
       – Provenance
       – Use information
       – Download/Embed
  13. SEAD is:
     • A tool for community exploration:
       – Personal and Project Profiles
       – Publications and Data Citations
       – Co-author, co-investigator graphs
       – Temporal analysis
  14. SEAD is:
     • Curation and Preservation Services:
       – Research Object management
       – ID assignment
       – Matchmaking to long-term repositories
       – Citation Generation
       – Catalog Registration
       – Discovery services
     SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.
  15. (Architecture diagram)
       – Apps read what they need and write what they know
       – Curation snapshots meaningful Research Objects
       – Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
       – The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
     Diagram labels:
       – Flickr-style web management of data
       – Sensor data
       – Semantic Content Middleware over Scalable File System and Triple Store
       – Geospatial, social network mash-ups, workflows and services
       – Curation Services to harvest and package specific data sets
       – Federation of OAI repositories for long-term preservation
  16. Key Points
     • Research Objects have meaning/value but data comes in smaller chunks
     • Research Objects are not orthogonal, but individual data sets/files are
     • Lifecycle approaches for datasets are becoming possible
     • Managing intermixed ROs is the problem that needs to be tackled to meet the research community’s needs
     • Research Data Alliance (RDA) can help drive standardization/scaling
  17. What will drive research data preservation?
     • The most valuable data service(s) are active/actionable research service(s)…
       – The ability to define Research Objects is more important than any given RO
     • Led by research organizations as part of their long-term mission?
       – The only organizations with the focus, scope, and scale to solve the whole problem (end-to-end research productivity)
  18. Acknowledgements
     • SEAD Team @ UM, UI, IU
     • NSF
     • NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers
     • and Thank You! … stop by the SEAD booth and share your thoughts! http://sead-data.net/
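
Illustrative sketch (not part of the original slides): the "blob + type + metadata" abstraction, the "write what you know, display/ignore what you don't" rule, and the idea that several Research Objects can be snapshotted over the same 'living' content can be pictured roughly as below. Everything here is an assumption made for illustration: the names `ContentItem`, `annotate`, `view`, and `snapshot`, the use of `urn:uuid` strings as global identifiers, and SHA-256 digests as the snapshot record. It does not describe SEAD's actual implementation or API.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Hypothetical content item: blob + type + metadata under a global ID."""
    blob: bytes
    media_type: str
    metadata: dict = field(default_factory=dict)
    # Assumption: a urn:uuid string stands in for a global identifier.
    item_id: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")


def annotate(item, **facts):
    """An app writes what it knows; keys written by other apps are untouched."""
    item.metadata.update(facts)


def view(item, known_keys):
    """An app displays the keys it understands and ignores the rest."""
    return {k: v for k, v in item.metadata.items() if k in known_keys}


def snapshot(name, items):
    """Freeze a Research Object: member IDs plus a digest of each blob."""
    members = {i.item_id: hashlib.sha256(i.blob).hexdigest() for i in items}
    return {"name": name, "members": members}


raw = ContentItem(b"t,flow\n0,1.2\n", "text/csv")
model = ContentItem(b"model output", "application/octet-stream")
notes = ContentItem(b"field notes", "text/plain")

annotate(raw, creator="field team", site="A")  # written by one app
annotate(raw, units="m3/s")                    # written by another app

# Two Research Objects that overlap on the same underlying item.
ro_paper = snapshot("paper-dataset", [raw, model])
ro_site = snapshot("site-A-collection", [raw, notes])
shared = set(ro_paper["members"]) & set(ro_site["members"])

# The content keeps living after the snapshot; the recorded digest does not.
raw.blob += b"1,1.4\n"
changed = hashlib.sha256(raw.blob).hexdigest() != ro_paper["members"][raw.item_id]
```

In this sketch the two Research Objects share `raw`, which is one way to read the slides' point that Research Objects are not orthogonal even though individual items are: curation records which items belong to which object at snapshot time, while the items themselves remain editable.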