Your SlideShare is downloading. ×
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the Most Out of Research Support
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support

255
views

Published on

A presentation given at AGU2013

A presentation given at AGU2013

Published in: Technology

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
255
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4 myersjd@umich.edu 1 School on Information, University of Michigan, Ann Arbor, MI, United States. School of Informatics and Computing, Indiana University, Bloomington, IN, United States. 3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. 4 Data To Insight Center, Indiana University, Bloomington, IN, United States. 5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States. 2
  • 2. Overview • Technological advances are making it ever easier to move computation, data, and metadata around • With decreasing costs and increasing recognition of the value of data re-use, many organization are exploring their role in data curation/preservation • If we look at the nature of the problem – How should data be curated to scalably support research? • Lifecycle approaches to manage value-defined research objects – Can we do it? • SEAD as an end-to-end demonstration… – What organization(s) are best positioned/the most capable of leading/providing such services long-term? • Primary research organizations have a combination of capability, motivation, and long-term commitment.
  • 3. Technology – the world is flat • Today’s researchers can employ computing and data resources from anywhere, using scalable search technologies … Enough said.
  • 4. Data as a key resource, Big Data • Data is increasingly recognized as valuable beyond its initial use: – – – – Data reproducibility Re-analysis Reference Data Data mining/machine learning/… – NSF Data plan requirement – Paper publication with data requirements – Community and institutional collections growing
  • 5. Data Publication today • Data cited in papers (to limited depth) • Project file archives (large, limited description, gray/dark) • Reference/analytical data (standardized content, limited breadth) • Historical collections (temporal breadth, limited numbers) - do any of these solve the problem?
  • 6. Researchers think, and work, like this: • Multi – Disciplinary – Format – Model – Semantics – Location
  • 7. and this – – – – – – Raw and derived data ~5 levels of quality, processing, maturity Observations, calibrations, experiments, models, statistical ensembles, … Also organized by location, time, variables, technique, creator, project, provenance, … Large amount of reference information from external sources (e.g. NASA) Evidence for ‘nonorthogonal’ subcollections
  • 8. What’s Really Needed? Scalable Research Productivity Requires: • A way to – store what you want – Reference what you want – Organize how you want (search, filter, tag, collect) • At the scale, and level of detail/richness, you want • When you figure that out • In a way that is self-describing/high-fidelity across applications and owners • In the vocabularies and formats you find efficient • Beyond the lifetime of individual/project interest • For active use and external credit • With minimal training/IT support required.
  • 9. How can we approach magic? • Global identifiers – data, terms, metadata • Content management abstractions (blob + type + metadata) • Service architectures and automated processing (conversion, preview, extraction, derivation, cataloging, …) • Applications that share these abstractions – write what you know, display/ignore what you don’t • Research Object management (structured, interrelated collections) Web 2.0, Web3.0, + explicit context management …
  • 10. SEAD: Sustainable Environment Actionable Data • An NSF DataNet project started in October, 2011 • An international resource for sustainability science • A provider of light-weight Data Services based on novel technical and business approaches: – Supporting the long-tail of research – Enabling active and social curation – Providing integrated lifecycle support for data http://sead-data.net/ Margaret Hedstrom, PI Praveen Kumar, co-PI Jim Myers, co-PI Beth Plale, co-PI
  • 11. SEAD is: • Data discovery • Project workspaces • A data-aware community network • Curation and preservation services that link to multiple archives and discovery services
  • 12. SEAD is: • An active repository that creates data pages with – – – – – – – – Previews Extracted Metadata Overlays Tags Comments Provenance Use information Download/Embed
  • 13. SEAD is: • A tool for community exploration: – Personal and Project Profiles – Publications and Data Citations – Co-author, co-investigator graphs – Temporal analysis
  • 14. SEAD is: • Curation and Preservation Services: – Research Object management – ID assignment – Matchmaking to long-term repositories Citation Generation – Catalog Registration SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long– Discovery services term storage in SEAD-managed storage or external institutional repositories and cloud data services.
  • 15. – – – – Apps read what they need and write what they know Curation snapshots meaningful Research Objects Multiple ROs can be defined/managed re-using the same underlying ‘living’ content The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level Flickr-style web management of data Sensor data Semantic Content Middleware over Scalable File System and Triple Store Geospatial, social network mash-ups, workflows and services Curation Services to harvest and package specific data sets Federation of OAI repositories for long-term preservation
  • 16. Key Points • Research Objects have meaning/value but data comes in smaller chunks • Research Objects are not orthogonal, but individual data sets/files are • Lifecycle approaches for datasets are becoming possible • Managing intermixed ROs is the problem that needs to be tackled to meet the research community’s needs • Research Data Alliance (RDA) can help drive standardization/scaling
  • 17. What will drive research data preservation? • The most valuable data service(s) are active/actionable research service(s)… – The ability to define Research Objects is more important than any given RO • Led by research organizations as part of their long-term mission? – The only organizations with the focus, scope, and scale to solve the whole problem (end-to-end research productivity)
  • 18. Acknowledgements • SEAD Team @ UM, UI, IU • NSF • NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers • and Thank You! … stop by the SEAD booth and share your thoughts! http://sead-data.net/