Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support

A presentation given at AGU2013

Published in: Technology

  • 1. Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support. Jim Myers [1], Margaret Hedstrom [1], Beth A Plale [2], Praveen Kumar [3], Robert McDonald [4], Rob Kooper [5], Luigi Marini [5], Inna Kouper [4], Kavitha Chandrasekar [4]. [1] School of Information, University of Michigan, Ann Arbor, MI, United States. [2] School of Informatics and Computing, Indiana University, Bloomington, IN, United States. [3] Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. [4] Data To Insight Center, Indiana University, Bloomington, IN, United States. [5] National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
  • 2. Overview • Technological advances are making it ever easier to move computation, data, and metadata around • With decreasing costs and increasing recognition of the value of data re-use, many organizations are exploring their role in data curation/preservation • If we look at the nature of the problem: – How should data be curated to scalably support research? • Lifecycle approaches to manage value-defined research objects – Can we do it? • SEAD as an end-to-end demonstration… – What organization(s) are best positioned/most capable of leading/providing such services long-term? • Primary research organizations have a combination of capability, motivation, and long-term commitment.
  • 3. Technology – the world is flat • Today’s researchers can employ computing and data resources from anywhere, using scalable search technologies … Enough said.
  • 4. Data as a key resource, Big Data • Data is increasingly recognized as valuable beyond its initial use: – Data reproducibility – Re-analysis – Reference Data – Data mining/machine learning/… – NSF Data plan requirement – Paper publication with data requirements – Community and institutional collections growing
  • 5. Data Publication today • Data cited in papers (to limited depth) • Project file archives (large, limited description, gray/dark) • Reference/analytical data (standardized content, limited breadth) • Historical collections (temporal breadth, limited numbers) – Do any of these solve the problem?
  • 6. Researchers think, and work, like this: • Multi – Disciplinary – Format – Model – Semantics – Location
  • 7. and this: – Raw and derived data – ~5 levels of quality, processing, maturity – Observations, calibrations, experiments, models, statistical ensembles, … – Also organized by location, time, variables, technique, creator, project, provenance, … – Large amount of reference information from external sources (e.g. NASA) – Evidence for ‘nonorthogonal’ subcollections
  • 8. What’s Really Needed? Scalable Research Productivity Requires: • A way to – Store what you want – Reference what you want – Organize how you want (search, filter, tag, collect) • At the scale, and level of detail/richness, you want • When you figure that out • In a way that is self-describing/high-fidelity across applications and owners • In the vocabularies and formats you find efficient • Beyond the lifetime of individual/project interest • For active use and external credit • With minimal training/IT support required.
  • 9. How can we approach magic? • Global identifiers – data, terms, metadata • Content management abstractions (blob + type + metadata) • Service architectures and automated processing (conversion, preview, extraction, derivation, cataloging, …) • Applications that share these abstractions – write what you know, display/ignore what you don’t • Research Object management (structured, interrelated collections) • Web 2.0, Web 3.0, + explicit context management …
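
The "blob + type + metadata" abstraction and the "write what you know, display/ignore what you don't" rule from slide 9 can be illustrated with a minimal Python sketch. This is not SEAD's implementation: the `ContentItem` class, the `make_item` and `view` helpers, the `urn:example:` identifier scheme, and the `http://example.org/terms/sensorDepth` term are all hypothetical names chosen for illustration. Only the Dublin Core `creator` term is a real vocabulary URI.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class ContentItem:
    """A blob, a type, and open-ended metadata under one global identifier."""
    identifier: str   # global ID (hypothetical urn:example: scheme)
    media_type: str   # the "type" part of blob + type + metadata
    blob: bytes       # the raw content
    metadata: dict = field(default_factory=dict)  # term URI -> list of values

    def annotate(self, term: str, value: str) -> None:
        # "Write what you know": any application may add any term.
        self.metadata.setdefault(term, []).append(value)


def make_item(blob: bytes, media_type: str) -> ContentItem:
    # Assumption: derive the identifier from a content hash.
    digest = hashlib.sha256(blob).hexdigest()
    return ContentItem(f"urn:example:sha256:{digest}", media_type, blob)


def view(item: ContentItem, known_terms: set) -> dict:
    # "Display/ignore what you don't": keep only the terms this app understands.
    return {t: v for t, v in item.metadata.items() if t in known_terms}


item = make_item(b"depth_m,temp_c\n1.0,12.5\n", "text/csv")
item.annotate("http://purl.org/dc/terms/creator", "Example Researcher")
item.annotate("http://example.org/terms/sensorDepth", "1.0")

# A catalog application that only understands the Dublin Core creator term.
catalog_view = view(item, {"http://purl.org/dc/terms/creator"})
```

Because every term is itself a global identifier, the second annotation survives untouched in the item even though the catalog application never displays it.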
  • 10. SEAD: Sustainable Environment Actionable Data • An NSF DataNet project started in October 2011 • An international resource for sustainability science • A provider of light-weight Data Services based on novel technical and business approaches: – Supporting the long-tail of research – Enabling active and social curation – Providing integrated lifecycle support for data • Margaret Hedstrom (PI), Praveen Kumar (co-PI), Jim Myers (co-PI), Beth Plale (co-PI)
  • 11. SEAD is: • Data discovery • Project workspaces • A data-aware community network • Curation and preservation services that link to multiple archives and discovery services
  • 12. SEAD is: • An active repository that creates data pages with – Previews – Extracted Metadata – Overlays – Tags – Comments – Provenance – Use information – Download/Embed
  • 13. SEAD is: • A tool for community exploration: – Personal and Project Profiles – Publications and Data Citations – Co-author, co-investigator graphs – Temporal analysis
  • 14. SEAD is: • Curation and Preservation Services: – Research Object management – ID assignment – Matchmaking to long-term repositories – Citation Generation – Catalog Registration – Discovery services • SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.
  • 15. – Apps read what they need and write what they know – Curation snapshots meaningful Research Objects – Multiple ROs can be defined/managed re-using the same underlying ‘living’ content – The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level • [Diagram labels: Flickr-style web management of data; Sensor data; Semantic Content Middleware over Scalable File System and Triple Store; Geospatial, social network mash-ups, workflows and services; Curation Services to harvest and package specific data sets; Federation of OAI repositories for long-term preservation]
  • 16. Key Points • Research Objects have meaning/value but data comes in smaller chunks • Research Objects are not orthogonal, but individual data sets/files are • Lifecycle approaches for datasets are becoming possible • Managing intermixed ROs is the problem that needs to be tackled to meet the research community’s needs • Research Data Alliance (RDA) can help drive standardization/scaling
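
The idea in slides 15 and 16, that several non-orthogonal Research Objects can share the same underlying 'living' content while curation takes fixed snapshots, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResearchObject` and `Snapshot` classes and the item IDs (`raw-001`, `cal-001`, `model-001`) are hypothetical and are not taken from SEAD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """An immutable record of which items an RO contained at curation time."""
    name: str
    members: frozenset


class ResearchObject:
    """A named collection that references item IDs instead of copying files."""

    def __init__(self, name: str):
        self.name = name
        self.members = set()

    def add(self, *item_ids: str) -> None:
        self.members.update(item_ids)

    def snapshot(self) -> Snapshot:
        # Freeze the current membership; the live RO can keep changing.
        return Snapshot(self.name, frozenset(self.members))


# Two ROs defined over the same pool of individual (orthogonal) files.
paper = ResearchObject("paper-dataset")
paper.add("raw-001", "cal-001")
site = ResearchObject("site-collection")
site.add("raw-001", "model-001")

# ROs are not orthogonal: the same item belongs to both.
overlap = paper.members & site.members

# Curation snapshots the RO, then the living RO continues to evolve.
published = paper.snapshot()
paper.add("model-001")
```

Here `overlap` contains the shared item, and `published` keeps the two-item membership it had at snapshot time even after the live RO grows.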
  • 17. What will drive research data preservation? • The most valuable data service(s) are active/actionable research service(s)… – The ability to define Research Objects is more important than any given RO • Led by research organizations as part of their long-term mission? – The only organizations with the focus, scope, and scale to solve the whole problem (end-to-end research productivity)
  • 18. Acknowledgements • SEAD Team @ UM, UI, IU • NSF • NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers • and Thank You! … stop by the SEAD booth and share your thoughts!