Your SlideShare is downloading. ×
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value


Published on

A presentation given at AGU 2013

A presentation given at AGU 2013

Published in: Technology

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4 1 School on Information, University of Michigan, Ann Arbor, MI, United States. School of Informatics and Computing, Indiana University, Bloomington, IN, United States. 3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. 4 Data To Insight Center, Indiana University, Bloomington, IN, United States. 5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States. 2
  • 2. Outline • • • • • Quick Project Intro What is SEAD? (Stop by the SEAD booth!) Why is SEAD? How does SEAD work? Future active and social curation work
  • 3. SEAD: Sustainable Environment Actionable Data • An NSF DataNet project started in October, 2011 • An international resource for sustainability science • A provider of light-weight Data Services based on novel technical and business approaches: – Supporting the long-tail of research – Enabling active and social curation – Providing integrated lifecycle support for data Margaret Hedstrom, PI Praveen Kumar, co-PI Jim Myers, co-PI Beth Plale, co-PI
  • 4. Sustainability Research • Central to solving many of society’s most critical challenges • An exemplar of modern research – – – – Local processes aggregating to produce global consequences Multiple time scales Coupling of natural and human systems Interacting systems-of-systems requiring multidisciplinary understanding • Environmental – Economic - Social Science Cooperation Technology Policy Economics Poverty & Justice
  • 5. SEAD is: • Data discovery • Project workspaces • A data-aware community network • Curation and preservation services that link to multiple archives and discovery services
  • 6. SEAD is: • Secure project spaces where teams can: – Gather reference data – Upload and share new results – Annotate – Relate – Organize – Publish Project Dashboard
  • 7. SEAD is: • An active repository that creates data pages with – – – – – – – – Previews Extracted Metadata Overlays Tags Comments Provenance Use information Download/Embed
  • 8. SEAD is: • A tool for community exploration: – Personal and Project Profiles – Publications and Data Citations – Co-author, co-investigator graphs – Temporal analysis
  • 9. SEAD is: • A way to preprint and publish data: – Branded interface – Discovery metadata – Drill-down • Sub-collections • Data Pages – Submit for curation and preservation The National Center for Earth Surface Dynamics ~1.6 TB, 450K files (2.2 M objects) representing 10 years of research by multiple teams
  • 10. SEAD is: • A community platform for reference data: – Research Object management – Inference – Curation – Preservation – ID assignment – Catalog Registration – Discovery – Citation Generation SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for longterm storage in SEAD-managed storage or external institutional repositories and cloud data services.
  • 11. – – – – Apps read what they need and write what they know Curation snapshots meaningful Research Objects Multiple ROs can be defined/managed re-using the same underlying ‘living’ content The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level Flickr-style web management of data Sensor data Semantic Content Middleware over Scalable File System and Triple Store Geospatial, social network mash-ups, workflows and services Curation Services to harvest and package specific data sets Federation of OAI repositories for long-term preservation
  • 12. Why is SEAD needed for curation? • The nature of modern research • The nature of the data documentation problem • Artificial limitations derived from historical practice Unless these issues are addressed (in addition to sheer scale), data curation will remain too cumbersome and expensive for ubiquitous use…
  • 13. Data Challenges in Sustainability Research • Many dimensions, many coordinate systems, many scales, many formats, a long-tail of providers and users, … • Managing this data is a drag on productivity…
  • 14. The Long Tail in Research • Individuals/small groups where: – Scale of research prohibits traditional CI development, dedicated IT support, full-time curator… – shared data but multiple disciplinary views – Projects involve reference data from external sources – Project Team does not control formats and vocabularies These are not just “challenges for the “future
  • 15. Analyzing the curation/preservation problem… • Data and Metadata are known well during the project • Producers actually memorize or record metadata already, and then spend precious time transferring that between people and systems • Data users manually assemble missing data/metadata but don’t often have a way to share that with others • Repositories struggle to attain the domain understanding needed to go beyond basic bibliographic info – Repositories only use metadata to help with data discovery and internal curation decisions Producers Users Bill Michener – DataONE Jim Myers - SEAD Who knows what? When do they know it? Why will they tell you?
  • 16. Our collective legacy • • • • Data can only be in one place… Data transfer is costly… Mistakes are costly… Only the future needs well-organized data  (questionable assumptions) • Curation only happens at data/project/center end-of-life • Submission events must be formal and complete • Only cross-trained professionals are capable of getting it right • Researchers should see curation only as a public service
  • 17. What’s different for users? • When you add a file: – You can get it back, from anywhere – You can see your video, zoom in on images, overlay spatial data on maps and retrieve them from an OGC service endpoint – You see the metadata hidden in the file – You can add titles, descriptions, locations, tags later, not as required parts of a long submit form, and • When you do, they are search terms and ways to create custom maps – You can add good data and bad, and figure out which data to keep later (using provenance to guide you) – Users of your data can add metadata, comments, and derived datasets that improve quality, adapt the data for new purposes, etc.
  • 18. What’s different for curators • Curation starts with data and metadata in hand, not as a search through dusty disks • Curators can embed with project teams • Data comes with – Formal metadata (dc:creator= ) – Informal metadata (,2008:/tag#bpnm) – Context! (“bpnm” in the WSC_Reach project always means “Birds Point/New Madrid”) – Producers and users – conversations are possible • Packaging, repository selection, submission, registration with catalogs are all automated/semiautomated…
  • 19. SEAD Concept • Leverage incremental, informal active use to capture data and metadata from first sources • Provide data-related (metadata-driven) services to active producers and users of data • Simplify and automate curation and preservation processes using captured information and context • Leverage existing institutional repository technologies and organizations to provide longterm storage Increase Value, Lower Costs, Increase Immediacy
  • 20. SEAD is: • Write once, re-use • Extensible (data, metadata) – within sustainability research and beyond • Incremental • Living datasets  published Research Objects • Scalable A tool for data producers and users… that also provides a long-term data plan… that can be sustainable at community scale
  • 21. How? • Web 2.0, Web 3.0… • Strong collaboration with researchers and curators • Leveraging standards – vocabularies, service endpoints, transfer protocols, submission packages, … • Leveraging existing software – Medici/Tupelo, VIVO, DataConservancy + Jena, GWT, Geoserver, MySql, Fuseki, …
  • 22. Current Status • 10 hosted project spaces for pilot groups on VM farm + community VIVO, VA servers – ~< 2 TB, ~1800 profiles, proof-of-concept submissions to UI and IU institutional repositories • 1.0 OSS release in November, operating as a DataOne production Member Node (next week) – Google sign-in, cybersecurity and usability enhancements, data-maturity-based access control, dashboard, public discovery, and geobrowse interfaces, … • Project info: • Demo Space:
  • 23. Going forward – Version 1.0 released – Open early adopter period – Improving scalability – Exploring social feedback mechanisms to further improve curation – add value, remove costs, engage producers, users, and curators – Active outreach: Use SEAD! (software or services), Extend SEAD!, collaborate with SEAD!
  • 24. Acknowledgements • SEAD Team @ UM, UI, IU • NSF • NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability researchers • and Thank You! … stop by the SEAD booth and share your thoughts!
  • 25. SEAD: Components/ Communications HTTP Links/ Embedded Content SEAD VIVO: Browse Through People , Projects, Publications, Data Citations , and Organizations, Visualize Networks and Community Dynamics Main Website: Overview, Project Info, Services, Documentation, News SPARQL Queries HTTP Data/DOI links Active Content Repository (multiple webapps): Branded Public Access Active Project Spaces Individual Data Pages BAGIT Data/ Metadata Transfer SEAD Virtual Archive: Policy Driven Curation Institutional/Cloud/Grid Storage Faceted Search for Reference Data
  • 26. Web Application User Management Data/ Metadata Mgmt Desktop Drop Box Android Upload Branded Repository Geo-webapp Project Summary Web Service APIs Role-based Access Control Extractors and Indexing Tupelo 2 RDF + Files MySQL Search Page Admin Page Map Page Tag Page Collection Pages Data Pages SEAD Active Content Repository Architecture Lucene Modified/ Configured Medici/Tupelo 2 Components Geoserver Local File System SEAD ACR Additions and 3RD Party Components
  • 27. Temporal Visualization Network Visualization Data Citations Organizations Publications Projects People SEAD VIVO Architecture Input Form/Display Generation Internal APIs User Management Joseki/Fuseki/Web Services Entity Management Analytics Jena/RDF MySQL Local File System
  • 28. Geo-spatial Search Facet Search Matchmaking Ingest Processing Curator’s Workbench SEAD Virtual Archive Architecture Web Services APIs Metadata Extraction/ Persistent Identifier/ Indexing/Archival (Adapted DC Workflow) Solr Matchmaker/ DataONE Geospatial BagIt Query Member Query Repository Conversion (XML) Management Node Service Service Solr Indexer PostGIS SWORD Local File System UIUC Ideals IUScholarworks Archival Storage