Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value

Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert
McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4
myersjd@umich.edu
1 School of Information, University of Michigan, Ann Arbor, MI, United States.
2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States.
3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States.
4 Data To Insight Center, Indiana University, Bloomington, IN, United States.
5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.

Outline
• Quick Project Intro
• What is SEAD? (Stop by the SEAD booth!)
• Why is SEAD?
• How does SEAD work?
• Future active and social curation work

SEAD: Sustainable Environment Actionable Data
• An NSF DataNet project started in
October, 2011
• An international resource for
sustainability science
• A provider of light-weight Data Services
based on novel technical and business
approaches:
– Supporting the long-tail of research
– Enabling active and social curation
– Providing integrated lifecycle support for data
http://sead-data.net/

Margaret Hedstrom, PI
Praveen Kumar, co-PI
Jim Myers, co-PI
Beth Plale, co-PI

Sustainability Research
• Central to solving many of society’s most critical
challenges
• An exemplar of modern research
– Local processes aggregating to produce global consequences
– Multiple time scales
– Coupling of natural and human systems
– Interacting systems-of-systems requiring multidisciplinary understanding
• Environmental – Economic – Social
[Figure labels: Science, Cooperation, Technology, Policy, Economics, Poverty & Justice]

SEAD is:
• Data discovery
• Project workspaces
• A data-aware
community network
• Curation and
preservation services
that link to multiple archives and discovery
services

SEAD is:
• Secure project spaces where teams can:
– Gather reference data
– Upload and share new results
– Annotate
– Relate
– Organize
– Publish

Project Dashboard

SEAD is:
• An active repository that creates data pages with
– Previews
– Extracted Metadata
– Overlays
– Tags
– Comments
– Provenance
– Use information
– Download/Embed

SEAD is:
• A tool for community exploration:
– Personal and
Project Profiles
– Publications and
Data Citations
– Co-author,
co-investigator
graphs
– Temporal analysis

SEAD is:
• A way to preprint and publish
data:
– Branded interface
– Discovery metadata
– Drill-down
• Sub-collections
• Data Pages

– Submit for curation and
preservation

The National Center for Earth Surface Dynamics
~1.6 TB, 450K files (2.2 M objects) representing 10
years of research by multiple teams

SEAD is:
• A community platform for reference data:
– Research Object
management
– Inference
– Curation
– Preservation
– ID assignment
– Catalog Registration
– Discovery
– Citation Generation

SEAD’s Virtual Archive allows curators to
access, assess, enhance, package, and submit
data from SEAD project repositories for long-term storage in SEAD-managed storage or
external institutional repositories and cloud
data services.

– Apps read what they need and write what they know
– Curation snapshots meaningful Research Objects
– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
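
The snapshot idea above can be illustrated with a minimal sketch. It assumes a toy in-memory triple store; `LivingStore`, `snapshot`, and the item names are hypothetical and are not SEAD's actual API. The point is only that several Research Objects (ROs) can be frozen from the same content while that content keeps changing.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Optional, Set, Tuple

# A triple is (subject, predicate, object); plain strings keep the sketch simple.
Triple = Tuple[str, str, str]


@dataclass
class LivingStore:
    """Hypothetical stand-in for the 'living' content that apps read and write."""
    triples: Set[Triple] = field(default_factory=set)

    def write(self, subject: str, predicate: str, obj: str) -> None:
        # Apps "write what they know": each statement is added independently.
        self.triples.add((subject, predicate, obj))

    def read(self, subject: Optional[str] = None) -> Set[Triple]:
        # Apps "read what they need": filter by subject when one is given.
        return {t for t in self.triples if subject is None or t[0] == subject}


@dataclass(frozen=True)
class ResearchObject:
    """An immutable snapshot of the statements about a chosen set of items."""
    name: str
    members: FrozenSet[str]
    triples: FrozenSet[Triple]


def snapshot(store: LivingStore, name: str, members: Iterable[str]) -> ResearchObject:
    chosen = frozenset(members)
    frozen = frozenset(t for t in store.triples if t[0] in chosen)
    return ResearchObject(name, chosen, frozen)


store = LivingStore()
store.write("file:1", "dc:title", "Levee survey")
store.write("file:1", "tag", "bpnm")
store.write("file:2", "dc:title", "Flood map")

# Two ROs re-use the same underlying item, file:1.
ro_a = snapshot(store, "RO-A", {"file:1"})
ro_b = snapshot(store, "RO-B", {"file:1", "file:2"})

# The living content keeps evolving; the earlier snapshots are unaffected.
store.write("file:1", "comment", "recalibrated")
```

In this sketch `ro_a` still holds two statements after the later comment is written, while the store holds three for `file:1`.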

Flickr-style web management of data

Sensor data

Semantic Content Middleware
over Scalable File System and
Triple Store

Geospatial, social
network mash-ups,
workflows and services
Curation Services to harvest
and package specific data sets

Federation of OAI
repositories for
long-term
preservation

Why is SEAD needed for curation?
• The nature of modern research
• The nature of the data documentation
problem
• Artificial limitations derived from historical
practice
Unless these issues are addressed (in addition to
sheer scale), data curation will remain too
cumbersome and expensive for ubiquitous use…

Data Challenges in Sustainability Research
• Many dimensions, many coordinate systems, many scales,
many formats, a long-tail of providers and users, …
• Managing this data is a drag on productivity…

The Long Tail in Research
• Individuals/small groups
where:
– Scale of research prohibits traditional CI
development, dedicated IT support, full-time
curator…
– shared data but multiple disciplinary views
– Projects involve reference data from external
sources
– Project Team does not control formats and
vocabularies

These are not just “challenges for the future”

Analyzing the curation/preservation
problem…
• Data and Metadata are known well
during the project
• Producers actually memorize or
record metadata already, and then
spend precious time transferring that
between people and systems
• Data users manually assemble
missing data/metadata but don’t
often have a way to share that with
others
• Repositories struggle to attain the
domain understanding needed to go
beyond basic bibliographic info
– Repositories only use metadata to
help with data discovery and internal
curation decisions

Producers

Users

Bill Michener – DataONE
Jim Myers - SEAD

Who knows what?
When do they know it?
Why will they tell you?

Our collective legacy (questionable assumptions)
• Data can only be in one place…
• Data transfer is costly…
• Mistakes are costly…
• Only the future needs well-organized data
• Curation only happens at data/project/center end-of-life
• Submission events must be formal and complete
• Only cross-trained professionals are capable of getting it
right
• Researchers should see curation only as a public service

What’s different for users?
• When you add a file:
– You can get it back, from anywhere
– You can see your video, zoom in on images, overlay spatial
data on maps and retrieve them from an OGC service
endpoint
– You see the metadata hidden in the file
– You can add titles, descriptions, locations, tags later, not as
required parts of a long submit form, and
• When you do, they are search terms and ways to create custom
maps

– You can add good data and bad, and figure out which data
to keep later (using provenance to guide you)
– Users of your data can add metadata, comments, and
derived datasets that improve quality, adapt the data for
new purposes, etc.

What’s different for curators?
• Curation starts with data and metadata in hand, not as
a search through dusty disks
• Curators can embed with project teams
• Data comes with
– Formal metadata (dc:creator= http://vivo-vis-test.slis.indiana.edu/vivo/individual/n7732 )
– Informal metadata (http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag
tag:cet.ncsa.uiuc.edu,2008:/tag#bpnm)

– Context! (“bpnm” in the WSC_Reach project always means “Birds Point/New Madrid”)
– Producers and users – conversations are possible

• Packaging, repository selection, submission, and
registration with catalogs are all automated or semi-automated…
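
As a rough illustration of how formal metadata, informal tags, and project context might combine, here is a small sketch. The creator URI, tag predicate, and tag value are the ones quoted on this slide; the dataset URI, the `PROJECT_CONTEXT` glossary, and the `describe` helper are hypothetical and only stand in for whatever SEAD actually does.

```python
# Dublin Core creator property (formal metadata).
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
# Tagging predicate and tag namespace as quoted on the slide (informal metadata).
TAGGED_WITH = "http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag"
TAG_PREFIX = "tag:cet.ncsa.uiuc.edu,2008:/tag#"

# Hypothetical dataset identifier, used only for this example.
DATASET = "urn:example:dataset/1"

triples = [
    (DATASET, DC_CREATOR,
     "http://vivo-vis-test.slis.indiana.edu/vivo/individual/n7732"),
    (DATASET, TAGGED_WITH, TAG_PREFIX + "bpnm"),
]

# Assumed per-project glossary: the context a curator embedded in the team has.
PROJECT_CONTEXT = {"WSC_Reach": {"bpnm": "Birds Point/New Madrid"}}


def describe(statements, project):
    """Split statements into creators and tags, expanding tags via project context."""
    glossary = PROJECT_CONTEXT.get(project, {})
    summary = {"creators": [], "tags": []}
    for _subject, predicate, obj in statements:
        if predicate == DC_CREATOR:
            summary["creators"].append(obj)
        elif predicate == TAGGED_WITH and obj.startswith(TAG_PREFIX):
            label = obj[len(TAG_PREFIX):]
            # Fall back to the raw tag when the project has no entry for it.
            summary["tags"].append(glossary.get(label, label))
    return summary


summary = describe(triples, "WSC_Reach")
```

Outside the project context the same tag stays an opaque string, which is the gap that embedding curators with project teams is meant to close.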

SEAD Concept
• Leverage incremental, informal active use to
capture data and metadata from first sources
• Provide data-related (metadata-driven) services
to active producers and users of data
• Simplify and automate curation and preservation
processes using captured information and
context
• Leverage existing institutional repository
technologies and organizations to provide long-term storage
Increase Value, Lower Costs, Increase Immediacy

SEAD is:
• Write once, re-use
• Extensible (data, metadata) – within sustainability
research and beyond
• Incremental
• Living datasets → published Research Objects
• Scalable
A tool for data producers and users…
that also provides a long-term data plan…
that can be sustainable at community scale

How?
• Web 2.0, Web 3.0…
• Strong collaboration with researchers and
curators

• Leveraging standards – vocabularies, service
endpoints, transfer protocols, submission
packages, …
• Leveraging existing software – Medici/Tupelo,
VIVO, DataConservancy + Jena, GWT, Geoserver,
MySQL, Fuseki, …

Current Status
• 10 hosted project spaces for pilot groups on VM farm +
community VIVO, VA servers
– ~< 2 TB, ~1800 profiles, proof-of-concept submissions to UI
and IU institutional repositories

• 1.0 OSS release in November, operating as a DataONE
production Member Node (next week)
– Google sign-in, cybersecurity and usability enhancements,
data-maturity-based access control, dashboard, public
discovery, and geobrowse interfaces, …

• Project info: http://sead-data.net
• Demo Space: http://sead-demo.ncsa.illinois.edu

Going forward
– Version 1.0 released
– Open early adopter period
– Improving scalability
– Exploring social feedback mechanisms to further
improve curation – add value, remove costs,
engage producers, users, and curators
– Active outreach: Use SEAD! (software or services),
Extend SEAD!, Collaborate with SEAD!

Acknowledgements
• SEAD Team @ UM, UI, IU
• NSF
• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other
sustainability researchers
• and Thank You!
… stop by the SEAD booth and share your thoughts!

http://sead-data.net/

SEAD: Components/Communications
• SEAD VIVO: Browse through People, Projects, Publications, Data Citations, and Organizations; visualize Networks and Community Dynamics
• Main Website: Overview, Project Info, Services, Documentation, News
• Active Content Repository (multiple webapps): Branded Public Access, Active Project Spaces, Individual Data Pages
• SEAD Virtual Archive: Policy Driven Curation, Institutional/Cloud/Grid Storage, Faceted Search for Reference Data
• Connections between components: HTTP Links/Embedded Content, SPARQL Queries, HTTP Data/DOI links, BagIt Data/Metadata Transfer
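
For readers unfamiliar with BagIt, the sketch below shows the general shape of a bag: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest, following the BagIt specification (RFC 8493). It is a minimal illustration under assumed file names and contents, not the packaging code SEAD uses; `make_bag` and `verify_bag` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, files: dict) -> Path:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Example payload: one data file and one metadata file (made-up contents).
    bag = make_bag(Path(tmp) / "example-bag", {
        "readings.csv": b"site,value\nA,1.5\n",
        "metadata.json": b'{"title": "Example dataset"}',
    })
    bag_ok = verify_bag(bag)
    # Simulate corruption in transit: the receiving side should notice.
    (bag / "data" / "readings.csv").write_bytes(b"site,value\nA,9.9\n")
    tamper_detected = not verify_bag(bag)
```

The manifest is what lets the receiving archive confirm that data and metadata arrived intact, independent of the transfer protocol used.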

SEAD Active Content Repository Architecture
• Diagram components: Web Application, User Management, Data/Metadata Mgmt, Desktop Drop Box, Android Upload, Branded Repository, Geo-webapp, Project Summary, Web Service APIs, Role-based Access Control, Extractors and Indexing, Tupelo 2 (RDF + Files), MySQL, Search Page, Admin Page, Map Page, Tag Page, Collection Pages, Data Pages, Lucene, Geoserver, Local File System
• Diagram legend: Modified/Configured Medici/Tupelo 2 Components; SEAD ACR Additions and 3rd Party Components

SEAD VIVO Architecture
• Diagram components: Temporal Visualization, Network Visualization, Data Citations, Organizations, Publications, Projects, People, Input Form/Display Generation, Internal APIs, User Management, Joseki/Fuseki/Web Services, Entity Management, Analytics, Jena/RDF, MySQL, Local File System

SEAD Virtual Archive Architecture
• Diagram components: Geo-spatial Search, Facet Search, Matchmaking, Ingest Processing, Curator’s Workbench, Web Services, APIs, Metadata Extraction/Persistent Identifier/Indexing/Archival (Adapted DC Workflow), Solr Query (XML), Matchmaker/Repository Management, DataONE Member Node Service, Geospatial Query Service, BagIt Conversion, Solr Indexer, PostGIS, SWORD, Local File System, UIUC Ideals, IUScholarworks, Archival Storage
