2. NSF DataNet Program
Motivation:
“… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”
Response:
DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.
4. Sustainable Environment Actionable Data (SEAD) - DataNet
• SEAD Strategy (SEAD Partners - http://sead-data.net)
― Serve scientists and researchers in the “long tail” of science
― Leverage social media for discovery of data, interest, and expertise
― Move data curation upstream in the data life cycle of science
― Take advantage of existing domain and institutional infrastructures (Institutional Repositories, ICPSR) for long-term preservation
5. SEAD TEAMS
Michigan: Margaret Hedstrom-PI, Ann Zimmerman-Co-PI, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR), Jude Yew
Indiana: Beth Plale-Co-PI, Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Robert Ping
Rensselaer: James Myers-Co-PI, Ram Prasanna Govind Krishnan, Lindsay Todd
Illinois: Praveen Kumar-Co-PI, Md Aktaruzzaman, Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)
6. SEAD 18-month Pilot Phase
• Domain Engagement:
– National Center for Earth Systems Dynamics (NCESD), Illinois River Basin Observatory
– Requirements, Use Cases, Prioritization of Data Types and Services
• Active and Social Curation
– Pilot Active Content Repository, VIVO deployments
– Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)
• CI for Long-term Access (Virtual Archive)
– Data model, protocol design/development
– Pilot Federated Repository infrastructure
• Education, Outreach, and Training
– Post-doc mentoring
– Web site, training materials, meetings, workshops, …
• Project Oversight
– Management, reporting, committees
– Business model development
8. Data challenges
• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets
9. The long tail of scientific research
• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated
11. SEAD notions of defined Data Phases
• Phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on.
• Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
– Data need not be licensed at this time because it is not ready for broader release
– Data need not have permanent IDs because it is still a work in progress
– Corresponds to first existence in the Active Curation Repository
• Published Phase: the owner of the research data collection determines that the dataset is ready for publication
– License terms set
– Persistent ID
– Made available as part of public profile in VIVO
– Activated by user-controlled publish event
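The two phases above can be read as a small state machine. The following is a minimal sketch under stated assumptions, not SEAD's implementation: the class names, field names, and the placeholder identifier in the usage note are all hypothetical and chosen only to mirror the bullets (owner control, no license or ID required before publication, a user-controlled publish event).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # owner-controlled work in progress
    PUBLISHED = "published"  # released, licensed, persistently identified


@dataclass
class DataCollection:
    """Hypothetical model of a research data collection."""
    owner: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None  # not required in the research phase
    persistent_id: Optional[str] = None  # not required in the research phase

    def publish(self, requested_by: str, license_terms: str,
                persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if requested_by != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publication needs license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED
```

For example, `DataCollection(owner="alice")` starts in the research phase with no license or ID; calling `publish("alice", "CC-BY-4.0", "doi:10.0000/example")` (a made-up identifier) moves it to the published phase.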
12. SEAD CI Technical Approach
[Architecture diagram. Recoverable labels:]
• Active and Social Curation | Curation Boundary | OAIS Repository Federation
• Automated Curation Workflow/Rule Engine: operates on Metadata, Content Objects, and Trigger Events
• Data Acquisition, Analysis and Simulation; Metadata Management; Scholarly Communication
• Metadata standards: DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …
• Ingest, Appraisal and Selection; ingest scripts: fixity, integrity, authentication, transformation
• AIPs; Compound Objects - OAI-ORE
• VIVO/Linked Data; Active Content Repository
• Digital Repository Federation (OAIS compliant); Preservation Actions; Migration and Emulation; Dissemination Packages
• Wide-Area File System
• Search, Browse, and Access Mechanisms; Annotation, Visualization Tools; Use, Reuse, Repurposing Tools; E-Scholarship Services
• Contributor and User Tools
13. SEAD CI Technical Approach
[Architecture diagram repeated, with callout:]
• A standardized data model and federation capability over OAIS-standard Institutional Repositories
• Box label in this view: SEAD Active Data Systems
15. SEAD CI Technical Approach
[Architecture diagram repeated, with callout:]
• A robust, replicated distributed file system used as a large-scale backing store
• Federation box label in this view: SEAD Trusted Digital Repository Federation (OAIS compliant)
16. SEAD CI Technical Approach
[Architecture diagram repeated, with callout:]
• An Active Content Repository based on standard global IDs and semantic web technologies, to collect and integrate data, metadata, and provenance information from multiple sources
• Example relations between Content items: DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor…
• Wide-Area File System: Lustre File System
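The relations named on this slide (DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor) can be illustrated as subject-predicate-object statements over globally identified content items. This is only a sketch: the identifiers (`content:raw-scan`, `person:field-team`, and so on) are invented for illustration, and plain tuples stand in for whatever RDF store the repository actually uses.

```python
# Each statement is a (subject, predicate, object) triple. All identifiers
# below are hypothetical examples, not real SEAD content IDs.
triples = [
    ("content:raw-scan", "dc:creator", "person:field-team"),
    ("content:cleaned-grid", "opm:wasDerivedFrom", "content:raw-scan"),
    ("content:flood-map", "opm:wasDerivedFrom", "content:cleaned-grid"),
    ("content:flood-map", "swan:isEvidenceFor", "claim:flood-extent"),
]


def build_index(statements):
    """Index objects by (subject, predicate) for quick lookup."""
    index = {}
    for subject, predicate, obj in statements:
        index.setdefault((subject, predicate), []).append(obj)
    return index


def lineage(index, item):
    """Follow opm:wasDerivedFrom links back toward the original sources."""
    seen, stack, order = set(), [item], []
    while stack:
        current = stack.pop()
        for parent in index.get((current, "opm:wasDerivedFrom"), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                stack.append(parent)
    return order
```

With this data, `lineage(build_index(triples), "content:flood-map")` walks from the map to the cleaned grid and then to the raw scan, which is the kind of provenance chain the callout describes integrating from multiple sources.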
18. SEAD CI Technical Approach
[Architecture diagram repeated, with callouts:]
• SEAD will run a VIVO instance and may harvest Linked Data from other sources
• VIVO Application: open-source, federatable researcher information (people, papers, projects, centers, fields, etc.)
20. SEAD CI Technical Approach
[Architecture diagram repeated, with callout:]
• Active and Social Curation Services supporting automated and interactive use of the SEAD repository, leveraging standard web application/web service toolkits and virtual machine infrastructure
21. SEAD CI Technical Approach
[Architecture diagram repeated, with callout:]
• Curation and Preservation Services, also leveraging standard web application/web service toolkits and virtual machine infrastructure
22. Active and Social Curation | OAIS Repository Federation
[Architecture diagram repeated without callouts.]
49. SEAD Virtual Archive Architecture
[Architecture diagram. Recoverable labels:]
• External services: VIVO; IU IR (DSpace server); DataCite ID Server
• Pipeline boxes: SEAD Client/UI; SIP (Data + Core Metadata); Ingest; Validation (Fixity, Virus); Feature Extraction from Data; SIP/AIP breakdown; Ack
• Store data object, its metadata object, and its relationship record (the latter as RDF) in the IR as a collection
• Obtain DOI from DataCite; register DOI with VIVO
• Register metadata to Solr and PostgreSQL for rapid retrieval of metadata
– Solr Index: Property + Domain Metadata + Data
– PostgreSQL Index: Geospatial + Temporal Metadata
• Preservation Metadata Generation (Events)
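The ingest stages named in this diagram (fixity validation, feature extraction, SIP-to-AIP breakdown, metadata indexing, preservation events) can be sketched end to end. This is a hedged illustration only: the order of stages is inferred from the labels, the `SIP` and `AIP` shapes and event names are hypothetical, a plain dictionary stands in for the Solr and PostgreSQL indexes, and the virus check, DOI minting, and repository storage steps are left out.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission package: data plus core metadata (hypothetical shape)."""
    data: bytes
    core_metadata: dict
    declared_sha256: str


@dataclass
class AIP:
    """Archival package: data, merged metadata, and preservation events."""
    data: bytes
    metadata: dict
    events: list = field(default_factory=list)


def validate_fixity(sip: SIP) -> bool:
    """Fixity check: the recomputed checksum must match the declared one."""
    return hashlib.sha256(sip.data).hexdigest() == sip.declared_sha256


def extract_features(data: bytes) -> dict:
    """Stand-in for real feature extraction; records only the size."""
    return {"size_bytes": len(data)}


def ingest(sip: SIP, metadata_index: dict) -> AIP:
    """Validate a SIP, derive metadata, build an AIP, and index the metadata."""
    if not validate_fixity(sip):
        raise ValueError("fixity check failed: checksum mismatch")
    metadata = {**sip.core_metadata, **extract_features(sip.data)}
    aip = AIP(sip.data, metadata,
              events=["fixity-validated", "features-extracted"])
    # A dict stands in for the metadata indexes used for rapid retrieval.
    metadata_index[sip.declared_sha256] = metadata
    aip.events.append("metadata-indexed")
    return aip
```

Recording each step in `events` mirrors the diagram's "Preservation Metadata Generation (Events)" box: the archive keeps a trail of what was done to the object during ingest.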
50. Key Questions for SEAD Prototype
• What could SEAD capture, and when?
• How can SEAD provide direct value to data producers, users, and curators?
• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?
51. Towards A Shared Data Future
[Diagram. Recoverable labels:]
• Data Generators and Users: user functionalities, data capture & transfer, virtual research environments
• Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
• Common Data Services: persistent storage, identification, authenticity (provenance), workflow execution, data mining
• Cross-cutting: Data Curation, Trust
Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010
52. Data Interoperability
• NSF OCI: DataNet and INTEROP, now DIBBs
• EUDAT
• Data Web Forum
• IETF Research Data Identifier BOF
• Upcoming Oct. US Meeting of DataNet, INTEROP, Data Web Forum
53. Acknowledgements
SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824
• For more on SEAD go to: http://sead-data.net
• Follow us on Twitter: @SEADdatanet
A collection of heterogeneous files. Users can tag and comment on the entire ‘collection’ and on the individual objects in it. Note: Extraction services and previewers are all driven by the file MIME type. Extraction services are customizable and are designed to automate the creation of derived data products from the file being uploaded. Examples follow…
Lidar data saved as .png. The image extraction service does the following:
• Creates the thumbnail and preview image
• Creates an image pyramid of the image (zoom/pan large images without downloading the entire image, via the SeaDragon web app)
• Extracts all header information from the image file, including Exif, GPS, Interoperability, etc.
Extracted data is viewed by clicking on the “Extracted Information” section.
A data set saved as a simple ASCII text file. Users can preview the first 80 lines of the text file.
Preview the contents of .csv files
Simple map image
• User-defined information
• Image is part of multiple collections
• Image is tagged
3 images (3 clicks)
• Standard Medici info
• Scroll down to show location and annotation
• This image file also contains geolocation data, which becomes visible in “Location”. Geolocation can be extracted from the image Exif data, or authors can add a geolocation to any file in the repository.
• Note the creator tag and VIVO reference.
TIFF support: relatively large 71 MB file. Clicks:
• Click Zoom to enable SeaDragon to explore the details of the file via zoom and pan with the mouse.
• Click the lower-right icon to enable full screen. Use the + or – key (or the mouse wheel) to zoom; click the image and drag to pan.
• Click the lower-right icon to return to the embedded window in Medici.
Image file that contains GPS data which is extracted by Medici as part of the upload process.
MPEG file uploads: the extraction service creates a Flash version of the file for preview.
PDF files: the extraction service generates an image per page of the file, in this case a slide set from a presentation. Click ‘Pages’ to enable the slide set mode and click the left or right arrows to navigate the pages. 2 images: click to advance slide.
3D object support: the previewer provides multiple view options of the object, accessible from the links above the preview.
.shp files
• The components of a shapefile are uploaded to Medici as a zip.
• Medici saves the zip blob, and the extraction service registers the contents of the .shp file with GeoServer.
• OpenStreetMap displays the contents of the zip.
• Layers are on by default but can be turned off by clicking the ‘show’ button.
• Opacity of layers can be varied using the opacity scale.
• (WIP) We plan to embed OpenStreetMap in Medici as a previewer for .shp and .kml.
All layers off except Illinois Flood Zone map. Map zoomed into the Champaign region of interest.
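The notes above say that extraction services and previewers are driven by the file MIME type. A minimal sketch of that dispatch idea follows. It is not Medici's code: the registry, decorator, and the returned product names are hypothetical, and the MIME type is guessed from the file name with Python's standard `mimetypes` module rather than by inspecting the uploaded content.

```python
import mimetypes

# Registry mapping a MIME type to the extractor that handles it (hypothetical).
EXTRACTORS = {}


def extractor(*mime_types):
    """Decorator that registers a function for one or more MIME types."""
    def register(func):
        for mime in mime_types:
            EXTRACTORS[mime] = func
        return func
    return register


@extractor("image/png", "image/tiff")
def image_extractor(filename):
    # Mirrors the image example: thumbnail, preview, pyramid, header metadata.
    return ["thumbnail", "preview", "image pyramid", "header metadata"]


@extractor("application/pdf")
def pdf_extractor(filename):
    # Mirrors the PDF example: one image per page.
    return ["one image per page"]


@extractor("text/plain")
def text_extractor(filename):
    # Mirrors the text example: preview of the first 80 lines.
    return ["first 80 lines"]


def derived_products(filename):
    """Pick an extractor by MIME type; unknown types yield no products."""
    mime, _encoding = mimetypes.guess_type(filename)
    handler = EXTRACTORS.get(mime)
    return handler(filename) if handler else []
```

Keeping the mapping in a registry is one way to make extraction "customizable": supporting a new file type means registering one more function, without touching the upload path.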