Building a Data Discovery Network for Sustainability Science

This is the slide deck for my SEAD presentation at the 3rd International VIVO Conference, held on August 24, 2012 in Miami, FL.

  • A collection of heterogeneous files. Users can tag and comment on the entire collection, and can also tag and comment on the individual objects in it. Note: extraction services and previewers are all driven by the file MIME type. Extraction services are customizable and are designed to automate the creation of derived data products from the file being uploaded. Examples follow…
  • Lidar data saved as .png. The image extraction service does the following: creates the thumbnail and preview image; creates an image pyramid of the image (zoom/pan large images without downloading the entire image, via the SeaDragon web app); extracts all header information from the image file, including Exif, GPS, Interoperability, etc. Extracted data is viewed by clicking on the “Extracted Information” section.
  • A data set saved as a simple ASCII text file. Users can preview the first 80 lines of the text file.
  • Preview the contents of .csv files
  • Simple map image. User-defined information. Image is part of multiple collections. Image is tagged.
  • 3 images (3 clicks). Standard Medici info. Scroll down to show location and annotation. This image file also contained geolocation data, which becomes visible in “Location”. Geolocation can be extracted from the image Exif data, or authors can add a geolocation to any file in the repository. Note the creator tag and VIVO reference.
  • TIF support: a relatively large 71 MB file. Clicks: click Zoom to enable SeaDragon to explore the details of the file via zoom and pan with the mouse. Click the lower right icon to enable full screen. Use the + or – key (or the mouse wheel) to zoom; click the image and drag to pan. Click the lower right icon to return to the embedded window in Medici.
  • Image file that contains GPS data which is extracted by Medici as part of the upload process.
  • MPEG file uploads: the extraction service creates a Flash version of the file for preview.
  • PDF files: the extraction service generates an image per page of the file, in this case a slide set from a presentation. Click ‘Pages’ to enable the slide set mode and click on the left or right arrows to navigate the pages. 2 images – click to advance slide.
  • 3D object support: the previewer provides multiple view options of the object, which are accessible from the links above the preview.
  • .shp files: the components of a shapefile get uploaded to Medici as a zip. Medici saves the zip blob and the extraction service registers the contents of the .shp file with GeoServer. OpenStreetMap displays the contents of the zip. Layers are on by default but can be turned off by clicking the ‘show’ button. Opacity of layers can be varied using the opacity scale. (WIP) We plan to embed OpenStreetMap in Medici as a previewer for .shp and .kml.
  • All layers off except Illinois Flood Zone map. Map zoomed into the Champaign region of interest.
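The notes above say that extraction services and previewers are driven by the file MIME type, and that text files get an 80-line preview. A minimal sketch of that dispatch idea follows. The registry, the decorator, and the function names are hypothetical illustrations and are not Medici's actual API.

```python
import mimetypes
from typing import Callable, Dict, List

# Hypothetical registry: MIME type -> extraction functions (not Medici's real API).
EXTRACTORS: Dict[str, List[Callable[[str, bytes], dict]]] = {}


def extractor(mime_type: str):
    """Register a function as an extraction service for one MIME type."""
    def register(func):
        EXTRACTORS.setdefault(mime_type, []).append(func)
        return func
    return register


@extractor("text/plain")
@extractor("text/csv")
def text_preview(name: str, data: bytes) -> dict:
    """Derived product: the first 80 lines of a text file."""
    lines = data.decode("utf-8", errors="replace").splitlines()
    return {"preview": lines[:80]}


def run_extractors(name: str, data: bytes) -> dict:
    """Guess the MIME type from the file name and run every matching extractor."""
    mime_type, _ = mimetypes.guess_type(name)
    derived = {"mime_type": mime_type}
    for func in EXTRACTORS.get(mime_type, []):
        derived.update(func(name, data))
    return derived
```

Adding support for a new file type in this sketch means registering one more function under its MIME type; files with no registered extractor simply produce no derived products.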

Transcript

  • 1. Building a Data Discovery Network for Sustainability Science. Robert H. McDonald, Deputy Director, Data to Insight (D2I) Center; Associate Dean – IU Libraries, Indiana University. rhmcdona@indiana.edu | @mcdonald @SEADdatanet. Presented at the VIVO 2012 Conference, Miami, FL – August 24, 2012. Available from: http://slidesha.re/Q9q8VW. © Trustees of Indiana University. Released under Creative Commons 3.0 unported license; license terms on last slide.
  • 2. NSF DataNet Program. Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.” Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.
  • 3. Current NSF DataNet Projects • SEAD – http://sead-data.net • DataONE – http://www.dataone.org • DataNet Federation Consortium – http://datafed.org • Terra Populus – https://www.pop.umn.edu/terra_pop
  • 4. Sustainable Environment Actionable Data (SEAD) - DataNet. SEAD Partners - http://sead-data.net • SEAD Strategy – Serve scientists and researchers in the “long tail” of science – Leverage social media for discovery of data, interest, and expertise – Move data curation upstream in the data life cycle of science – Take advantage of existing domain and institutional infrastructures (Institutional Repositories, ICPSR) for long-term preservation
  • 5. SEAD TEAMS. Michigan: Margaret Hedstrom (PI), Ann Zimmerman (Co-PI), Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR), Jude Yew. Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Robert Ping. Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd. Illinois: Praveen Kumar (Co-PI), Md Aktaruzzaman, Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA).
  • 6. SEAD 18-month Pilot Phase • Domain Engagement: – National Center for Earth Systems Dynamics (NCESD), Illinois River Basin Observatory – Requirements, Use Cases, Prioritization of Data Types and Services • Active and Social Curation: – Pilot Active Content Repository, VIVO deployments – Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici) • CI for Long-term Access (Virtual Archive): – Data model, protocol design/development – Pilot Federated Repository infrastructure • Education, Outreach, and Training: – Post-doc mentoring – Web site, training materials, meetings, workshops, … • Project Oversight: – Management, reporting, committees – Business model development
  • 7. Sustainability Science [diagram; labels: Science, Cooperation, Technology, Policy, Economics, Poverty & Justice]
  • 8. Data challenges • Heterogeneity of all kinds • Multiple scales • Multidisciplinary • Many small datasets
  • 9. The long tail of scientific research • Small and derived data sets • Heterogeneous data • Multiple sources of data • Short-lived data with long-term value • Value of data grows when combined & integrated
  • 10. SEAD notions of defined Data Phases • Phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on. • Research Data Phase: the data set is a research data collection, owned by an individual and under their control. – Data need not be licensed at this time because it is not ready for broader release – Data need not have permanent IDs because it is still a work in progress – Corresponds to first existence in the Active Curation Repository • Published Phase: the owner of the research data collection determines that the dataset is ready for publication – License terms set – Persistent ID – Made available as part of public profile in VIVO – Activated by user-controlled publish event
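The two data phases on slide 10 can be read as a small state model: a collection starts in the Research Data Phase with no license or persistent ID, and an owner-triggered publish event sets both. The sketch below illustrates that reading only. The class and field names are hypothetical, and the DOI-like string in the usage note is a placeholder, not a real identifier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"
    PUBLISHED = "published"


@dataclass
class DataCollection:
    """Hypothetical model of a collection moving through the two phases."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None   # not required in the research phase
    persistent_id: Optional[str] = None   # not required in the research phase

    def publish(self, requested_by: str, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if requested_by != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED
```

For example, `DataCollection("alice", "Flood zones").publish("alice", "CC BY 3.0", "doi:10.0000/placeholder")` moves the collection into the published phase, while the same call with a different `requested_by` raises `PermissionError`.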
  • 11. SEAD CI Technical Approach [architecture diagram]. Top-level regions: Active and Social Curation; Curation Boundary; OAIS Repository Federation. Component labels: Data Acquisition, Analysis and Simulation; Scholarly Communication; Metadata Management (DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …); Automated Curation Workflow/Rule Engine (operates on metadata, content objects and trigger events); Ingest, Appraisal and Selection; Ingest scripts (fixity, integrity, authentication, transformation); Compound Objects - OAI-ORE; AIPs; VIVO/Linked Data; Active Content Repository; Digital Repository Federation (OAIS compliant); Preservation Actions; Dissemination Packages; Wide-Area File System; Migration and Emulation; Search, Browse, and Access Mechanisms; Annotation, Visualization Tools; Use, Reuse, Repurposing Tools; E-Scholarship Services; Contributor User Tools.
  • 12. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: A standardized data model and federation capability over OAIS-standard Institutional Repositories.
  • 13. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: A robust, replicated distributed file system used as a large-scale backing store.
  • 14. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: An Active Content Repository based on standard global IDs and semantic web technologies, to collect and integrate data, metadata, and provenance information from multiple sources. Example relations shown: DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor… Additional label: Lustre File System.
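Slide 14 names DC:Creator and OPM:wasDerivedFrom as examples of the relations the Active Content Repository collects. As a rough illustration of how such statements can be integrated and queried, the sketch below keeps them as subject/predicate/object triples in plain Python and walks the derivation links. The `ex:` identifiers are made up, the in-memory set stands in for a real triple store, and the walk follows only one parent per step for simplicity.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Predicates are written as on the slide; all "ex:" identifiers are hypothetical.
triples: Set[Triple] = {
    ("ex:flood_map_v2", "DC:Creator", "ex:person/jane"),
    ("ex:flood_map_v2", "OPM:wasDerivedFrom", "ex:flood_map_v1"),
    ("ex:flood_map_v1", "OPM:wasDerivedFrom", "ex:lidar_raw"),
    ("ex:lidar_raw", "DC:Creator", "ex:person/ali"),
}


def objects(subject: str, predicate: str, store: Set[Triple]) -> List[str]:
    """All objects for a given subject and predicate, in sorted order."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


def derivation_chain(item: str, store: Set[Triple]) -> List[str]:
    """Follow OPM:wasDerivedFrom links back toward the original source."""
    chain: List[str] = []
    seen = {item}
    current = item
    while True:
        parents = objects(current, "OPM:wasDerivedFrom", store)
        # Stop at a source with no parent, or if a cycle is detected.
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)
```

Because every statement has the same three-part shape, records from different sources can be merged by taking the union of their triple sets, which is the property the callout relies on.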
  • 15. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: SEAD will run a VIVO instance and may harvest Linked Data from other sources. VIVO Application: open source, federatable researcher information – people, papers, projects, centers, fields, etc.
  • 16. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: Active and Social Curation Services supporting automated and interactive use of SEAD, leveraging standard web application/web service toolkits and virtual machine infrastructure.
  • 17. SEAD CI Technical Approach [same architecture diagram as slide 11]. Callout: Curation and Preservation Services, also leveraging standard web application/web service toolkits and virtual machine infrastructure.
  • 18. [Same architecture diagram as slide 11, shown without a title or callout.]
  • 19. SEAD Active/Social Curation Repository
  • 20. SEAD VIVO: RIS2N3
  • 21. SEAD Virtual Archive
  • 22. Faceted search (Solr-based) [screenshot; label: Facets]
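Slide 22 says only that the faceted search is Solr-based. As a hedged illustration of what a faceted request to Solr generally looks like, the sketch below builds a select URL using Solr's standard `q`, `fq`, `facet`, and `facet.field` parameters. The base URL, core name, and field names are assumptions and are not taken from the SEAD Virtual Archive.

```python
from typing import Dict, List
from urllib.parse import urlencode


def facet_query_url(base_url: str, text: str,
                    facet_fields: List[str], filters: Dict[str, str]) -> str:
    """Build a Solr select URL that asks for facet counts on the given fields."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field whose value counts should be returned.
    params += [("facet.field", name) for name in facet_fields]
    # Each facet the user has already clicked becomes a filter query (fq).
    params += [("fq", f'{name}:"{value}"') for name, value in filters.items()]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = facet_query_url(
    "http://localhost:8983/solr/sead",
    "flood",
    ["creator", "format"],
    {"format": "image/png"},
)
```

Selecting a facet value in the user interface corresponds to adding one more entry to `filters`, which narrows the result set while the remaining facet counts are recomputed by Solr.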
  • 23. Search Result
  • 24. A dataset or file looks like this
  • 25. Geospatial search (from Postgres index)
  • 26. Geospatial search results
  • 27. Login for data upload
  • 28. Upload file. Files from Medici can also be added.
  • 29. Create collection (can have multiple files)
  • 30. Upload complete
  • 31. Data ingested to DSpace (Mississippi example)
  • 32. SEAD Virtual Archive Architecture [diagram]. Component labels: SEAD Ingest Client/UI; SIP (Data + Core Metadata); Validation (Fixity, Virus); Feature Extraction; SIP breakdown; Domain Metadata + Data; Preservation Metadata Generation (Events); AIP; Ack; Solr Property + Data Index; PostgreSQL Geospatial + Temporal Metadata Index; DataCite ID Server; VIVO; IU IR (DSpace server). Annotations: Store data object, its metadata object, and its relationship record (latter as RDF) in IR as a collection. Obtain DOI from DataCite. Register DOI with VIVO. Register metadata to Solr and PostgreSQL for rapid retrieval of metadata.
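The architecture slide names a path from SIP through validation (fixity), feature extraction, and preservation metadata generation (events) to an AIP. The sketch below walks that path under simplifying assumptions: fixity is a SHA-256 comparison, feature extraction is reduced to recording the byte size, and the virus scan, DOI registration, and indexing steps are omitted. All class and function names are hypothetical and do not describe the actual SEAD implementation.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class SIP:
    """Submission package: data plus core metadata and a declared checksum."""
    data: bytes
    core_metadata: Dict[str, str]
    declared_sha256: str


@dataclass
class AIP:
    """Archival package: data, metadata, and a log of preservation events."""
    data: bytes
    metadata: Dict[str, str]
    events: List[Dict[str, str]] = field(default_factory=list)


def record_event(aip: AIP, event_type: str, outcome: str) -> None:
    """Append a timestamped preservation event to the package."""
    aip.events.append({
        "type": event_type,
        "outcome": outcome,
        "time": datetime.now(timezone.utc).isoformat(),
    })


def ingest(sip: SIP) -> AIP:
    """Validate fixity, extract a simple feature, and assemble the AIP."""
    actual = hashlib.sha256(sip.data).hexdigest()
    if actual != sip.declared_sha256:
        raise ValueError("fixity check failed: checksum mismatch")
    aip = AIP(data=sip.data, metadata=dict(sip.core_metadata))
    record_event(aip, "fixity check", "pass")
    # Stand-in for feature extraction: record the size of the data object.
    aip.metadata["size_bytes"] = str(len(sip.data))
    record_event(aip, "feature extraction", "pass")
    return aip
```

A caller would compute `hashlib.sha256(data).hexdigest()` at the source, send it with the SIP, and treat a `ValueError` from `ingest` as a signal that the transfer was corrupted.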
  • 33. Key Questions for SEAD Prototype • What could SEAD capture when? • How can SEAD provide direct value to data producers, users, and curators? • How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?
  • 34. Towards A Shared Data Future [layered diagram]. Data Generators and Users: user functionalities, data capture & transfer, virtual research environments. Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability. Common Data Services: persistent storage, identification, authenticity (provenance), workflow execution, data mining. Side labels: Data Curation; Trust. Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010
  • 35. Data Interoperability • NSF OCI: DataNet and INTEROP now DIBBs • EUDAT • Data Web Forum • IETF Research Data Identifier BOF • Upcoming Oct. US Meeting of DataNet, INTEROP, Data Web Forum
  • 36. Acknowledgements. SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824 • For more on SEAD go to: http://sead-data.net • Follow us on Twitter @SEADdatanet
  • 37. License terms • Please cite as: McDonald, R.H. et al. Building a Data Discovery Network for Sustainability Science. 3rd International VIVO Conference, Miami, FL, 24 August 2012. Available from: [http://slidesha.re/Q9q8VW] • Thanks to Margaret Hedstrom, who’s guided the team through the (really) lengthy review process, and to Jim Myers, Beth Plale, Praveen Kumar, Terry McLaren, Luigi Marini, Kavitha Chandrasekar and others who provided content for this presentation. • The concepts and software being leveraged in SEAD represent the work of a broad range of people over multiple years – their contributions have been critical to launching SEAD. • Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse. • This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.