2013 02 data portal science group update -v smith
 

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Presentation Transcript

    • NHM data portal update
      data.nhm.ac.uk
      Part of the informatics initiative (2013-15)
      Vince Smith & Ben Scott
    • The problem – research data: hard to find, access, cite and integrate
      49 NHM science group papers in the last 4 weeks (data via Carolyn Lowry, e-mail, 13th Feb. 2013)
      • 45 available online (4 print only or behind pay walls)
      • 9 had supplementary data files
      • 39 papers with tables, charts & other data
        o >1000 sequences
        o 826 figures
        o 76 tables
        o 1 genome
      • No collective view of these data (37 journals)
      • No consistent way of citing NHM data
      • No mechanism to integrate or version
      • No way to repurpose data (retyping?)
    • The problem – collections data: hard to find, access, cite and integrate
      Initial problems
      • Don’t know / can’t find the website
      • 6 different collections
        o Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
        o Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
        o Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
        o Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
        o Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
        o Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
      • 23 interfaces & datasets of varying importance
      • No priority to collection datasets (119 specimens; up to 28,000,000 specimens)
      • Entomology collections don’t exist (404)
      • Library doesn’t have any online collections!
      Bigger issues
      • Idiosyncratic browse or search
      • No maps, few images & very slow
      • No summary or statistics
      • No download, export or custom views
      • No integration with other data
      • No author info or update info
      • No means of specimen citation
      • No exports to GBIF or associated projects
      The data portal must correct these issues
    • The solution – data.nhm.ac.uk portal: high level issues
      Functional requirements
      • A central access point for NHM research & collections data
      • The capacity to store/link and describe datasets
      • Integrated search & browse of datasets
      • The ability to cite datasets and specimen records in datasets
      • The ability to integrate collections data
      • Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
      • The capacity to download, export & analyse data
      Principles
      • Open-by-default: capacity for embargoed and private data
      • Sustainable: self-populated by NHM staff (except collections data)
      Exclusions
      • Not a replacement for DAMS or KeEMu (a Web interface for these systems)
      • Publications out of scope (focused on datasets)
      • All annotations on data link back to the source (e.g. KeEMu)
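The open-by-default principle above can be illustrated with a small visibility check. This is only a sketch of one possible reading: the function name, the `private` flag and the `embargo_until` date are assumptions made for illustration and do not come from the slides or from any portal code.

```python
from datetime import date
from typing import Optional


def is_publicly_visible(
    today: date,
    private: bool = False,
    embargo_until: Optional[date] = None,
) -> bool:
    """Open by default: a dataset is hidden only if it is explicitly
    marked private or is still under embargo (hypothetical rule)."""
    if private:
        return False
    if embargo_until is not None and today < embargo_until:
        return False
    return True


# With no flags set the dataset is open.
print(is_publicly_visible(date(2013, 2, 20)))
```

Under this reading an embargo lifts automatically once its date has passed, while a private dataset stays hidden until someone changes the flag.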
    • The solution – data.nhm.ac.uk portal: system overview (diagram)
      • Scope (source data): KeEMu (NHM); HerbCat (Kew); other; species dictionary, initiatives, Scratchpads etc.; private user-contributed datasets
      • File types (formats): DwC-A, PhyloXML, neXML, Nexus, Excel, CSV etc.
      • Registry (discovery & download)
      • Dataset groups shown in the diagram: NHM specimens, Kew specimens, other datasets
      • Subportals (branded slices of data): Subportal 1, e.g. Disease initiative; Subportal 2, e.g. Kew / NHM Virtual Herbarium
      • Explorer: map view, table view, statistics view, analytic view (R)
    • Portal overview – adding data sets: quick & easy, semi-automated workflow
      1. Name the dataset
      2. Upload / link the data file
      3. Describe the data file
      4. Theme & tag
      5. Add additional resources
      6. Temporal coverage
      7. Geographic coverage
      8. Save & finish
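The eight workflow steps can be pictured as fields on a single dataset record. The sketch below is illustrative only: the class, its field names, and the choice to treat "additional resources" as optional are assumptions, not a description of the portal's actual data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical record mirroring the workflow steps on the slide."""
    name: str                                             # 1. Name the dataset
    data_file: str                                        # 2. Upload / link the data file (path or URL)
    description: str = ""                                 # 3. Describe the data file
    themes: List[str] = field(default_factory=list)       # 4. Theme & tag
    tags: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)    # 5. Additional resources (assumed optional)
    temporal_coverage: Optional[Tuple[date, date]] = None  # 6. (start, end)
    # 7. Bounding box as (west, south, east, north) in decimal degrees
    geographic_coverage: Optional[Tuple[float, float, float, float]] = None

    def missing_steps(self) -> List[str]:
        """Return the steps still to be completed before step 8, save & finish."""
        checks = [
            ("name", bool(self.name.strip())),
            ("data_file", bool(self.data_file.strip())),
            ("description", bool(self.description.strip())),
            ("themes_and_tags", bool(self.themes or self.tags)),
            ("temporal_coverage", self.temporal_coverage is not None),
            ("geographic_coverage", self.geographic_coverage is not None),
        ]
        return [step for step, done in checks if not done]


record = DatasetRecord(name="Example dataset", data_file="example.csv")
print(record.missing_steps())
```

A semi-automated workflow could fill some of these fields from the uploaded file itself (for example the coverage fields) and prompt the user only for the steps that `missing_steps()` still reports.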
    • Portal overview – search interface: discovering research data sets (screenshot)
      Labels: Search; Browse & search criteria; Results; Datasets matching criteria; Individual dataset; Advanced display options
    • Portal overview – data set display: exploring research data sets (screenshot)
      Labels: Name; Authors; License; Tags; Download; Metadata about the dataset; Technical info. (extracted from data file); Geographic scope; Developer and “Social” tools
    • Portal overview – collections data: main interface (screenshot)
      Labels: Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records; Zoomable map; Applied filters
    • Portal overview – collections data: additional interfaces (screenshot)
      Collections views and specimen record views
      Labels: Tables; Statistical summary; Full record; Summary; Data field mappings; Download preview
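The record counts and applied filters in the main collections interface can be sketched as a small summary function. Everything here is an assumption for illustration: the record layout (`lat` / `lon` keys), the bounding-box filter and the sample values are invented, and the box test ignores regions that cross the antimeridian.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Record = Dict[str, Any]
BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def is_georeferenced(rec: Record) -> bool:
    """A record counts as georeferenced if it has both coordinates."""
    return rec.get("lat") is not None and rec.get("lon") is not None


def in_bbox(rec: Record, bbox: BBox) -> bool:
    """True if the record has coordinates that fall inside the box."""
    west, south, east, north = bbox
    return (
        is_georeferenced(rec)
        and west <= rec["lon"] <= east
        and south <= rec["lat"] <= north
    )


def summarise(records: Iterable[Record], bbox: Optional[BBox] = None) -> Dict[str, int]:
    """Count records and georeferenced records, optionally within a map filter."""
    selected = [r for r in records if bbox is None or in_bbox(r, bbox)]
    return {
        "records": len(selected),
        "georeferenced": sum(1 for r in selected if is_georeferenced(r)),
    }


# Made-up sample records, for illustration only.
sample = [
    {"id": "a", "lat": 51.5, "lon": -0.18},
    {"id": "b", "lat": None, "lon": None},
    {"id": "c", "lat": 45.5, "lon": -73.6},
]
print(summarise(sample))
print(summarise(sample, bbox=(-10.0, 49.0, 2.0, 61.0)))
```

Records without coordinates drop out as soon as a map filter is applied, which is why the interface reports the total and the georeferenced count separately.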
    • Portal overview: some example data portals & software
      Data.gov & CKAN
      • UK government data portal
      • Uses CKAN, open-source data portal platform
      • Used by national & regional governments
      • Links into Drupal, DataCite & NHM systems
      • http://data.gov.uk & http://ckan.org/
      Canadensys & CartoDB
      • Canadian network of biodiversity collections
      • Almost 1 million specimens, 18 datasets
      • Uses CartoDB mapping solution
      • Create dynamic maps, analyze and build location-aware and geospatial applications
      • Widely used, cloud data storage, PostGIS
      • http://data.canadensys.net & http://cartodb.com/
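CKAN exposes dataset search through its action API (`package_search`), so a client can discover datasets with a plain HTTP request. The sketch below only builds the request URL and reads titles from an already decoded response; the base URL is a placeholder, the helper names are made up, and it makes no network call. Whether any particular portal exposes this endpoint is an assumption.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10, start: int = 0) -> str:
    """Build a CKAN action API URL for a dataset search."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response: Dict[str, Any]) -> List[str]:
    """Extract dataset titles from a decoded package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    # Fall back to the dataset's URL name when no title is set.
    return [pkg.get("title") or pkg["name"] for pkg in response["result"]["results"]]


# Placeholder host; substitute the CKAN instance being queried.
print(package_search_url("http://example.org/", "herbarium", rows=5))
```

The response passed to `dataset_titles` is the JSON body decoded into a dictionary, for example with `json.loads`.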
    • Portal development: timeline & resources
      Year 1 – Dataset discovery
      • Technical & functional specification (Vizz. subcontract)
      • Data workflows (KeEMu & research datasets)
      • Functional alpha prototype (CKAN)
      Year 2 – Visualisation
      • Mapping & statistical functionality (CartoDB)
      • Social and annotation functions
      • Stable beta release at http://data.nhm.ac.uk
      Year 3 – Citation & analysis
      • DataCite DOIs on datasets & specimens
      • Initial Web analytical functions (R)
      • Initiative sub-portals including Virt. Herbarium
      Resources
      • 1x Developer (Ben Scott) for 3 years
      • Vizzuality subcontract (circa £xxk - TBC)
      • ICT capital, travel & software (circa £25k)
    • Portal consultation: feedback & next steps
      Documentation
      • Overview specification - http://goo.gl/qjioh
      • Project Initiation Document - http://goo.gl/oRr2j
      Initial stakeholder meetings (Feb. – May)
      • ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
      • Darrell Siebert and the KE EMu user group
      • NHM Collections Committee & Initiative leaders
      • Kew Gardens & Virtual Herbarium Reps.
      • GBIF, NBN, UK DataCite team at BL, NERC
      • Digital Facility Team
      • Vizzuality
      Wider consultation
      • Example data types / sets
      • Specialist search options & vocabularies
      • Specialist Earth Science needs
      Feedback & links
      • Slides:
      • Feedback: vince+portal@vsmith.info
      • Specification: http://goo.gl/qjioh
      • PID: http://goo.gl/oRr2j
    • Two more things
      Wikipedian in Residence
      • Four month post with Science Museum
      • Starting March / April
      • Work with NHM staff to improve Wikipedia
      • Run events with NHM staff & volunteers
      • Work with the GLAM group at Imperial College
      • Focus on NHM science themes & specimens
      • Not about promotion of “The NHM”
      Biodiversity Informatics Workshop – May 2013
      • One full day - date TBC
      • Outputs from ViBRANT & e-Monocot
      • Includes Scratchpads & the Biodiversity Data Journal
      • What we do, how it's used and where we are going
      • Includes links to NHM informatics & digitisation initiatives
    • Portal overview – data citation: unique identifiers for datasets & specimen records
      Why cite data
      • URLs are not persistent, circa 40% decay (e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5)
      • Measure our digital footprint
      • Puts research data on par with articles
      • Facilitates data mining
      What gets an identifier
      • NHM specimen records (suffix of NHM IDs)
      • NHM research datasets (files)
      • Insert into publications
      • Example (as shown on the slide): http://dx.doi.org/BMNH_ PBI_00388325
      How to cite data
      • Digital Object Identifiers (DOIs)
      • Widely used & understood on articles
      • Operates in collaboration with DataCite
      • Part of an international consortium
      • Mixes NHM data with other domains
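A DOI has the form prefix/suffix, and the slide proposes using NHM identifiers as the suffix. The sketch below shows how such an identifier might be turned into a resolvable link. It is an assumption-laden illustration: `10.XXXX` is a made-up placeholder prefix (the slides do not give one), the function name is hypothetical, and the resolver address is the one shown on the slide.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address as shown on the slide


def doi_url(prefix: str, suffix: str) -> str:
    """Join a DOI prefix and an identifier suffix into a resolver URL."""
    if not prefix.startswith("10."):
        raise ValueError("DOI prefixes start with '10.'")
    if not suffix:
        raise ValueError("a DOI needs a non-empty suffix")
    # Percent-encode characters that are not safe in a URL path.
    return f"{RESOLVER}{prefix}/{quote(suffix, safe='/:;()')}"


# '10.XXXX' is a placeholder, not a registered prefix.
print(doi_url("10.XXXX", "PBI_00388325"))
```

Because the suffix is just the existing specimen or dataset identifier, the same function would cover both kinds of record listed under "What gets an identifier".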