NHM Data Portal: first steps toward the Graph-of-Life

Informatics Research Leader at the Natural History Museum, London
Aug. 5, 2016
  1. NHM Data Portal: first steps toward the Graph-of-Life Vince Smith, Ben Scott & Ed Baker Informatics & Digital Collections Group, NHM London SPNHC, Berlin, 23 June 2016
  2. NHM Collection
     Collection area     No. of objects  No. of type specimens  Physical register  Digital data
     Palaeontology       6,919,207       43,146                 2,364,232          340,636
     Mineralogy          423,563         615                    425,000            402,727
     Botany              5,863,000       172,750                127,200            645,222
     Entomology          33,753,257      612,796                57,197             255,000
     Zoology             27,501,350      325,000                1,986,000          1,160,216
     Library & archives  5,460,000       -                      -                  -
     TOTAL               79,920,377      1,154,307              4,959,629          2,803,801
     <3% of NHM specimens are digitised, & even fewer are ‘computable’
  3. Digital Science at the NHM • Citizen science • Big, open, linked data • High-throughput digitisation • Data portal and tools • Text mining • Robotics
  4. Digital Science at the NHM • Citizen science • Big, open, linked data • High-throughput digitisation • Data portal and tools • Text mining • Robotics
  5. NHM Digital Collections Access, pre-2015 • Developed with the best of intentions, but… • 23 separate interfaces • Hard to find, cite, access and integrate • No maps, few images, slow, no statistics, no export, few updates, no authors, no citation mechanisms, no GBIF connection
  6. NHM Data Portal • Discovery of NHM collections & research data • Easy access & reuse to promote collaboration (website, API, R-package, RDF & direct download) • 3.7m records, >1m images (+sound, video & 3D) • Integrates with our collection management system (weekly) & DAM system (for images) • Traffic light data quality indicators • Stable, citable (DataCite) identifiers on datasets & GUIDs on records to measure impact • Technically sustainable & scalable • Default open licensing (CC-Zero, CC-BY, CC-BY-NC) http://data.nhm.ac.uk
  7. CKAN – the technical foundation for the portal • Enterprise, open source data portal platform • Developed by Open Knowledge Foundation • Used by 31 national governments, 74 regional authorities, academia & large commercial organisations • Key features o Publish & find datasets o Store & manage large data o Robust API o Customise & extend o Sustainable http://ckan.org/ e.g. http://data.gov.uk/
  8. Primary views of each NHM dataset: Point map, Grid map, Heat map, Statistical overview, Filterable table
  9. Dataset & data record citation • DataCite DOIs on every dataset • Stable URI (UUID) on every record • Prior identifiers aliased & disambiguated • Citation encouraged with clear statements at dataset & record level • Allows us to track cited usage • Dynamic DOIs on subsets coming soon Dataset DOI Specimen URI
  10. Traffic-light data quality indicators (via GBIF) Via GBIF API Major errors Minor errors No errors nb. similar services offered by CRIA for Brazilian data
  11. Potential errors highlighted & “corrected”
  12. Assembly Video doi: 10.3897/zookeys.481.8788 Step-by-step instructions Supports deposition of other research datasets
  13. Easy addition of new datasets (rapid & semi-automated) 1. Name the dataset 2. Upload / link the data file 3. Describe the data file 4. Theme & tag 5. Add additional resources 6. Temporal coverage 7. Geographic coverage 8. Save & finish
  14. Data access & feedback Extensive API R integration Link to data curator team DwCA Downloads RDF (Linked Open Data)
  15. Serving external data aggregators GBIF iDigBio EOL Vertnet CRIA
  16. Data visualisations driven by API DEMO DEMO DEMO
  17. Records downloaded: 500,000,000 (since Feb. 2015, excluding major aggregators)
  18. Data access & feedback Extensive API R integration Link to data curator team DwCA Downloads RDF (& Linked Open Data)
  19. Tim Berners-Lee, the inventor of the Web and Linked Data initiator, suggested a 5-star deployment scheme for Open Data… What does a 5-star Data Portal mean?
  20. LOD gives us the means to connect our data (i.e. graph queries across distributed datasets)
  21. Top 200 collection-holding institutions contributing specimen records to GBIF Example 1: “what data are we publishing” • What proportion of our collections are accessible / digitised? • What biases exist in our digitised collections? • How much taxonomic redundancy exists in our collections? Useful for policy setting: - Planning digitisation strategies (why should we all be digitising the same taxa first?) - Identifying institutional collection strengths (outside our community these are often not known) - What is ‘unique’ in our collections (taxonomically, geospatially, temporally) - Disaster planning (how many institutions hold the same material?)
  22. What collections are held globally? Where are these specimens from? There are huge gaps and biases in what our collections hold & where the specimens are from Top 200 collections (scaled by size) Specimen country origin (darker is more)
  23. Our results are very incomplete, constrained by what we’ve digitised Size of collection Proportion digitised RBGE RBGK NHM MNHN RMCA RBINS Very small proportions of our collections are digitally accessible We don’t publish the overall size of our collections in a machine readable way
  24. Example 2: exploring ecological interactions • Specimen data is one dimension of our collections • We need to know how organisms interact, e.g. predator-prey, pollinator-pollinated, host-parasite • Museums have lots of this data NHM Interactions data: • Louse-host (12,000+) • Helminth host-parasite (250,000+) • Also large datasets: Coleoptera feeding on dipterocarp seeds, butterfly host-plants, British mammal-flea associations, bee flower pollinators, several parasitic wasp datasets, … Increasingly published as RDF via NHM Data Portal
  25. Global Biotic Interactions (GloBI) Database • By Jorrit Poelen & colleagues • Collates interaction datasets • Currently >1.9M interactions • EOL pulls these into Species Pages • NHM Portal creates a combined dataset to feed GloBI • Produces Linked Open Data – Create beautiful visualisations http://www.globalbioticinteractions.org/
  26. GloBI’s Interaction Browser • Predatory interactions for Eurythenes gryllus • Visualisations highlight number, frequency & type of interaction https://blog.globalbioticinteractions.org/2014/03/21/exploring-antarctic-interactions-using-globis-interaction-browser/
  27. Create beautiful visualisations with custom R scripts and existing libraries (e.g., igraph, Reol, rgdal) https://blog.globalbioticinteractions.org/2014/06/06/a-food-web-map-of-the-world/
  28. Conclusions • Data portals like the NHM Portal allow us to contribute and reflect our data through the lens of specialist aggregators • GBIF & GloBI are specialist aggregators serving LOD • LOD allows us to combine big datasets to address new questions – Tracking interactions & distribution of disease vectors – Predicting crop pests, via the distribution and interactions of pests of crop wild relatives Next Steps • Continue Portal development & encourage institutional adoption • Consolidate NHM ecological interaction datasets • Publish combined dataset on the NHM Data Portal • GloBI to harvest the dataset and publish linked open data • Develop visualisations for key NHM datasets
  29. Acknowledgements Ben Scott – Portal Engineer & Architect Ed Baker – Data Researcher Laurence Livermore - Project Management Matt Woodburn – Data Architect Vince Smith – SRO / Coordinator
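
The API access mentioned on slides 6, 7 and 14 can be sketched roughly as follows. This is a minimal illustration, assuming the portal exposes the standard CKAN Action API with the DataStore extension enabled (the slides say only that the portal is built on CKAN and has an API). The `RESOURCE_ID` value, the search term and the field names in the canned response are hypothetical placeholders, and no network request is made.

```python
import json
from urllib.parse import urlencode

PORTAL = "http://data.nhm.ac.uk"  # portal address given on slide 6
# Hypothetical placeholder: a real resource id must be looked up on the portal.
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"


def datastore_search_url(resource_id, query=None, filters=None, limit=10):
    """Build a URL for CKAN's `datastore_search` action (assumed to be enabled)."""
    params = {"resource_id": resource_id, "limit": limit}
    if query:
        params["q"] = query  # full-text search term
    if filters:
        # CKAN expects the exact-match filters as a JSON-encoded object.
        params["filters"] = json.dumps(filters, sort_keys=True)
    return f"{PORTAL}/api/3/action/datastore_search?{urlencode(params)}"


def extract_records(response_text):
    """Return the record list from a CKAN response body, or raise on failure."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "CKAN request failed"))
    return payload["result"]["records"]


url = datastore_search_url(RESOURCE_ID, query="Pediculus", limit=5)

# Canned response in the standard CKAN envelope; the field names are made up.
sample = (
    '{"success": true, "result": {"records": '
    '[{"_id": 1, "genus": "Pediculus"}], "total": 1}}'
)
records = extract_records(sample)
```

In real use the URL would be fetched with any HTTP client and the response body passed to `extract_records`; keeping URL building and parsing separate makes both easy to check offline.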

Editor's Notes

  1. Age of enlightenment - Linking historical specimens & early scientific literature (cultural) Crop wild relatives - Ranges and ecology of crop pest relatives (metadata) Informatics - Digitisation workflows, data access & tools (digital) Environmental change - Phenology of butterflies (pinned insects) Macroscience from micro-collections - Vectors, ontology & minerals (slides) Open herbarium - Global plant diversity (herbarium sheets)
  2. Age of enlightenment - Linking historical specimens & early scientific literature (cultural) Crop wild relatives - Ranges and ecology of crop pest relatives (metadata) Informatics - Digitisation workflows, data access & tools (digital) Environmental change - Phenology of butterflies (pinned insects) Macroscience from micro-collections - Vectors, ontology & minerals (slides) Open herbarium - Global plant diversity (herbarium sheets)
  3. Hard to track use. A few are beginning to cite in papers, but rates are low.
  4. So what does linked mean for us and what are the benefits? As a consumer, you can do all that you can do with ★★★★ Web data and additionally: ✔ You can discover more (related) data while consuming the data. ✔ You can directly learn about the data schema. ⚠ You now have to deal with broken data links, just like 404 errors in web pages. ⚠ Presenting data from an arbitrary link as fact is as risky as letting people include content from any website in your pages. Caution, trust and common sense are all still necessary. As a publisher … ✔ You make your data discoverable. ✔ You increase the value of your data. ✔ Your own organisation will gain the same benefits from the links as the consumers. ⚠ You’ll need to invest resources to link your data to other data on the Web. ⚠ You may need to repair broken or incorrect links.
  5. A GIANT GRAPH
  6. Back in April of this year, the National Museum of Natural History in New Delhi was destroyed by fire. Large collections of mammals and birds were lost in that fire, but in truth, as a community, it is hard to assess the real impact of the loss because we don’t have a global perspective on what is in our collections. This information is only held locally.
  7. NARRATIVE: Bias in data publishers and collections. For specimen data only for top 200 publishing institutions (out of XXX specimen data publishers): Represents a total of XXX Dots: Institutions publishing specimen data to GBIF – scaled by size Background: countries specimens come from – darker is more
  8. NARRATIVE: Even the data available is very incomplete. E.g. NHM London (outer London dot) and Kew (inner London dot) combined. (Other dot is RBGE). In general not much! Circle = scaled by stated collection size. Black: proportion exposed via GBIF.
  9. Developed by Jorrit Poelen (freelance software engineer)
  10. Allows us to generate visualisations that show major interaction patterns across all interactions Here is an example: Green: plants; pink: parasitic fungi Potential Uses: Guide conservation: should ecologically unique interactions be identified and prioritized for conservation?
  11. Allows us to generate visualisations that show major interaction patterns across all interactions Here is an example: Green: plants; pink: parasitic fungi Potential Uses: Guide conservation: should ecologically unique interactions be identified and prioritized for conservation?
  12. Developed by Jorrit Poelen (freelance software engineer)
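
The idea on slide 20 and in note 4, that linked data allows graph queries across separately published datasets, can be illustrated with a toy sketch. It assumes nothing about the portal's actual RDF vocabulary: the identifiers, predicates and triples below are all invented for illustration, and a plain Python list stands in for a real triple store.

```python
# Two made-up "datasets" as (subject, predicate, object) triples.
# All identifiers and predicate names are hypothetical.
SPECIMENS = [
    ("specimen:1", "heldBy", "inst:X"),
    ("specimen:1", "identifiedAs", "taxon:louseA"),
    ("specimen:2", "heldBy", "inst:Y"),
    ("specimen:2", "identifiedAs", "taxon:louseB"),
]
INTERACTIONS = [
    ("taxon:louseA", "parasiteOf", "taxon:birdA"),
    ("taxon:louseB", "parasiteOf", "taxon:birdB"),
]


def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in graph
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


def hosts_documented_by(institution, graph):
    """Follow heldBy -> identifiedAs -> parasiteOf links across the merged graph."""
    hosts = set()
    for specimen, _, _ in match(graph, p="heldBy", o=institution):
        for _, _, taxon in match(graph, s=specimen, p="identifiedAs"):
            for _, _, host in match(graph, s=taxon, p="parasiteOf"):
                hosts.add(host)
    return sorted(hosts)


# Because both datasets share taxon identifiers, merging them is just
# concatenation, and a single query can then span both sources.
graph = SPECIMENS + INTERACTIONS
hosts_for_x = hosts_documented_by("inst:X", graph)
```

The query only works because the specimen and interaction triples use the same identifier for each taxon; that shared identifier is the "link" in linked data.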