Digitised collections: Toward a digital strategy for the NHM, London

Presented by Vince Smith at the pro-iBiosphere meeting in Berlin, 21-23 May 2013.

  1. 1. Digitised collections: Toward a digital strategy for the NHM, London
      Vince Smith
      Workshop 3, pro-iBiosphere, Berlin, 23 May 2013
  2. 2. Digital Ambition: NHM Science Strategy 2013-2017
      A New Voyage of Discovery
      Three Focal Areas
      1. Scientific discovery
      2. Scientific infrastructure
      3. Scientific engagement
      Five Challenges
      1. The Digital NHM
      2. Origins, evolution & futures
      3. Biodiversity discovery
      4. Natural resources & hazards
      5. Science, society & skills
      Resources & funding
      Measuring success
  3. 3. data.nhm.ac.uk/globe/
  4. 4. Digital Ambition: NHM Science Strategy 2013-2017
      A New Voyage of Discovery
      Three Focal Areas
      1. Scientific discovery
      2. Scientific infrastructure
      3. Scientific engagement
      Five Challenges
      1. The Digital NHM
      2. Origins, evolution & futures
      3. Biodiversity discovery
      4. Natural resources & hazards
      5. Science, society & skills
      Resources & funding
      Measuring success
      • Scientific impact: 1,000 papers in leading journals
      • Digital access: 20M specimens available digitally
      • Engagement: 1M face-to-face engagements
      • Collections: Globally important collections
      • Diagnostic tools: Diagnostic tools for key groups
      • Deep time: Timeline of key transitions
      • Science & society: Articulate the role of science
      • UK network: Act as a national museum
      • Earth sciences: Earth Sciences Centre
      • Funding: £10M for Five Challenge Areas
  5. 5. Overview
      1. Existing digital content, sources & formats
         • Research data
         • Collections data
      2. Making collections data digital
         • Priorities
         • Protocols & pathfinder activities
         • Crowdsourcing transcription
      3. Aggregation & delivery
         • The NHM data portal
         • Data visualisation, data sub-portals
      4. Identifiers, links & interoperability
         • DataCite DOIs
         • Third party aggregators
         • Portal APIs, download & analytical functions
      5. Timeline & constraints
         • Data policies
         • Next steps
      Diagram labels: Digitisation activities; Data portal
  6. 6. NHM Research Outputs
      One month of NHM Science Group papers
      Data via Carolyn Lowry, e-mail, 13 Feb. 2013
      • 49 papers, 45 available online (4 print only or behind paywalls)
      • 9 had supplementary data files
      • 39 papers with tables, charts & other data
        o >1000 sequences
        o 826 figures
        o 76 tables
        o 1 genome
      • No collective view of these data (37 journals)
      • No consistent way of citing NHM data
      • No consistent mechanism to access data
      • Effectively invisible at the institutional level
      Section: 1. Existing digital content
  7. 7. NHM Collections Outputs: data
      • Huge investment in NHM collection management system
      • ≠ Imaging
      • Most research projects need spatio-temporal records
      • Different requirements for different purposes
      NHM COLLECTIONS, April 2013
      Collection area   Est. no. of specimens   No. records in database   % collection in database   % records with location info
      Botany                        6,000,000                   626,000   ~10%                       96%
      Entomology                   32,000,000                   316,000   <1%                        68%
      Mineralogy                      500,000                   422,000   ~95%                       79%
      Palaeontology                 9,000,000                   342,000   ~3%                        89%
      Zoology                      28,000,000                 1,131,000   ~60% (via lots)            69%
      TOTAL                        76,000,000                 2,837,000   3% (23%)
      Section: 1. Existing digital content
  8. 8. NHM Collections Outputs: images
      • Many, many imaging projects (highly fragmented)
      • Circa 40 TB for major collections (excluding library)
      • 120,000 images in KE EMu (many others not in KE!)
      • Circa 250,000 via NHM Photo unit (limited metadata)
      Collection area   No. image files   Disk space (GB)
      Botany                    140,133            35,302
      Entomology                529,106             3,172
      Mineralogy                 14,000                 6
      Palaeontology             122,548               993
      Zoology                    12,975             1,598
      TOTAL                     818,762            41,070
      Section: 1. Existing digital content
  9. 9. Current data formats
      • Darwin Core Archive (DwCA) & extensions (collections)
      • Circa 2020 fields mapped to 50 fields to generate archive
      • Images mainly JPG & TIFF
      • Metadata using EML & Genesis II standard
      • Research data files in a wide array of formats (blob files)
        o Nexus (character data and Newick formatted phylogenetic trees)
        o PhyloXML (an XML standard for representing phylogenetic trees)
        o NeXML (an XML standard for representing character data)
        o Sequence trace files (.scf sequence chromatogram format files)
        o Taxon checklists (as Darwin Core Archive files)
        o Non-NHM specimen lists (as Darwin Core Archive files)
        o Output from the Imaging and Analysis Centre (Micro CT datafile formats)
        o Collections of images from digitisation projects (as a collection of links or a zipped archive)
        o Environmental sequence files
        o Collection level descriptions
      Section: 1. Existing digital content
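The field mapping mentioned on this slide (many internal database fields reduced to a small set of Darwin Core fields) might look roughly like the sketch below. The internal field names on the left of `FIELD_MAP` and the example record are hypothetical; only the Darwin Core term names on the right are real. The sketch writes just the occurrence table and leaves out the `meta.xml` descriptor that a full Darwin Core Archive also needs.

```python
import csv
import io

# Hypothetical internal field names (left) mapped to real Darwin Core terms (right).
FIELD_MAP = {
    "RegNumber": "catalogNumber",
    "TaxonName": "scientificName",
    "CollectionDate": "eventDate",
    "Lat": "decimalLatitude",
    "Lon": "decimalLongitude",
    "Country": "country",
}


def to_darwin_core(record):
    """Keep only mapped fields and rename them to Darwin Core terms."""
    return {dwc: record[src] for src, dwc in FIELD_MAP.items() if src in record}


def write_occurrence_csv(records):
    """Serialise records as a CSV occurrence table with Darwin Core column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(to_darwin_core(record))
    return buffer.getvalue()


example = {
    "RegNumber": "EXAMPLE-0001",      # invented specimen, for illustration only
    "TaxonName": "Pieris brassicae",
    "CollectionDate": "1901-06-14",
    "Lat": "51.50",
    "Lon": "-0.18",
    "Country": "United Kingdom",
    "InternalNote": "not exported",   # unmapped fields are dropped
}
print(write_occurrence_csv([example]))
```

Keeping the mapping in one dictionary means the exported column set can be reviewed and changed without touching the export logic.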
  10. 10. Digitisation Priorities
      • Priorities linked to science strategic priorities
        o Disease, sustainability, crop wild relatives, pests etc.
      • Tiered approach, different needs for different collections
      • Low hanging fruit (2D objects, e.g. herbarium sheets & slides)
      Section: 2. Making collections data digital
  11. 11. Digitisation Priorities
      • Priorities linked to science strategic priorities
        o Disease, sustainability, crop wild relatives, pests etc.
      • Tiered approach, different needs for different collections
      • Low hanging fruit (2D objects, e.g. herbarium sheets & slides)
      • Linked to strategic collaborations & financial opportunities
        o e.g. RBG Kew, RBG Edinburgh, Nat. Mus. Wales, Hunterian etc.
      • Priorities dictate order: we plan to do it all (eventually)!
      Section: 2. Making collections data digital
  12. 12. Digitisation Protocols
      • Exercise to develop digitisation protocols across the collection
        o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
      • Protocols mapped to high level collections descriptions
      • Workflow software supporting rapid digitisation (to KE & DAMS)
      Section: 2. Making collections data digital
  13. 13. Digitisation Protocols
      • Exercise to develop digitisation protocols across the collection
        o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
      • Protocols mapped to high level collections descriptions
      • Workflow software supporting rapid digitisation (to KE & DAMS)
      • Pathfinder activities for less well understood projects
        o Entomological dry material (30 M specimens)
          - iCollections (specimen-by-specimen) approach
          - SatScan (drawer level multi-specimen) approach
      Section: 2. Making collections data digital
  14. 14. iCollections Initiative
      • Specimen-by-specimen, traditional, dedicated 6 person team
      • Digitising British Isles Lepidoptera collection
      • ~500,000 specimens, 5,000 drawers
      • Re-curation & specimen imaging
      • Complete label information including georeferencing
      • For use in Climate Change initiative
      Section: 2. Making collections data digital
  15. 15. iCollections Initiative
      • 4-6 people over 3 years, work broken into small tasks by teams
      • Average imaging rate 163 specimens per person per day
      • Averaging >3 min per specimen (prep., imaging & databasing)
      • >£1/specimen
      • BUT: 6,800 person years for the entire collection
      Section: 2. Making collections data digital
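As a rough check on the iCollections figures, the effort estimate can be sketched as specimens divided by the daily rate and by the working days in a year. The 220 working days per year is an assumption, not a figure from the slides. The slide's 6,800 person-year estimate for the entire collection depends on inputs the transcript does not give, so it is not reproduced here.

```python
def person_years(specimens, rate_per_person_day, working_days_per_year=220):
    """Effort in person-years to digitise `specimens` at a fixed daily rate.

    working_days_per_year defaults to 220, an assumed value.
    """
    person_days = specimens / rate_per_person_day
    return person_days / working_days_per_year


# British Isles Lepidoptera: ~500,000 specimens at 163 specimens per person per day.
lepidoptera = person_years(500_000, 163)
print(f"{lepidoptera:.1f} person-years")
```

Under that assumption the result falls inside the 12 to 18 person-years implied by 4-6 people working for 3 years.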
  16. 16. SatScan Initiative
      • Drawer level digitisation, segmented down to specimens
      • Very fast imaging, no specimen handling, just one view
      • No label information, but some data extracted from drawer
      • Specimens retrospectively cropped & annotated
      Section: 2. Making collections data digital
  17. 17. SatScan Initiative
      • Drawer level digitisation, segmented down to specimens
      • Very fast imaging, no specimen handling, just one view
      • No label information, but some data extracted from drawer
      • Specimens retrospectively cropped & annotated
      Section: 2. Making collections data digital
  18. 18. SatScan Initiative
      • Dedicated specimen-level rapid annotation software
      Section: 2. Making collections data digital
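The idea of imaging a whole drawer and cropping individual specimens afterwards can be illustrated with a toy example. This is not the SatScan software: the "image" is a list of strings standing in for pixel rows, and the bounding boxes and specimen names are hypothetical, as if drawn by an annotator after imaging.

```python
def crop(image, box):
    """Return the sub-image inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# A toy 4x6 "drawer scan"; each character stands in for a pixel.
drawer = [
    "AA..BB",
    "AA..BB",
    "......",
    "CC....",
]

# Hypothetical bounding boxes, one per specimen in the drawer.
boxes = {
    "specimen-1": (0, 0, 2, 2),
    "specimen-2": (0, 4, 2, 6),
    "specimen-3": (3, 0, 4, 2),
}

specimens = {name: crop(drawer, box) for name, box in boxes.items()}
for name, pixels in specimens.items():
    print(name, pixels)
```

Because the crops are derived from one stored drawer image, the boxes can be revised later without re-imaging or handling the specimens.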
  19. 19. Crowdsourcing & Transcription
      • We have a massive transcription problem
      • Experiments via Notes-from-Nature (a Zooniverse project)
      • Transcribing the NHM ornithological accession registers
      • Wikimedian in Residence (Wikisource transcription)
      • 4-month project, including specimen label transcription
      Section: 2. Making collections data digital
  20. 20. NHM Data Portal
      data.nhm.ac.uk
      • A focus for deposition and discovery of major NHM data sets
      • Promote innovation through re-use of museum data
      • Open Access, at a dedicated subdomain of the NHM website
      • Started Jan. 2013 (3 years), consultation throughout 2012
      Diagram: Functional components of the data portal
      Section: 3. Aggregation & Delivery
  21. 21. NHM Data Portal: Registry
      • Dataset registry, for dataset discovery, modelled on data.gov.uk
      • Uses CKAN, an open-source data portal software platform
      Screenshot labels: Search; Datasets matching criteria; Individual dataset; Results; Browse & search criteria; Advanced display options
      Section: 3. Aggregation & Delivery
  22. 22. NHM Data Portal: Registry
      • Dataset metadata discovery
      Screenshot labels: Metadata about the dataset; Name; Geographic scope; Tags; “Social”; Authors; License; Download; Developer tools; Technical info. (extracted from data file)
      Section: 3. Aggregation & Delivery
  23. 23. NHM Data Portal: Dataset upload
      • Simple dataset upload workflow for non-collections data
        1. Name the dataset
        2. Upload / link the data file
        3. Describe the data file
        4. Theme & tag
        5. Add additional resources
        6. Temporal coverage
        7. Geographic coverage
        8. Save & finish
      Section: 3. Aggregation & Delivery
  24. 24. NHM Data Portal: Data visualisation
      • Dedicated interface to visualise & explore major datasets
      • Focused on collections data, based on Canadensys.net, uses CartoDB
      Screenshot labels: Zoomable map; Applied filters; Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records
      Section: 3. Aggregation & Delivery
  25. 25. NHM Data Portal: Data visualisation
      Screenshot labels: Collections views; Statistical summary; Specimen record views; Data field mappings; Summary preview; Full record; Tables; Download
      Section: 3. Aggregation & Delivery
  26. 26. NHM Data Portal & DataCite
      • Using DataCite DOIs in the data portal
        o datasets (2014) & specimens (2015)
      • Unique, persistent and resolvable identifiers
      • Easy to cite, alias existing specimen identifiers
      • Conform to minimum DataCite requirements
        o Landing page, min. metadata standard, fee, min. 10 yr. contract, DOI (pre)fixes
      • Breaks us out of the biodiversity data silo
      Section: 4. Identifiers, links & interoperability
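A minimal sketch of what "alias existing specimen identifiers" and "resolvable" could mean in practice: an existing local identifier becomes the suffix of a DOI, and the DOI resolves through the doi.org proxy. The prefix `10.5072`, the suffix scheme, and the example identifier are placeholders for illustration, not the NHM's actual prefix or policy.

```python
import re
from urllib.parse import quote

# Loose shape check for a DOI: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
PLACEHOLDER_PREFIX = "10.5072"  # placeholder, not the NHM's actual DOI prefix


def mint_doi(local_id, prefix=PLACEHOLDER_PREFIX):
    """Alias an existing identifier as a DOI suffix (hypothetical scheme)."""
    suffix = re.sub(r"\s+", "-", local_id.strip()).lower()
    doi = f"{prefix}/{suffix}"
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return doi


def resolver_url(doi):
    """URL at which the doi.org proxy redirects to the DOI's landing page."""
    return "https://doi.org/" + quote(doi, safe="/")


doi = mint_doi("EXAMPLE 123456")  # invented specimen identifier
print(doi)
print(resolver_url(doi))
```

Deriving the suffix from the existing identifier keeps the two in step, so a citation by DOI can still be traced back to the catalogue number.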
  27. 27. Data Aggregation, APIs & download
      • Content within the NHM data portal will be highly accessible
        o Collections harvestable (e.g. by GBIF as a DwCA)
        o Download DwCAs on any search facet
        o Wide set of APIs available for datasets (part of CKAN)
      • Sub-portals (selected content, themed by topic)
        o e.g. Virtual Herbarium, NHM Science initiatives, geographic regions
      • Analytical interface planned for 2015 (but not specified)
      Section: 4. Identifiers, links & interoperability
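Since the slides say the dataset APIs come with CKAN, a dataset search would presumably go through CKAN's Action API. The sketch below only builds the request URL for the standard `package_search` action and makes no network call. The `https` scheme and the assumption that the portal exposes this endpoint at its root are not confirmed by the slides.

```python
from urllib.parse import urlencode

# Portal address from the slides; the https scheme is an assumption.
PORTAL = "https://data.nhm.ac.uk"


def package_search_url(query, rows=10, base=PORTAL):
    """Build a CKAN Action API `package_search` request URL (no request is sent)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


url = package_search_url("lepidoptera", rows=5)
print(url)
```

The response to such a request is JSON, so the same URL works from a browser, a script, or a third-party aggregator.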
  28. 28. Data Policies & Next Steps
      • Data portal will be “open-by-default”
      • Ambiguity in what this means & top down schizophrenia
      • Conflicting mandates on open access & revenue opportunities
      • Lots of guidance available, will use to form a common policy
      • A cross institutional policy would be useful (but challenging)
      Section: 5. Timeline & constraints
  29. 29. Data Policies & Next Steps
      NHM Data portal timeline (axis: Jan 2013, Jan 2014, Jan 2015, Jan 2016)
      Milestones: Project start; Private alpha; Stable public beta; Full release & sub-portals
      Phases: Requirements & dataset discovery; Internal feedback, data visualisation & DOIs; Sub-portals & analytical tools
      Next 6 months
      • More documentation (PID and Tech Spec)
      • Consultation and advocacy (internal and external)
      • Data mapping from KE EMu and software testing
      • Development
        o website wireframe design
        o drafting data visualisation subcontract
        o construction of private alpha release
      Section: 5. Timeline & constraints
  30. 30. Data Policies & Next Steps
      NHM digitisation timeline (axis: Jan 2013, 2014, 2015, 2016, 2017, 2018)
      Timeline labels: Project start; Path-finding & programme development; Private alpha; Stable public beta; Major funding applications & a new gallery?; Digitise… Digitise… Digitise…; 20 Million!!
      Next 6 months
      • Initial conclusions from path-finding digitisation activities
      • Initial grant funding bids developed
      • Advocacy, outreach & development of a digitisation “programme”
      • Investigate possibilities for gallery development
      • Develop crowdsourcing strategy
      Section: 5. Timeline & constraints
  31. 31. QUESTIONS
  32. 32. Digitisation Priorities
      • Priorities linked to science strategic priorities
        o Disease, sustainability, crop wild relatives, pests etc.
      Chart: Crop Wild Relatives (accepted taxa only); axis 0 to 700
      Section: 2. Making collections data digital
  33. 33. Digitisation Priorities
      • Priorities linked to science strategic priorities
        o Disease, sustainability, crop wild relatives, pests etc.
      • Tiered approach, different needs for different collections
      Nick Poole, UK Collections Trust
      Section: 2. Making collections data digital
