Digitised collections: Toward a digital strategy for the NHM, London

Presented by Vince Smith at the pro-iBiosphere meeting in Berlin, 21-23 May 2013.

  • The NHM has a huge amount of digital ambition. As an institution we have a new science strategy taking us to 2017, and “digital” as a concept runs through almost every aspect of that strategy. Just to underscore this, we put it on the front cover of the strategy.
  • This visualisation shows the 400k specimens across all the departments in the NHM that have good geo-locative data, and the length of each line corresponds to the collecting effort in that spot. It's not the most informative visualisation, but the intention is that this globe will grow with more points over time as we digitise. In fact it's going to have to grow a lot over the next 5 years.
  • Our science strategy commits us to digitising 20M specimens over the next five years. This will involve an enormous ramping up of effort, given that at the moment we only have about 2.8M records.
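To make the scale of that commitment concrete, here is a minimal arithmetic sketch using only the two figures quoted above (20M target, about 2.8M existing records). The even, straight-line spread over five years is an assumption made purely for illustration, not a plan stated in the talk.

```python
# Figures quoted in the speaker notes; the linear ramp is an illustrative assumption.
TARGET_RECORDS = 20_000_000
CURRENT_RECORDS = 2_800_000
YEARS = 5


def required_annual_rate(target: int, current: int, years: int) -> float:
    """New records needed per year if the gap were closed at a constant pace."""
    return (target - current) / years


remaining = TARGET_RECORDS - CURRENT_RECORDS
rate = required_annual_rate(TARGET_RECORDS, CURRENT_RECORDS, YEARS)

print(f"Records still to create: {remaining:,}")
print(f"Average new records per year: {rate:,.0f}")
```

In practice the ramp would be back-loaded rather than flat, since the later slides describe a path-finding phase before large-scale digitisation begins.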
  • So my talk today is really about how we are going to ramp up to achieve this 20M figure, and I have structured this presentation according to the points that I was asked to speak on. So first off I’ll say a little about the digital content that we already have; then how we are creating new digital content; how we are delivering that content; how we are going to link that content up; and finally what the timeline is for doing this work. And I’m going to focus on the digitisation activities and the data portal since these are the parts of this work that I am most closely associated with…
  • So first off then, what digital content do we (as an institution) already have? Well, in the context of our research this is best represented by the papers we publish. On average the NHM produces about 50 papers a month, and about 80% of these have a significant amount of digital data associated with them. However, this content is mostly invisible to the institution. It's only accessible through the papers.
  • Showing all geotagged specimens on a map. You can click one of the specimen records to get an overview of the record. Then click through to see the full record.
  • Shows all the information related to the record. You can also click through to see the data mapped to Darwin Core fields.
  • DataCite
  • Aggregation and access
  • Open by default
  • Data portal timeline
  • Digitisation timeline
  • Questions
  • Example of how we set digitisation priorities
  • The choice of what digitisation granularity we need is linked to the outcome for the data.

Transcript

  • 1. Digitised collections: Toward a digital strategy for the NHM, London
    Vince Smith
    Workshop 3, pro-iBiosphere, Berlin
    23 May 2013
  • 2. Digital Ambition: NHM Science Strategy 2013-2017
    A New Voyage of Discovery
    Three Focal Areas
      1. Scientific discovery
      2. Scientific infrastructure
      3. Scientific engagement
    Five Challenges
      1. The Digital NHM
      2. Origins, evolution & futures
      3. Biodiversity discovery
      4. Natural resources & hazards
      5. Science, society & skills
    Resources & funding
    Measuring success
  • 3. data.nhm.ac.uk/globe/
  • 4. Digital Ambition: NHM Science Strategy 2013-2017
    A New Voyage of Discovery
    Three Focal Areas
      1. Scientific discovery
      2. Scientific infrastructure
      3. Scientific engagement
    Five Challenges
      1. The Digital NHM
      2. Origins, evolution & futures
      3. Biodiversity discovery
      4. Natural resources & hazards
      5. Science, society & skills
    Resources & funding
    Measuring success
      Scientific impact: 1,000 papers in leading journals
      Digital access: 20M specimens available digitally
      Engagement: 1M face-to-face engagements
      Collections: Globally important collections
      Diagnostic tools: Diagnostic tools for key groups
      Deep time: Timeline of key transitions
      Science & society: Articulate the role of science
      UK network: Act as a national museum
      Earth sciences: Earth Sciences Centre
      Funding: £10M for Five Challenge Areas
  • 5. Overview
    1. Existing digital content, sources & formats
      • Research data
      • Collections data
    2. Making collections data digital
      • Priorities
      • Protocols & pathfinder activities
      • Crowdsourcing transcription
    3. Aggregation & delivery
      • The NHM data portal
      • Data visualisation, data sub-portals
    4. Identifiers, links & interoperability
      • DataCite DOIs
      • Third party aggregators
      • Portal APIs, download & analytical functions
    5. Timeline & constraints
      • Data policies
      • Next steps
    (Side labels: Digitisation activities; Data portal)
  • 6. NHM Research Outputs (1. Existing digital content)
    One month of NHM Science group papers
    • 49 papers, 45 available online (4 print only or behind pay walls)
    • 9 had supplementary data files
    • 39 papers with tables, charts & other data
      o >1000 sequences
      o 826 figures
      o 76 tables
      o 1 genome
    • No collective view of these data (37 journals)
    • No consistent way of citing NHM data
    • No consistent mechanism to access data
    • Effectively invisible at the institutional level
    Data via Carolyn Lowry e-mail, 13th Feb. 2013
  • 7. NHM Collections Outputs: data (1. Existing digital content)
    • Huge investment in NHM collection management system
    • ≠ Imaging
    • Most research projects need spatio-temporal records
    • Different requirements for different purposes
    NHM COLLECTIONS, April 2013
      Collection area   Estimated no. of specimens   No. records in database   % collection in database   % records with location info
      Botany            6,000,000                    626,000                   ~10%                       96%
      Entomology        32,000,000                   316,000                   <1%                        68%
      Mineralogy        500,000                      422,000                   ~95%                       79%
      Palaeontology     9,000,000                    342,000                   ~3%                        89%
      Zoology           28,000,000                   1,131,000                 ~60% (via lots)            69%
      TOTAL             76,000,000                   2,837,000                 3% (23%)
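As a cross-check on the slide 7 table, the sketch below recomputes database coverage from the specimen and record counts shown there. It assumes one database record corresponds to one specimen, which the slide itself suggests is not always true (some records describe lots), so the naive ratios will not match every percentage on the slide.

```python
# (estimated specimens, database records) per collection, copied from the slide 7 table.
COLLECTIONS = {
    "Botany": (6_000_000, 626_000),
    "Entomology": (32_000_000, 316_000),
    "Mineralogy": (500_000, 422_000),
    "Palaeontology": (9_000_000, 342_000),
    "Zoology": (28_000_000, 1_131_000),
}


def coverage_percent(specimens: int, records: int) -> float:
    """Records as a percentage of specimens (assumes one record per specimen)."""
    return 100.0 * records / specimens


total_specimens = sum(s for s, _ in COLLECTIONS.values())
total_records = sum(r for _, r in COLLECTIONS.values())

for name, (specimens, records) in COLLECTIONS.items():
    print(f"{name:<14}{coverage_percent(specimens, records):6.1f}%")
print(f"{'TOTAL':<14}{coverage_percent(total_specimens, total_records):6.1f}%")
```

The record counts sum exactly to the slide's 2,837,000 total, while the specimen estimates sum to 75.5M, which the slide appears to round to 76M.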
  • 8. NHM Collections Outputs: images (1. Existing digital content)
    • Many, many imaging projects (highly fragmented)
    • Circa 40 TB for major collections (excluding library)
    • 120,000 images in KE EMu (many others not in KE!)
    • Circa 250,000 via NHM Photo unit (limited metadata)
      Collection area   No. image files   Disk space
      Botany            140,133           35,302
      Entomology        529,106           3,172
      Mineralogy        14,000            6
      Palaeontology     122,548           993
      Zoology           12,975            1,598
      TOTAL             818,762           41,070
  • 9. Current data formats (1. Existing digital content)
    • Darwin Core Archive (DwCA) & extensions (collections)
    • Circa 2020 fields mapped to 50 fields to generate archive
    • Images mainly JPG & TIFF
    • Metadata using EML & Genesis II standard
    • Research data files in a wide array of formats (blob files)
      o Nexus (character data and Newick formatted phylogenetic trees)
      o PhyloXML (an XML standard for representing phylogenetic trees)
      o NeXML (an XML standard for representing character data)
      o Sequence trace files (.scf sequence chromatogram format files)
      o Environmental sequence files
      o Taxon checklists (as Darwin Core Archive files)
      o Non-NHM specimen lists (as Darwin Core Archive files)
      o Output from the Imaging and Analysis Centre (Micro CT datafile formats)
      o Collections of images from digitisation projects (as a collection of links or a zipped archive)
      o Collection level descriptions
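The slide mentions that a large set of internal fields is mapped down to about 50 fields to generate the Darwin Core Archive. A minimal sketch of that kind of mapping follows. The internal field names and the example record are hypothetical (the talk does not list the KE EMu fields); the target names on the right are standard Darwin Core terms.

```python
from typing import Any, Dict

# Hypothetical internal field names (left) mapped to Darwin Core terms (right).
FIELD_MAP: Dict[str, str] = {
    "registration_number": "catalogNumber",
    "determined_name": "scientificName",
    "collector": "recordedBy",
    "collection_date": "eventDate",
    "site_country": "country",
    "site_latitude": "decimalLatitude",
    "site_longitude": "decimalLongitude",
}


def to_darwin_core(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only mapped, non-empty fields and rename them to Darwin Core terms."""
    return {
        FIELD_MAP[name]: value
        for name, value in record.items()
        if name in FIELD_MAP and value not in (None, "")
    }


# Invented example record, for illustration only.
internal_record = {
    "registration_number": "EXAMPLE-0001",
    "determined_name": "Pieris brassicae",
    "site_latitude": 51.5,
    "site_longitude": -0.18,
    "storage_location": "Drawer 12",  # no mapping, so it is dropped
    "collector": "",                  # empty, so it is dropped
}

print(to_darwin_core(internal_record))
```

Keeping the mapping as plain data rather than code makes it easy to review with curators, which matters when many source fields collapse into a few published ones.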
  • 10. Digitisation Priorities (2. Making collections data digital)
    • Priorities linked to science strategic priorities
      o Disease, sustainability, crop wild relatives, pests etc.
    • Tiered approach, different needs for different collections
    • Low hanging fruit (2D objects e.g. herb. sheets & slides)
  • 11. Digitisation Priorities (2. Making collections data digital)
    • Priorities linked to science strategic priorities
      o Disease, sustainability, crop wild relatives, pests etc.
    • Tiered approach, different needs for different collections
    • Low hanging fruit (2D objects e.g. herb. sheets & slides)
    • Linked to strategic collaborations & financial opportunities
      o e.g. RBG Kew, RBG Edinburgh, Nat. Mus. Wales, Hunterian etc.
    • Priorities dictate order – we plan to do it all (eventually)!
  • 12. Digitisation Protocols (2. Making collections data digital)
    • Exercise to develop digitisation protocols across collection
      o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
    • Protocols mapped to high level collections descriptions
    • Workflow software supporting rapid digitisation (to KE & DAMS)
  • 13. Digitisation Protocols (2. Making collections data digital)
    • Exercise to develop digitisation protocols across collection
      o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
    • Protocols mapped to high level collections descriptions
    • Workflow software supporting rapid digitisation (to KE & DAMS)
    • Pathfinder activities for less well understood projects
      o Entomological dry material (30 M specimens)
        - iCollections (specimen-by-specimen) approach
        - SatScan (drawer level multi-specimen) approach
  • 14. iCollections Initiative (2. Making collections data digital)
    • Specimen-by-specimen, traditional, dedicated 6 person team
    • Digitising British Isles Lepidoptera collection
    • ~500,000 specimens, 5,000 drawers
    • Re-curation & specimen imaging
    • Complete label information including georeferencing
    • For use in Climate Change initiative
  • 15. iCollections Initiative (2. Making collections data digital)
    • 4-6 people over 3 years, work broken into small tasks by teams
    • Average imaging rate 163 specimens per person per day
    • Averaging >3 min per specimen (prep., imaging & databasing)
    • >£1/specimen
    • BUT: 6,800 person years for the entire collection
  • 16. SatScan Initiative (2. Making collections data digital)
    • Drawer level digitisation, segmented down to specimens
    • Very fast imaging, no specimen handling, just one view
    • No label information, but some data extracted from drawer
    • Specimens retrospectively cropped & annotated
  • 17. SatScan Initiative (2. Making collections data digital)
    • Drawer level digitisation, segmented down to specimens
    • Very fast imaging, no specimen handling, just one view
    • No label information, but some data extracted from drawer
    • Specimens retrospectively cropped & annotated
  • 18. SatScan Initiative (2. Making collections data digital)
    • Dedicated specimen-level rapid annotation software
  • 19. Crowdsourcing & Transcription (2. Making collections data digital)
    • We have a massive transcription problem
    • Experiments via Notes-from-Nature (a Zooniverse project)
    • Transcribing the NHM ornithological accession registers
    • Wikimedian in Residence (Wikisource transcription)
    • 4 month project, including specimen label transcription
  • 20. NHM Data Portal (3. Aggregation & Delivery)
    data.nhm.ac.uk
    • A focus for deposition and discovery of major NHM data sets
    • Promote innovation through re-use of museum data
    • Open Access, at a dedicated subdomain of the NHM website
    • Started Jan. 2013 (3 years), consultation throughout 2012
    [Figure: Functional components of the data portal]
  • 21. NHM Data Portal: Registry (3. Aggregation & Delivery)
    • Dataset registry, for dataset discovery, modeled on data.gov.uk
    • Uses CKAN, an open-source data portal software platform
    [Screenshot labels: Search; Datasets matching criteria; Individual dataset; Results; Browse & search criteria; Advanced display options]
  • 22. NHM Data Portal: Registry (3. Aggregation & Delivery)
    • Dataset metadata discovery
    [Screenshot labels: Metadata about the dataset; Name; Geographic scope; Tags; “Social”; Authors; License; Download; Developer tools; Technical info. (extracted from data file)]
  • 23. NHM Data Portal: Dataset upload (3. Aggregation & Delivery)
    • Simple dataset upload workflow for non-collections data
      1. Name the dataset
      2. Upload / link the data file
      3. Describe the data file
      4. Theme & tag
      5. Add additional resources
      6. Temporal coverage
      7. Geographic coverage
      8. Save & finish
  • 24. NHM Data Portal: Data visualisation (3. Aggregation & Delivery)
    • Dedicated interface to visualise & explore major datasets
    • Focused on collections data, based on Canadensys.net, uses CartoDB
    [Screenshot labels: Zoomable map; Applied filters; Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records]
  • 25. NHM Data Portal: Data visualisation (3. Aggregation & Delivery)
    [Screenshot labels: Collections views; Statistical summary; Specimen record views; Data field mappings; Summary preview; Full record; Tables; Download]
  • 26. NHM Data Portal & DataCite (4. Identifiers, links & interoperability)
    • Using DataCite DOIs in the data portal
    • Datasets (2014) & specimens (2015)
    • Unique, persistent and resolvable identifiers
    • Easy to cite, alias existing specimen identifiers
    • Conform to minimum DataCite requirements
    • Landing page, min. metadata standard, fee, min. 10 yr. contract, DOI (pre)fixes
    Breaks us out of the biodiversity data silo
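To illustrate the idea on slide 26 that a DOI can alias an existing specimen identifier, here is a minimal sketch of composing a DOI string and its resolver URL. The prefix `10.5072` is only a placeholder and not an NHM prefix, the specimen identifier is invented, and the function names are hypothetical. Minting a real DOI requires registration with DataCite, which this sketch does not attempt.

```python
import re
from urllib.parse import quote

# General DOI shape: "10.<registrant code>/<suffix>". This is a loose check only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Placeholder prefix for illustration; a real one is assigned on registration.
PLACEHOLDER_PREFIX = "10.5072"


def make_doi(prefix: str, local_id: str) -> str:
    """Build a DOI string whose suffix is an existing local identifier."""
    doi = f"{prefix}/{local_id.strip().lower()}"
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return doi


def resolver_url(doi: str) -> str:
    """URL at which the doi.org resolver would redirect to the landing page."""
    return "https://doi.org/" + quote(doi, safe="/")


doi = make_doi(PLACEHOLDER_PREFIX, "NHM-SPECIMEN-0001")  # invented identifier
print(doi)
print(resolver_url(doi))
```

Lower-casing the suffix is a design choice rather than a requirement: DOI names are case-insensitive, so normalising avoids two spellings of the same identifier.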
  • 27. Data Aggregation, APIs & download (4. Identifiers, links & interoperability)
    • Content within the NHM data portal will be highly accessible
      o Collections harvestable (e.g. by GBIF as a DwCA)
      o Download DwCAs on any search facet
      o Wide set of APIs available for datasets (part of CKAN)
    • Sub-portals (selected content, themed by topic)
      o e.g. Virtual Herbarium, NHM Science initiatives, geographic regions
    • Analytical interface planned for 2015 (but not specified)
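Since slide 27 says the dataset APIs come with CKAN, a dataset search would plausibly go through CKAN's Action API. The sketch below only builds the request URL for CKAN's `package_search` action and makes no network call. That the portal exposes the API at the default `/api/3/action` path on data.nhm.ac.uk is an assumption, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Host from the slides; the API path is CKAN's default and assumed here.
PORTAL = "https://data.nhm.ac.uk"
ACTION_PATH = "/api/3/action"


def build_search_url(query: str, rows: int = 10) -> str:
    """URL for a CKAN package_search request (free-text query, result limit)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{PORTAL}{ACTION_PATH}/package_search?{params}"


print(build_search_url("lepidoptera", rows=5))
print(build_search_url("crop wild relatives"))
```

A client would fetch such a URL and read the JSON response, where CKAN reports success in a `success` field and the matching datasets under `result`.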
  • 28. Data Policies & Next Steps (5. Timeline & constraints)
    • Data portal will be “open-by-default”
    • Ambiguity in what this means & top down schizophrenia
    • Conflicting mandates on open access & revenue opportunities
    • Lots of guidance available, will use to form a common policy
    • A cross institutional policy would be useful (but challenging)
  • 29. Data Policies & Next Steps: NHM Data portal timeline (5. Timeline & constraints)
    [Timeline labels: Jan 2013; Jan 2014; Jan 2015; Jan 2016; Requirements & dataset discovery; Private alpha; Stable public beta; Full release & sub-portals; Internal feedback, data visualisation & DOIs; Subportals & analytical tools; Project start]
    Next 6 months
    • More documentation (PID and Tech Spec)
    • Consultation and advocacy (internal and external)
    • Data mapping from KE EMu and software testing
    • Development
      o Website wireframe design
      o Drafting data visualisation subcontract
      o Construction of private alpha release
  • 30. Data Policies & Next Steps: NHM digitisation timeline (5. Timeline & constraints)
    [Timeline labels: Jan 2013; 2014; 2015; 2016; 2017; 2018; Path-finding & Programme development; Private alpha; Stable public beta; Major funding applications & a new gallery?; Digitise… Digitise… Digitise…; 20 Million!!; Project start]
    Next 6 months
    • Initial conclusions from path-finding digitisation activities
    • Initial grant funding bids developed
    • Advocacy, outreach & development of a digitisation “programme”
    • Investigate possibilities for gallery development
    • Develop crowdsourcing strategy
  • 31. QUESTIONS
  • 32. Digitisation Priorities (2. Making collections data digital)
    • Priorities linked to science strategic priorities
      o Disease, sustainability, crop wild relatives, pests etc.
    [Chart: Crop Wild Relatives (accepted taxa only); axis from 0 to 700]
  • 33. Digitisation Priorities (2. Making collections data digital)
    • Priorities linked to science strategic priorities
      o Disease, sustainability, crop wild relatives, pests etc.
    • Tiered approach, different needs for different collections
    Nick Poole, UK Collections Trust