The document discusses plans to create a new data portal at data.nhm.ac.uk to address issues with finding, accessing, citing, and integrating research data and collection data from the Natural History Museum. It will provide a central access point, allow for integrated search and browse of datasets, and enable users to download, export, and analyze data. The portal will follow an open by default approach and be populated by museum staff. Development will occur over three years with initial focus on discovery of research datasets and collections data, followed by improved visualization and citation of data.
2. The problem – research data
Hard to find, access, cite and integrate
49 NHM science group papers in last 4 weeks
• 45 available online (4 print only or behind paywalls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
o >1000 sequences
o 826 figures
o 76 tables
o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No mechanism to integrate or version
• No way to repurpose data (retyping?)
Data via Carolyn Lowry e-mail, 13th Feb. 2013
4. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
7. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
•6 different data collections
•23 interfaces & datasets of varying importance
•No priority to collection datasets
119 specimens up to 28,000,000 specimens
11. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
•6 different collections
•23 interfaces & datasets of varying importance
•No priority to collection datasets
•Entomology collections don’t exist (404)
•Library doesn’t have any online collections!
Bigger issues
•Idiosyncratic browse or search
•No maps, few images & very slow
•No summary or statistics
•No download, export or custom views
•No integration with other data
•No author info or update info
•No means of specimen citation
•No exports to GBIF or associated projects
The data portal must correct these issues
12. The solution – data.nhm.ac.uk portal
High level issues
Functional requirements
•A central access point for NHM research & collections data
•The capacity to store/link and describe datasets
•Integrated search & browse of datasets
•The ability to cite datasets and specimen records in data sets
•The ability to integrate collections data
•Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
•The capacity to download, export & analyse data
Principles
•Open-by-default: Capacity for embargoed and private data
•Sustainable: Self-populated by NHM staff (except collections data)
Exclusions
•Not a replacement for DAMS or KeEMu (a Web interface for these systems)
•Publications out of scope (focused on data sets)
•All annotations on data link back to the source (e.g. KeEMu)
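The open-by-default principle, with its capacity for embargoed and private data, could be modelled roughly as below. This is only a sketch: the `Visibility` enum, the `Dataset` class and the `is_publicly_visible` function are hypothetical names introduced for illustration, not part of the portal specification.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    """Hypothetical visibility states for a dataset."""
    OPEN = "open"
    EMBARGOED = "embargoed"
    PRIVATE = "private"


@dataclass
class Dataset:
    """Hypothetical dataset record; open unless stated otherwise."""
    title: str
    visibility: Visibility = Visibility.OPEN
    embargo_until: Optional[date] = None


def is_publicly_visible(ds: Dataset, today: date) -> bool:
    """Return True if the dataset should be shown to anonymous users."""
    if ds.visibility is Visibility.PRIVATE:
        return False
    if ds.visibility is Visibility.EMBARGOED:
        # An embargoed dataset opens once its embargo date has been reached.
        return ds.embargo_until is not None and today >= ds.embargo_until
    return True
```

A dataset created without any explicit setting is visible, which is the behaviour the "open-by-default" wording suggests.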
13. The solution – data.nhm.ac.uk portal
System Overview
•Scope (source data): KeEMu (NHM); HerbCat (Kew); other datasets etc. (species dictionary, initiatives, Scratchpads etc.); user-contributed datasets
•File types (formats): DwC-A, PhyloXML, neXML, Nexus, Excel, CSV, other
•Registry (discovery & download): NHM specimens, Kew specimens, private
•Subportals (branded slices of data): Subportal 1, e.g. Disease initiative; Subportal 2, e.g. Kew / NHM Virtual Herbarium
•Explorer: map view, table view, statistics view, analytic view (R)
14. Portal overview – adding data sets
Quick & easy, semi-automated workflow
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
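The eight workflow steps map naturally onto a single metadata record that is filled in step by step and checked before saving. The sketch below is illustrative only: `DatasetRecord`, its field names and `missing_steps` are assumptions made for this example and are not taken from the portal specification.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical record with one field (or pair of fields) per workflow step."""
    name: str                                              # step 1
    data_file: str                                         # step 2: uploaded path or link
    description: str = ""                                  # step 3
    themes: List[str] = field(default_factory=list)        # step 4
    tags: List[str] = field(default_factory=list)          # step 4
    resources: List[str] = field(default_factory=list)     # step 5 (optional)
    temporal_coverage: Optional[Tuple[date, date]] = None  # step 6 (start, end)
    # step 7: bounding box as (west, south, east, north) in decimal degrees
    bounding_box: Optional[Tuple[float, float, float, float]] = None

    def missing_steps(self) -> List[str]:
        """List the steps still empty, to be checked before step 8 (save)."""
        missing = []
        if not self.name.strip():
            missing.append("name")
        if not self.data_file.strip():
            missing.append("data_file")
        if not self.description.strip():
            missing.append("description")
        if not (self.themes or self.tags):
            missing.append("theme_or_tag")
        if self.temporal_coverage is None:
            missing.append("temporal_coverage")
        if self.bounding_box is None:
            missing.append("geographic_coverage")
        return missing


# Example: only steps 1 and 2 have been completed so far.
record = DatasetRecord(name="Example dataset", data_file="https://example.org/data.csv")
```

In a semi-automated workflow, some of these fields (for example the coverage fields) might be pre-filled from the uploaded file rather than typed by hand.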
16. Portal overview – data set display
Exploring research data sets
Screenshot callouts: Name; Authors; License; Tags; Download; Metadata about the dataset; Technical info. (extracted from data file); Geographic scope; “Social” tools; Developer tools
17. Portal overview – collections data
Main interface
Screenshot callouts: Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records; Zoomable map; Applied filters
18. Portal overview – collections data
Additional interfaces
Collections views: Tables; Statistical summary; Summary; Download preview
Specimen record views: Full record; Data field mappings
19. Portal overview
Some example data portals & software
Data.gov.uk & CKAN
•UK government data portal
•Uses CKAN, open-source data portal platform
•Used by national & regional governments
•Links into Drupal, DataCite & NHM systems
•http://data.gov.uk & http://ckan.org/
Canadensys & CartoDB
•Canadian network of biodiversity collections
•Almost 1 million specimens, 18 datasets
•Uses CartoDB mapping solution
•Create dynamic maps, analyze and build location-aware and geospatial applications
•Widely used, cloud data storage, PostGIS
•http://data.canadensys.net & http://cartodb.com/
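Because CKAN exposes its catalogue through an HTTP "Action API", dataset discovery on a CKAN-based portal can be scripted. The sketch below only builds a search URL for CKAN's `package_search` action and sends no request. It assumes the portal serves the standard CKAN API under its base URL; the helper name `ckan_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def ckan_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API package_search URL (no request is sent).

    Assumes the standard CKAN API path /api/3/action/ is available
    under base_url, which may not hold for every deployment.
    """
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Example: search for herbarium datasets, five results at most.
url = ckan_search_url("http://data.nhm.ac.uk", "herbarium specimens", rows=5)
```

The resulting URL could be fetched with any HTTP client; CKAN returns JSON describing the matching datasets.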
20. Portal development
Timeline & resources
Year 1 – Dataset discovery
•Technical & functional specification (Vizz. subcontract)
•Data workflows (KeEMu & research datasets)
•Functional alpha prototype (CKAN)
Year 2 – Visualisation
•Mapping & statistical functionality (CartoDB)
•Social and annotation functions
•Stable beta release at http://data.nhm.ac.uk
Year 3 – Citation & analysis
•DataCite DOIs on datasets & specimens
•Initial Web analytical functions (R)
•Initiative sub-portals including Virt. Herbarium
Resources
•1x Developer (Ben Scott) for 3 years
•Vizzuality subcontract (circa £xxk - TBC)
•ICT capital, travel & software (circa £25k)
21. Portal consultation
Feedback & next steps
Documentation
•Overview specification - http://goo.gl/qjioh
•Project Initiation Document - http://goo.gl/oRr2j
Initial stakeholder meetings (Feb. – May)
•ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
•Darrell Siebert and the KE EMu user group
•NHM Collections Committee & Initiative leaders
•Kew Gardens & Virtual Herbarium Reps.
•GBIF, NBN, UK DataCite team at BL, NERC
•Digital Facility Team
•Vizzuality
Wider consultation
•Example data types / sets
•Specialist search options & vocabularies
•Specialist Earth Science needs
FEEDBACK & LINKS
•Slides:
•Feedback: vince+portal@vsmith.info
•Specification: http://goo.gl/qjioh
•PID: http://goo.gl/oRr2j
22. Two more things
Wikipedian in Residence
•Four-month post with Science Museum
•Starting March / April
•Work with NHM staff to improve Wikipedia
•Run events with NHM staff & volunteers
•Work with the GLAM group at Imperial College
•Focus on NHM science themes & specimens
•Not about promotion of “The NHM”
Biodiversity Informatics Workshop – May 2013
•One full day - date TBC
•Outputs from ViBRANT & e-Monocot
•Includes Scratchpads & the Biodiversity Data Journal
•What we do, how it's used and where we are going
•Includes links to NHM informatics & digitisation initiatives
23. Portal overview – data citation
Unique identifiers for datasets & specimen records
Why cite data
•URLs are not persistent
•e.g. Wren JD. URL decay in MEDLINE: a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
•Measure our digital footprint
•Puts research data on par with articles
•Facilitates data mining
What gets an identifier
•NHM specimen records (suffix of NHM IDs), e.g. http://dx.doi.org/BMNH_PBI_00388325
•NHM research datasets (files)
•Insert into publications
How to cite data
•Digital Object Identifiers (DOIs)
•Widely used & understood on articles
•Operates in collaboration with DataCite
•Part of an International consortium
•Mixes NHM data with other domains
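A DOI consists of a registrant prefix (beginning "10.") and a suffix chosen by the registrant, so using the NHM specimen ID as the suffix could look like the sketch below. The prefix `10.5555` is a placeholder and not the NHM's real prefix; the function names are hypothetical. The specimen identifier is the example shown on the slide.

```python
import re

DOI_RESOLVER = "http://dx.doi.org/"  # resolver shown on the slide
HYPOTHETICAL_PREFIX = "10.5555"      # placeholder prefix, NOT the NHM's


def specimen_doi(specimen_id: str, prefix: str = HYPOTHETICAL_PREFIX) -> str:
    """Combine a DOI prefix with a specimen ID used as the DOI suffix."""
    suffix = specimen_id.strip()
    # Keep the sketch conservative: letters, digits, '_', '.', '-' only.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", suffix):
        raise ValueError(f"unsupported characters in specimen ID: {specimen_id!r}")
    return f"{prefix}/{suffix}"


def doi_url(doi: str) -> str:
    """Return the resolver URL that could be inserted into a publication."""
    return DOI_RESOLVER + doi


doi = specimen_doi("BMNH_PBI_00388325")
link = doi_url(doi)
```

Unlike a plain URL, the resolver link keeps working if the record moves, provided the DOI's registered target is updated.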